STEM I

STEM I is taught by Dr. Crowthers. Students in this course complete a six-month-long independent research project that involves reading literature, forming hypotheses, designing and conducting experiments, and communicating their results. The project culminates in the school-wide February STEM Fair, where students present their work to judges working in related industries.

Using Immune Footprints in a Novel Deep Learning Model to Detect Human Diseases

Patent Pending

U.S. Provisional No. 63/439,655

This project involves the design of a novel disease diagnosis method that combines a patient's genetically sequenced antibodies with machine learning. When a patient has their blood drawn, their antibodies can be digitally catalogued as amino acid sequences. Machine learning models can then analyze these sequences to determine whether any of the patient's antibodies resemble previously known antibodies associated with a disease. This could allow rapid, simultaneous detection of multiple diseases.

Abstract

Graphical Abstract

Wherever humans have traveled throughout history, diseases have followed. Fortunately, humans have developed a robust defense mechanism known as the immune system, in which antibodies are a key player. To be effective, antibodies must bind to the surfaces of foreign pathogens with highly variable shapes, and they are able to do so through recombination processes that make each antibody unique. This uniqueness allows a correlation to be established between an antibody and its corresponding disease. While there have been previous attempts to correlate antibodies with diseases using feature-based machine learning, the direct use of amino acid sequences in a deep learning model remains to be explored. Here, we propose a language modeling-based approach for classifying disease-specific antibodies against a healthy control set. Using the pre-trained ProtBERT-BFD model from Rostlab, we generated an embedding vector of 1024 values for each amino acid in an antibody sequence. These vectors were then averaged across all amino acids to obtain a single “sentence-embedding vector” that was passed to a feedforward neural network of progressively smaller layers. Lastly, the neural network returned a value between 0 and 1 representing likeness to a healthy or disease-specific antibody. Such binary models were built for COVID-19, HIV, and CLL, achieving accuracies of 91.85%, 93.92%, and 97.26%, respectively. These models can then be combined to allow simultaneous, multi-disease diagnosis with the potential to support hundreds of diseases, which could have significant implications for future disease testing.
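The averaging and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the project's actual code: the function names, the toy 4-value embeddings (standing in for the 1024-value ProtBERT-BFD vectors), the ReLU hidden activation, and the untrained weights are all assumptions made for illustration. Which end of the 0 to 1 range means "healthy" is also left unspecified here.

```python
import math

def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length vector."""
    count = len(residue_embeddings)
    width = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / count for i in range(width)]

def dense(inputs, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def predict_score(residue_embeddings, layers):
    """Return a score in (0, 1) for one antibody sequence.

    `layers` is a list of (weights, biases) pairs; the last pair must
    have exactly one output unit.
    """
    activations = mean_pool(residue_embeddings)
    for weights, biases in layers[:-1]:
        # ReLU hidden activation (an assumption; the source does not name one).
        activations = [max(0.0, v) for v in dense(activations, weights, biases)]
    final_weights, final_biases = layers[-1]
    logit = dense(activations, final_weights, final_biases)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid squashes to (0, 1)

# Toy input: 3 residues with 4-value embeddings (the real model uses 1024).
toy_sequence = [
    [0.2, 0.4, 0.0, 1.0],
    [0.4, 0.0, 0.6, 1.0],
    [0.6, 0.2, 0.0, 1.0],
]
# Hypothetical, untrained weights: 4 inputs -> 2 hidden units -> 1 output.
toy_layers = [
    ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
score = predict_score(toy_sequence, toy_layers)
print(round(score, 3))
```

Averaging gives every antibody a vector of the same length regardless of how many amino acids it contains, which is what lets a fixed-size network accept sequences of different lengths.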

Research Proposal

Click here to access supporting project documents

Problem Statement

Many diseases remain difficult, expensive, or slow to diagnose (Sujena et al., 2022). However, the immune system naturally carries disease “footprints” (National Cancer Institute, 2021) in the form of antibody sequences. With recent advancements in next-generation sequencing and deep learning, these “footprints” can be sequenced and analyzed to provide diagnostic information at a scale never seen before.

Project Goal

To design a deep learning model capable of predicting Chronic Lymphocytic Leukemia (CLL), COVID-19, and other diseases from the antibody sequences in a patient’s peripheral blood cells, which serve as genetic “footprints” in the immune system.

Background

Background Infographic

The immune system defends against potential threats by employing coordinated responses and specialized cells throughout the body. One such cell is the B cell, a type of white blood cell critical in defending against viral or bacterial threats, also known as pathogens. More specifically, B cells secrete special proteins called antibodies, which in turn bind to antigens, the molecules on the surface of a pathogen (Henochowicz, 2022). However, since each pathogen differs in shape and size, their antigens differ as well. Thus, antibody binding regions must be easily mutable in order to fit the unique antigens of every conceivable disease (Janeway et al., 2001). Because each antibody binds to only one specific antigen, and each antigen can be correlated with a specific pathogen (Janeway et al., 2001), a correspondence can be established between a given antibody and the pathogen it targets. Moreover, a given B cell can only synthesize one antibody variant out of the ten million possible combinations, with instructions for the specific variant delivered by helper T cells, another type of immune cell (Alberts et al., 2002). Therefore, the genetic code (DNA) obtained from a B cell can serve as a unique genetic “footprint” for its secreted antibodies. This raises the question: can an antibody footprint be used to identify its target disease? With recent advancements in deep sequencing and deep learning, it has now become possible to put these footprints to use.

Procedure

Methods Diagram

To start, antibody amino acid sequences associated with various diseases were collected from publications and online data sources such as GenBank. These antibodies were then processed with the pre-trained ProtBERT-BFD model from Rostlab, which outputs an embedding vector of 1024 numbers for each amino acid in the sequence (Elnaggar et al., 2022). A feedforward neural network was then trained on these embeddings to learn the patterns that distinguish the antibody embeddings of different diseases. Lastly, the model was evaluated on an unseen testing dataset. For each antibody sequence, the model returned one probability per possible disease, representing the sequence's likeness to that disease. The disease with the highest probability served as the final prediction, and this process was repeated for each sequence in the testing dataset.
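The final prediction step can be sketched as follows. This is only an illustration of picking the highest-probability disease and scoring a test set; the function names and the toy probabilities are hypothetical and are not taken from the project's data.

```python
def predict_disease(probabilities):
    """Return the label with the highest probability."""
    return max(probabilities, key=probabilities.get)

def evaluate(test_set):
    """Return accuracy over a list of (true_label, probabilities) pairs."""
    correct = sum(
        1 for true_label, probabilities in test_set
        if predict_disease(probabilities) == true_label
    )
    return correct / len(test_set)

# Made-up model outputs for four antibody sequences (illustration only).
toy_test_set = [
    ("COVID-19", {"COVID-19": 0.80, "HIV": 0.15, "CLL": 0.05}),
    ("HIV",      {"COVID-19": 0.10, "HIV": 0.70, "CLL": 0.20}),
    ("CLL",      {"COVID-19": 0.30, "HIV": 0.10, "CLL": 0.60}),
    ("CLL",      {"COVID-19": 0.50, "HIV": 0.10, "CLL": 0.40}),  # misclassified
]

for true_label, probabilities in toy_test_set:
    print(true_label, "->", predict_disease(probabilities))
print("accuracy:", evaluate(toy_test_set))
```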

Figure 1: Multi-Disease Model Confusion Matrix

Multi-Disease Model Confusion Matrix

Displayed here are the ensemble model’s predictions versus the true labels on an unseen testing dataset. The vertical axis represents true labels, while the horizontal axis represents predicted labels. Each cell of the matrix therefore counts the sequences with one particular combination of true and predicted label. The main diagonal (top-left to bottom-right), where true and predicted labels match, contains the correct predictions.
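For readers unfamiliar with the layout, a confusion matrix of this kind can be built as in the sketch below, with rows for true labels and columns for predicted labels. The labels and predictions are invented for illustration and do not reproduce the figure's values.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Build a matrix where rows are true labels and columns are predictions."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions sit on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Invented example labels (not the project's results).
classes = ["COVID-19", "HIV", "CLL"]
true_labels = ["COVID-19", "COVID-19", "HIV", "HIV", "CLL", "CLL"]
predicted_labels = ["COVID-19", "HIV", "HIV", "HIV", "CLL", "COVID-19"]

matrix = confusion_matrix(true_labels, predicted_labels, classes)
for label, row in zip(classes, matrix):
    print(f"{label:>8}: {row}")
print("accuracy:", accuracy_from_matrix(matrix))
```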

Figure 2: Diagnosis Confidence Levels

Confidence Levels Graph

Based on the number of antibodies predicted for a disease, the binomial distribution can be used to formulate a diagnosis confidence level for the patient as a whole.
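The page does not spell out the exact calculation, so the following is one possible formulation rather than the project's method. It assumes each antibody from a healthy patient is independently flagged as disease-like with a fixed false-positive rate, and asks how unlikely the observed number of flagged antibodies would be under that assumption. The function names and the example numbers are hypothetical.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def diagnosis_confidence(n, k, false_positive_rate):
    """Confidence that k flagged antibodies out of n are not just false positives.

    Assumption: under the 'healthy' hypothesis, each antibody is flagged
    independently with probability `false_positive_rate`.
    """
    return 1.0 - binomial_tail(n, k, false_positive_rate)

# Hypothetical patient: 100 antibodies sequenced, 5% false-positive rate.
for flagged in (3, 6, 12):
    confidence = diagnosis_confidence(100, flagged, 0.05)
    print(f"{flagged} flagged -> confidence {confidence:.4f}")
```

Under these assumptions, the confidence rises as more of a patient's antibodies are predicted to belong to the disease, since chance alone becomes a less plausible explanation.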

Figure 3: Dengue vs. Healthy Embeddings

Dengue Embeddings Graph

For both the Dengue and healthy control antibodies, embeddings were generated using the ProtBERT-BFD model and projected into a two-dimensional space, as shown here.

Figure 4: COVID-19 vs. Healthy Embeddings

COVID-19 Embeddings Graph

For both the COVID-19 and healthy control antibodies, embeddings were generated using the ProtBERT-BFD model and projected into a two-dimensional space, as shown here.

Analysis

Discussion/Conclusion

References

February Fair Poster