STEM I

~ Dr. Crowthers

STEM I with Science and Technical Writing encompasses the six-month-long independent project I conducted related to organ rejection. In this course, I learned the steps to brainstorm ideas, develop solutions, and test for results. Specifically, I learned these skills through my computer science project that aims to predict organ rejection and find precise targets for immunosuppression. Scroll down to read about the independent research project I have completed through this course.


Abstract

Organ rejection is a dangerous medical complication that can occur after an organ transplant. Currently, all transplant patients are prescribed life-long immunosuppressors to decrease the risk of organ rejection. However, these medications can increase the susceptibility to other infections and cancers. Human leukocyte antigen (HLA) mismatches between donors and recipients can initiate T-cell activation, which is known to be the primary mediator of organ rejection. However, HLA genes are very polymorphic, and classifying “whole” HLA mismatches does not account for the minor amino acid differences that can start rejection. One solution is to create a machine-learning model that can analyze donor and recipient HLA sequences to predict MHC-peptide complexes, which are the molecules that T-cells recognize to start an immune response. This information can be used to predict rejection and find precise targets for immunosuppression. The project used datasets with MHC class I-peptide binding information to analyze donor and recipient HLA sequences. The result is that the model can accurately predict MHC-peptide complexes and rejection targets, with an R^2 value of 0.723. In conclusion, focusing on MHC-peptide presentation can account for HLA polymorphism and is more accurate in predicting organ rejection. Additionally, this data can be used to administer personalized and targeted immunosuppressors or decrease the need for broad immunosuppressors altogether. In the future, a similar model can be developed to predict antibody-mediated rejection (AMR) using MHC-class II datasets and be modified to support other organ transplants.

Keywords: Organ rejection, immune system, cytokines, T cells, peptides, machine learning

Graphical Abstract

[Graphical abstract image]


Problem Statement

Chronic organ rejection affects up to 50% of kidney transplants by five years post-transplant. Because chronic rejection develops over a long period, there are limited methods to diagnose and treat it. Even though Human Leukocyte Antigen (HLA) mismatches can cause rejection, HLA genes are very polymorphic, and classifying “whole” HLA mismatches does not account for the allele differences that can start rejection.

Engineering Objective

The objective is to make a machine learning model that, given donor and recipient HLA sequences, can predict rejection and identify the specific targets that are likely to cause it. The model will work by predicting the MHC-peptide complex on the donor organ by focusing on the specific HLA allele mismatches. Ideally, this model will use the mismatches to provide information on targets for personalized immunosuppression.

Background Infographic

[Background infographic image]

Background

Organ transplants are among the greatest advances in modern medicine, saving tens of thousands of lives every year. By increasing life expectancies and improving the quality of life, they remain the best therapy for terminal and irreversible organ failure (Grinyó, 2013). However, there is currently a major problem in the organ transplant industry: the demand is vastly greater than the supply. Due to a lack of organ donations, about seventeen people die each day while waiting for an organ transplant (Organ, Eye and Tissue Donation Statistics, n.d.). The immense demand emphasizes that every donated organ has the potential to change lives, and it is crucial to maintain the long-term health of each organ, for the sake of the patient and the organ as well.

Overview of Organ Rejection

Even if a patient is successful in receiving an organ transplant, many medical complications may occur after the transplant, the most common being organ rejection. The immune system is a body system that destroys foreign cells to protect the body from harm. In the case of organ rejection, the immune system recognizes the transplanted organ as foreign and attempts to attack it by producing cells or antibodies that invade the organ (Understanding Transplant Rejection | Stony Brook Medicine, n.d.). Currently, all transplant patients are prescribed immunosuppressors to decrease the risk of organ rejection. However, recipients must take immunosuppressive drugs for their entire lives for their bodies to accept a donated organ. While these medications prevent organ rejection to an extent, they can severely weaken the immune system, increasing the risk of cancer, infections, and other diseases (Kelly, 2022). Additionally, immunosuppressors are not as effective in decreasing the risk of chronic rejection, which is often irreversible and can lead to graft failure or death (Hunt & Saab, 2012). By five years post-transplant, chronic rejection affects up to 50% of kidney transplants (Gautreaux, 2017). Since chronic rejection is often asymptomatic and occurs over an extended period, the common treatment method is to increase the dosage of immunosuppressive drugs, which can exacerbate the dangerous side effects. New treatments are necessary to prevent organ rejection without using broad immunosuppressors that weaken the entire immune system.

MHC-Peptide Presentation

Early chronic organ rejection is primarily caused by T-cell-mediated rejection (Chong, 2020). T-cells are a type of immune cell that plays a crucial role in identifying and eliminating foreign cells. When T-cells interpret donated organ cells as foreign, it can lead to T-cell activation and an attack on the transplanted organ. MHC-peptide presentation plays a vital role in T-cell activation, and understanding it can lead to strategies to prevent transplant rejection. The major histocompatibility complex (MHC) is a group of genes that code for MHC molecules found on the surface of cells. These molecules play a vital role in the immune system’s ability to distinguish between “self” and “non-self” (General, Non-Specific Defenses Against Infection, n.d.). There are two main types of MHC molecules: MHC class I and MHC class II molecules. While MHC class I molecules are found on all nucleated cells, MHC class II molecules are only present on antigen-presenting cells (Lakna, 2018). Nonetheless, the main function of all MHC molecules is to bind peptide fragments derived from pathogens (or donor cells) and display them on the cell surface for recognition by the appropriate T-cells (Hewitt, 2003). If a T-cell receptor (TCR) recognizes a peptide from the transplanted organ on an MHC molecule, the T-cell activates, starting the immune response against the transplanted organ.

Indirect Allorecognition

Antigen presentation can occur through direct or indirect pathways. However, chronic rejection is primarily mediated by the indirect pathway (Siu et al., 2018). As donor organ cells die and are replenished, the damaged donor cells shed MHC molecules. The MHC molecules are taken up by the recipient antigen-presenting cells (APCs), which break down donor MHC molecules into smaller peptide fragments (Mak et al., 2014). These peptides are loaded onto recipient MHC class II molecules and are presented on the surface of recipient APCs (SITNFlash, 2012). If there is a significant mismatch between the peptides displayed and the recipient’s MHC molecules, naïve T-cells may recognize the peptide complex displayed on APCs as foreign, starting an immune attack against the donor organ (Mak et al., 2014).

Tissue Typing and Immune Profiling

When looking for organ matches, doctors perform Human Leukocyte Antigen (HLA) typing to understand the similarity in antigens between the donor and the recipient. The HLA is a group of genes that provide instructions to make antigens present on the surface of cells (Manski et al., 2019). Six specific HLAs are looked for, and a higher similarity makes an organ match more likely (Matching and Compatibility | Transplant Center | UC Davis Health, n.d.). However, HLA genes are the most polymorphic genes in the human genome. This means that HLAs have many different allele combinations, and their variant alleles have high degrees of sequence similarity. The differences between such similar alleles can be difficult to resolve with current serological and low-resolution tests (Dasgupta, 2016). Therefore, understanding the exact differences in HLAs between the donor and recipient can result in a better treatment method that is personalized and accurate for the recipient.

Benefits of Machine Learning

Machine learning is a subset of artificial intelligence that uses statistical techniques that allow computer systems to automatically learn and develop from experience without being explicitly programmed (Costa, 2019). Previous studies have employed machine learning techniques to sift through massive datasets of gene expression data. Machine learning algorithms can analyze data to identify patterns and establish relationships from complex datasets. For this project, machine learning would allow HLA sequence data to be used to make a prediction model. By training the model on datasets of HLA sequences and peptide binding affinities, the algorithm can predict these complexes with high accuracy, paving the way for personalized and targeted immunosuppression. There have been many studies that employ machine learning to predict organ rejection. However, those models focus on “whole” HLA mismatches, which do not account for HLA polymorphism or the peptide sequences. Therefore, by focusing on HLA sequences and peptides, a more accurate and robust model can be created to prevent organ rejection. This way, we can protect the patient and the organ from harm.

Procedure Infographic

[Procedure infographic image]

Procedure

Data Collection and Preprocessing:

HLA Protein Sequences. The Immuno-Polymorphism Database (IPD-IMGT/HLA) version 3.55.0 from the European Bioinformatics Institute (EBI) was accessed through the database’s public FTP site hosted by the EBI. The database provides a central repository for sequences of HLA alleles, including the protein sequences in the FASTA format. HLA allele sequences were filtered to only include the commonly typed HLA loci: HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1 and -DPB1 (Hamed et al., 2018). The alleles were converted to two-field resolution, as higher resolution typing does not affect the amino acid sequence of the protein (Kramer et al., 2020).
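The filtering and two-field conversion described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the header layout (`>HLA:HLA00001 A*01:01:01:01 365 bp`) is an assumption about the FASTA file that should be checked against the downloaded release.

```python
# Loci listed in the procedure above.
COMMON_LOCI = {"A", "B", "C", "DRB1", "DRB3", "DRB4", "DRB5",
               "DQA1", "DQB1", "DPA1", "DPB1"}


def to_two_field(allele):
    """Truncate an allele name such as 'A*01:01:01:01' to 'A*01:01'."""
    locus, fields = allele.split("*")
    return locus + "*" + ":".join(fields.split(":")[:2])


def load_two_field_proteins(fasta_text):
    """Return {two-field allele name: protein sequence} for the common loci.

    Assumes each header looks like '>HLA:HLA00001 A*01:01:01:01 365 bp'
    (an assumption about the file layout). Only the first entry seen for
    each two-field name is kept.
    """
    proteins = {}
    name = None  # two-field name currently being read, or None to skip
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            allele = line.split()[1]
            locus = allele.split("*")[0]
            name = to_two_field(allele) if locus in COMMON_LOCI else None
            if name in proteins:
                name = None          # duplicate two-field name: skip it
            elif name is not None:
                proteins[name] = ""  # start a new sequence
        elif line and name is not None:
            proteins[name] += line   # sequence lines may be wrapped
    return proteins
```

Keeping only the first entry per two-field name is a simplification; it relies on the statement above that the later fields do not change the protein sequence.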

Study Cohorts. The STAR files were obtained from the United Network for Organ Sharing (U.N.O.S.) and include donor and recipient transplant data dating back to 1987. The large dataset was processed, resulting in a small, manageable dataset of living-donor kidney transplantations. The dataset contains past donor and recipient HLA alleles along with the rejection outcome. Chronic rejection was defined as rejection episodes that occur at least one year after the transplant (Vaillant & Mohseni, 2023).

HLA-Epi is another model that calculates the epitopic mismatch load between potential recipient-donor pairs. The HLA-Epi dataset contains donor and recipient HLA alleles along with their calculated compatibility scores (Geffard et al., 2022). Even though the model focuses on direct allorecognition, the compatibility scores can be used to validate the proposed model’s performance through regression models. Additionally, the dataset includes scores calculated by the PIRCHE-II model for the same donor and recipient alleles. The PIRCHE-II model is another algorithm that predicts indirectly recognizable HLA epitopes (Geneugelijk & Spierings, 2020). The PIRCHE-II model does not consider solvent-accessible mismatches. Therefore, the scores in the HLA-Epi dataset can be used to compare the performance of the proposed model with competitor models.

Bioinformatics Servers:

Bioinformatics servers were used to analyze and compare the amino acid sequences of donor and recipient HLA alleles. NetSurfP from the Danmarks Tekniske Universitet (DTU Health Tech) was used to predict the surface accessibility of individual amino acids in an amino acid sequence. Additionally, NetMHCIIpan from DTU Health Tech was used to predict the binding affinity and eluted ligand score of donor HLA peptides to recipient HLA class II alleles.

Software and Software Packages:

Google Colaboratory was used to code the machine learning models, as it is a hosted Jupyter Notebook service for writing and executing Python code through the browser. Microsoft Excel was used to format the data as tables to make it easier to upload as a data frame into Google Colab. The HLA Epitope Mismatch Algorithm (HLA-EMMA) was used to validate amino acid mismatch results. Python libraries such as “Pandas” were used to import Excel data files, and “NumPy” was used to support the large arrays in the data files. Additionally, “Matplotlib” was used to visualize data, and “Seaborn” was used to create a confusion matrix. Lastly, the Statistical Analysis System (SAS) software was used to convert the U.N.O.S. data files into a readable Excel file.

Modified Needleman-Wunsch Algorithm:

The IPD-IMGT/HLA database contains allele sequences of different lengths. However, to find the amino acid mismatches, the sequences must be of equal length so that they can be vertically aligned. Therefore, a modified Needleman-Wunsch algorithm was used to bring the sequences to equal lengths. The Needleman-Wunsch algorithm is a common global alignment method that uses a scoring matrix and dynamic programming to find the optimal alignment between two sequences (Mittal, 2024). The traditional algorithm adds gaps between the protein sequences, representing the evolutionary changes between the two sequences. The gaps attempt to optimize the alignment score and reveal any mutations, insertions, or deletions that may have occurred over time (NandiniUmbarkar, 2020). However, to find the amino acid mismatches between the donor and recipient FASTA sequences, there should not be any additional modifications to the sequence. Therefore, the model uses a similar scoring system but has a very high gap penalty. The gap penalty is a negative score that is added to the alignment score any time a gap is inserted in the sequences (Mount, 2008). With a large negative gap penalty, every inserted gap sharply lowers the overall score, so the highest-scoring alignment is the one that modifies the sequences as little as possible.
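To make the effect of the gap penalty concrete, here is a minimal sketch of Needleman-Wunsch global alignment with a large gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-100`) are illustrative assumptions, not the values used in the project, and a simple match/mismatch score stands in for whatever scoring matrix the project used.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-100):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a

    # Trace back from the bottom-right corner to rebuild the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

With equal-length inputs such as `needleman_wunsch("GSHSMRYF", "GSHSLRYF")`, the heavy penalty means no gaps are introduced and the sequences come back unchanged. When the lengths differ, the algorithm inserts only as many gaps as are needed to equalize them.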

Finding Solvent-Accessible Amino Acid Mismatches

Solvent-accessible amino acids are amino acids in a protein that are exposed to the solvent surrounding the protein. These amino acids have a much higher chance of being recognized by T-cells. Therefore, NetSurfP was used to predict the solvent accessibility for each amino acid in the donor alleles, and the solvent-accessible positions that were also amino acid mismatches were stored for peptide analysis.
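The intersection of mismatches and solvent-accessible positions might look like the sketch below. It assumes the sequences are already aligned to equal length and that a relative solvent accessibility (RSA) value per donor residue has been exported from NetSurfP. The function names and the 0.25 exposure threshold are assumptions made for illustration.

```python
def donor_mismatches(donor, recipients):
    """Return 1-based positions where the donor residue differs from the
    residue at the same position in every recipient allele.

    All sequences are assumed to be aligned to the same length.
    """
    positions = []
    for idx, residue in enumerate(donor):
        if residue == "-":
            continue  # an alignment gap is not a donor residue
        if all(rec[idx] != residue for rec in recipients):
            positions.append(idx + 1)
    return positions


def solvent_accessible(positions, rsa, threshold=0.25):
    """Keep positions whose RSA value meets the (assumed) threshold.

    rsa[i] is the predicted relative solvent accessibility of donor
    residue i + 1.
    """
    return [p for p in positions if rsa[p - 1] >= threshold]


# Made-up example data, for illustration only.
donor = "ASHSMRYF"
recipients = ["GSHSLRYY", "GSHSLRYF"]
rsa = [0.60, 0.10, 0.10, 0.10, 0.05, 0.40, 0.30, 0.20]

mismatches = donor_mismatches(donor, recipients)
exposed = solvent_accessible(mismatches, rsa)
```

Requiring the donor residue to be absent from every recipient allele mirrors the "unique mismatch" idea discussed in the Analysis section.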

Generating Donor-Derived Peptide Chains

NetMHCIIpan was used to generate donor-derived peptides that were 15 amino acids in length. The binding affinity and eluted ligand score were found for all generated peptides to find the strongest peptides. The donor alleles were used for peptide sequence generation, and the recipient MHC class II molecules were supplied as the binding molecules.
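NetMHCIIpan performs the peptide generation itself, but the idea is a simple sliding window over the donor sequence. The sketch below only illustrates that windowing step. The function name is hypothetical and the input is a short made-up fragment, not a full allele sequence.

```python
def generate_peptides(sequence, length=15):
    """Return (start, peptide) for every window of the given length.

    start is the 1-based position of the peptide's first residue.
    """
    return [(i + 1, sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]


# A 20-residue fragment yields 20 - 15 + 1 = 6 overlapping 15-mers.
peptides = generate_peptides("TQFVRFDSDAASPRTEPRAP")
```

Because neighbouring windows differ by a single residue, strong binders tend to appear in runs of overlapping peptides, which matches the shifted peptides described in the Discussion.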

Filtering Peptides With Binding Affinity and Eluted Ligand

The eluted ligand score is the likelihood of a peptide being an MHC ligand, while binding affinity is the strength of attraction between the peptide and the molecule (Wongklaew et al., 2024). NetMHCIIpan reports the strongest binding peptide sequences to each MHC class II molecule. Out of those, the peptides containing the solvent-accessible amino acid mismatches were stored as the most significant peptides that may cause rejection.
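The filtering step could be expressed as in the sketch below. The record fields (`start`, `peptide`, `el_rank`) and the rank cutoff are hypothetical and chosen for illustration. They are not NetMHCIIpan's actual output column names or the thresholds used in this project.

```python
def significant_peptides(predictions, mismatch_positions, max_rank=1.0):
    """Keep strong binders that cover at least one mismatch position.

    predictions: list of dicts with hypothetical keys
        'start'   - 1-based start of the peptide in the donor sequence
        'peptide' - the peptide sequence
        'el_rank' - eluted-ligand percentile rank (lower = stronger)
    mismatch_positions: 1-based solvent-accessible mismatch positions
    """
    mismatches = set(mismatch_positions)
    kept = []
    for row in predictions:
        covered = range(row["start"], row["start"] + len(row["peptide"]))
        if row["el_rank"] <= max_rank and mismatches.intersection(covered):
            kept.append(row)
    return kept


# Made-up prediction rows, for illustration only.
rows = [
    {"start": 1, "peptide": "TQFVRFDSDAASPRT", "el_rank": 0.5},   # kept
    {"start": 20, "peptide": "EPRAPWIEQEGPEYW", "el_rank": 0.5},  # no mismatch
    {"start": 3, "peptide": "FVRFDSDAASPRTEP", "el_rank": 5.0},   # weak binder
]
selected = significant_peptides(rows, mismatch_positions=[10])
```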

Machine-Learning Model Training and Testing:

After the model is completed, the HLA-Epi data will be used to create regression models between the predicted compatibility score and the true compatibility score. The donor and recipient samples will be run through the model, and the predicted scores will be recorded. Then, the true score of each sample will be matched with the predicted score from the model. Regression models will be made to validate the model’s ability to accurately predict a score for a sample on a scale. The model will be improved until it reaches an accuracy of at least 70%. If needed, feature selection algorithms such as random forest will be used to find the most influential HLA alleles, which can improve the accuracy of the regression models.
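As a rough illustration of the validation step, the sketch below fits a one-variable ridge regression of true scores on predicted scores and reports R^2. It uses only the Python standard library so that the arithmetic is visible. The project's actual regression code is not shown in this write-up and may well have used a library such as scikit-learn. The data values and the penalty `alpha` are made up.

```python
def fit_ridge_1d(x, y, alpha=1.0):
    """Ridge regression with one predictor and an unpenalized intercept.

    Minimizes sum((y - a - b*x)^2) + alpha * b^2, which gives
    b = Sxy / (Sxx + alpha) and a = mean(y) - b * mean(x).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return slope, intercept


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up scores: model output (x) versus reference compatibility score (y).
model_scores = [1.0, 2.0, 3.0, 4.0]
true_scores = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_ridge_1d(model_scores, true_scores, alpha=5.0)
fitted = [intercept + slope * s for s in model_scores]
r2 = r_squared(true_scores, fitted)
```

The penalty shrinks the slope toward zero, so even on perfectly linear data the R^2 falls below 1 once `alpha` is greater than zero.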

Table 1: Mismatches and Solvent Accessible Mismatches for Donor Allele B*07:02

Figure 1: Significant peptides for Donor allele B*07:02. Blue is binding affinity score and red is eluted ligand scores.

Figure 2: Box and Whisker Plot of Scores for Rejection and No-Rejection Groups

Figure 3: Decision Matrix for Compatibility Score Regression Models

Figure 4: Scatter Plot for Ridge Regression Model after Feature Selection

Analysis

Chronic organ rejection is a dangerous and prevalent medical condition after a transplant, and a model was created to identify minute differences between donor and recipient HLA alleles to predict rejection and find targets for precise immunosuppression. The amino acid differences that cause rejection are different for every transplant, and the medications should be as well.

Based on these findings, it can be determined that predicting MHC-peptide complexes can be used to predict rejection. Focusing on amino acid differences between donor and recipient sequences provided a more accurate understanding of the specific peptides that had a higher chance of immunogenicity. Unique mismatches were also important in reducing the number of features the model would use. For example, all the mismatches in Table 1 are amino acids in the donor that were not present in either of the recipient alleles, which allowed only significant mismatches to surface and influenced the peptide selection. While NetMHCIIpan reported multiple strong binding peptides, only the ones that contained solvent-accessible mismatches were stored. Most of the strong peptides contained solvent-accessible mismatches, and multiple strong peptides covered many of the same amino acid positions. Donor and recipient alleles with more strong peptides also contained a greater number of solvent-accessible mismatches. It is probable that the higher number of mismatches led to the higher number of strong peptides, which provides evidence that those alleles are more immunogenic. In both cases, many of the peptides repeated for both of the recipient alleles, which again supports using peptides to find immunosuppressive targets, as repeated peptides have a higher chance of initiating an immune response.

Additionally, the results from the U.N.O.S. dataset show that rejection samples tended to receive higher scores and non-rejection samples lower scores. A two-sample t-test showed that the difference in mean scores between the groups was statistically significant. The significance reinforces the model’s ability to present different scores based on the rejection outcome. As peptides are counted for the targets, a greater number of peptide possibilities corresponds to a greater chance of rejection. However, the box and whisker plots have a significant overlap in scores, and everyone has at least some difference in their DNA, so it is more beneficial to understand the compatibility score of a specific recipient and donor combination. Therefore, comparing the model’s scores to already tested compatibility scores can give us more insight into the model’s accuracy.
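For readers unfamiliar with the test, the sketch below computes a two-sample t statistic in its Welch (unequal-variance) form, along with the approximate degrees of freedom. The write-up does not say which variant of the t-test was used, so Welch's form is an assumption here, and the score lists are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group's mean (sample variance / n).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Made-up scores for illustration only.
rejection_scores = [2.0, 4.0, 6.0, 8.0]
no_rejection_scores = [1.0, 2.0, 3.0, 4.0]
t_stat, dof = welch_t(rejection_scores, no_rejection_scores)
```

The p-value would then be read from a t distribution with `dof` degrees of freedom, for example with a statistics package.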

Regression models were created and analyzed to find the correlation between the model’s scores and true compatibility scores. The ridge regression model performed the best, with an R^2 value of 0.626. However, to reach the desired accuracy, finding the most influential HLA alleles can aid in making the model more accurate. By finding the most influential HLA allele types, a greater weight can be added to those alleles. After conducting a random forest feature selection, HLA-A and HLA-B were found to be the most important HLA types. By giving those alleles the greatest weights, the R^2 value increased to 0.723. The increase in accuracy provides evidence that those alleles are the most influential in the rejection outcome, and clinicians should make an effort to match donors and recipients with a high similarity in those alleles.
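One simple way to picture the locus weighting is a weighted sum of per-locus peptide counts, as in the sketch below. The function name, the counts, and the weight values are all hypothetical. The write-up does not state the actual weights or the exact form of the score.

```python
def weighted_score(peptide_counts, weights, default_weight=1.0):
    """Sum per-locus peptide counts, scaling each by its locus weight.

    peptide_counts: {locus: number of significant peptides}
    weights: {locus: weight}; loci not listed use default_weight
    """
    return sum(count * weights.get(locus, default_weight)
               for locus, count in peptide_counts.items())


# Hypothetical example: HLA-A and HLA-B are weighted twice as heavily.
counts = {"A": 3, "B": 2, "DRB1": 4}
locus_weights = {"A": 2.0, "B": 2.0}
score = weighted_score(counts, locus_weights)
```

In this made-up example the unweighted score would be 9, while the weighted score counts the HLA-A and HLA-B peptides double.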

In the end, all the objectives were accomplished, as the model produced peptides that could potentially be used as immunosuppressive targets. A potential limitation is that the model has not been tested clinically. However, validating our results with current models, such as HLA-EMMA, can provide more confidence in our methods and results.

Future Research

Future research would include creating models that could support other organ transplants, such as heart, lung, or liver. Additionally, a similar model could be created by focusing on the direct pathway or antibody-mediated rejection. There is also work that can be done to optimize the machine learning algorithms, including adding more features or testing the model with external datasets. Similarly, training the model with more patient information, such as age, weight, and family history, could potentially improve the model. These studies could improve donor selection and decrease the need for immunosuppressors. In short, these future research opportunities have the potential to revolutionize the healthcare industry.

Discussion and Conclusion

The ultimate objective of this project was to create a machine-learning model that can predict the risk of rejection, given donor and recipient HLA sequences, by finding the most significant peptides. Amino acid sequence data was obtained from the IPD-IMGT/HLA database, and sample donor and recipient HLA sequences were obtained from U.N.O.S. Using Google Colab, the amino acid mismatches were identified, and NetSurfP was used to find the solvent-accessible mismatches. These mismatches have a higher chance of being recognized by recipient T-cells because they are exposed to the solvent surrounding the protein. Then, NetMHCIIpan was used to generate donor-derived peptides and calculate the binding affinity to the recipient alleles. Strong binding peptides that contained the solvent-accessible mismatches were stored as the peptides that have the highest chance of being immunogenic. After analyzing the results, it was evident that there was a correlation between solvent-accessible mismatches and the number of strong peptides that were present, with a greater number of strong peptides correlating with a higher chance for rejection.

Additionally, many of the peptide sequences had overlapping positions or were in a sequence region with multiple amino acid mismatches. For example, in the donor allele sequence B*35:03, the peptide sequence TQFVRFDSDAASPRT was predicted to strongly bind to multiple recipient MHC molecules. Similarly, all of the peptides in the donor allele sequence B*07:02 predicted to bind to the recipient were extremely similar, shifted by only one or two amino acid positions in the sequence. This supports the conclusion that similar peptide sequences are likely to cause rejection, as they have a higher chance of binding to multiple recipient alleles. Validating the results with HLA-EMMA supports the proposed methodology, and the model can be improved in the future by including more features. The amino acid differences that cause rejection are different for every transplant, and the medications should be as well. With this model, we can not only keep the organ safe, but also keep the patient healthy throughout their life.

References

February Fair Poster