STEM

Course Description

In this course, students develop technical scientific research and writing skills by working on a five-month-long independent research project. We present our findings at the school-wide science fair in February.

Predicting Vitiligo Antigens Utilizing Machine Learning

Abstract: Vitiligo is a B and T lymphocyte-mediated autoimmune condition that destroys melanocytes, manifesting as depigmented patches of skin. Approximately 1% of the global population is affected. Vitiligo is associated with cochlear dysfunction, inner ear diseases, diabetes mellitus, thyroid disease, metabolic syndrome, and other conditions (Wang et al., 2024). Various studies report antibodies directed against melanocyte-specific protein autoantigens, such as tyrosinase, leading to melanocyte destruction (Faraj et al., 2021). Currently utilized immunosuppressive treatments, such as corticosteroids, have demonstrated some efficacy. However, because these treatments are non-specific, patients often suffer from various side effects and potential harm (Wang et al., 2024). To address these critical gaps, we propose to identify patient-specific melanocyte antigens by developing a Python-based machine learning model trained on an existing database (Gupta et al., 2019). Identified neo-antigens can be targeted to develop personalized therapy. Once trained, the machine learning program will be validated through MHC-I and MHC-II binding predictions. The current dataset includes gene RPKM values for approximately 9902 genes associated with vitiligo. We plan to use these values, combined with their z-score deviation from normal red blood cells, to predict the gene with the strongest association to vitiligo. Three sample melanocytes in the dataset have p-values less than 0.01, which makes them statistically significant. We will feed this data into a series of LSTMs that output genes and correlation values. Our findings will build a foundation for disease-specific machine learning tools aimed at identifying actionable drug targets.

Keywords: vitiligo, T-cells, autoimmune disease, neo-antigens

Project Graphical Abstract

Research Proposal

Engineering Problem: Vitiligo is an autoimmune skin condition characterized by the loss of melanocytes, which is triggered by CD8+ T-cell recognition of neoantigens. Identifying these neoantigens is critical for developing therapies and more personalized treatments. Traditional experimental approaches to identifying them are resource-intensive, and current treatments often have severe side effects.

Engineering Objective: The goal of this project is to use a machine learning algorithm to identify protein sequences that predict a peptide or agent capable of blocking the neoantigens responsible for initial immune recognition in vitiligo.

Problem Graphical Abstract

Background

Imagine waking up one day and finding that patches of your skin have turned white. Researching the issue, you find that the condition, vitiligo, has no cure. Even worse, you learn that it will always continue to spread, with treatments only helping to slow the process. Today, over 70 million people worldwide are affected by vitiligo, an autoimmune condition in which patches of skin lose pigment. This happens because the body’s immune system, specifically B and T lymphocytes, attacks melanocytes, as shown in Figure 1. Melanocytes are the cells that produce melanin, the pigment that gives skin its color. There are two main types of vitiligo: segmental and non-segmental. Non-segmental vitiligo is more common and spreads slowly on both sides of the body. Segmental vitiligo causes rapid color loss on one side of the body. Although no definitive cure exists, scientists have identified various risk factors and genes related to this immune response. Vitiligo itself is not life-threatening, but it affects patients' quality of life by weakening the immune system and increasing susceptibility to sunburn and other conditions, such as ear infections.
Current treatments include topical treatments such as corticosteroids and calcineurin inhibitors, phototherapies such as psoralen and UVA therapy, and surgical treatments such as skin grafting (Bergqvist & Ezzedine, 2020). Newer treatments include advanced therapy medicinal products (Ghashghaei et al., 2023) and targeted therapies such as JAK inhibitors, for example ruxolitinib cream (Passeron et al., 2024). Another therapeutic method is CRISPR-Cas9 gene therapy, which has been explored for different autoimmune diseases (Lee et al., 2022). However, gene therapies often completely inhibit the immune response by “knocking out” genes, leading to various severe side effects and potential harm. Moreover, the molecular triggers that initiate this immune response, potentially in the form of neoantigens, remain poorly understood.
Neoantigens are mutated peptides, highly specific to individuals, that elicit an immune response. In cancer, neoantigens are generated by neoplastic cells and can be targeted by immunotherapy. Cancer immunotherapies are rising in popularity, as they demonstrate high efficacy with improved survival (Lu & Robbins, 2015). In the context of vitiligo, melanocytes generate self-antigens, which trigger an immune response that leads to their destruction. These neoantigens are thought to be produced when melanocytes encounter oxidative stress. Melanocytes are particularly vulnerable to somatic mutation because of the reactive oxygen species (ROS) produced during melanin production (Faraj et al., 2021). Exposure to environmental stressors such as UV, certain chemicals, and pollutants further contributes to oxidative stress. Antigens currently associated with vitiligo include VIT 90, VIT 75, VIT 40, gp100, MART1, and tyrosinase (Cui et al., 1995). Stress-induced proteins like heat shock protein 70 (HSP70i) have also been considered as enhancers of the immune response (Schmidt, 2020). However, the “trigger” antigen, or the antigen that initiates the autoimmune response, is not yet known. If it were identified, a specific and potentially curative treatment could be developed. Moreover, because of its specificity, the adverse effects would be minimal.
Despite significant advances in understanding vitiligo's pathogenesis, critical knowledge gaps remain regarding these specific trigger antigens. Current studies have identified melanocyte-associated proteins as potential immune targets, but these findings do not explain the variability in patient-specific immune responses. Additionally, the role of stress-induced neoantigens, which may arise from oxidative damage, is poorly characterized, leaving a large portion of the vitiligo antigen field unexplored (Faraj et al., 2021). Machine learning (ML) is a branch of artificial intelligence that develops algorithms to make predictions based on a pool of data.
Neoantigen prediction is a growing field because of its potential for developing effective therapies for various diseases (Cai et al., 2023). ML enables the integration of large-scale datasets to identify patterns and correlations. Furthermore, ML-driven approaches can account for patient-specific variability, supporting personalized treatment options and identifying the antigens that drive responses in different individuals, which makes ML a promising avenue. The goal of this project is to identify immunogenic neoantigens specific to individual patients, perform peptide sequencing, and identify actionable drug targets through a machine learning algorithm. Python-based code and bioanalytic programs will be used to find and analyze protein sequences from existing clinical data. We hypothesize that the identified neoantigens will be specific to each patient. Moreover, the group of peptides identified in vitiligo patients would be significantly different from that in healthy individuals. Based on our findings, we may identify effective therapeutic agents that will improve the quality of life of patients diagnosed with vitiligo.

Background Graphical Abstract

Fig. 1: This figure displays the risk factors and reactions involved with antigens and T cells in the pathogenesis of vitiligo (Bergqvist & Ezzedine, 2020).

Procedure

I used Python, Anaconda, IEDB Analysis Resource, and Excel for my project. An LSTM was coded in Visual Studio Code to predict causal genes based on a dataset.
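As a rough illustration of what an LSTM computes at each step, the sketch below runs the forward pass of a single LSTM cell in plain Python. It is not the project's model: the function names (`lstm_step`, `init_params`), the random weights, the hidden size, and the input sequence are all hypothetical, and a real model would normally be built and trained with a deep learning library.

```python
import math
import random


def sigmoid(x):
    """Logistic function used by the LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size, hidden_size, seed=0):
    """Create small random weights for the four LSTM transforms (hypothetical values)."""
    rng = random.Random(seed)

    def layer():
        width = input_size + hidden_size
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
        bias = [0.0] * hidden_size
        return weights, bias

    return {name: layer() for name in ("forget", "input", "output", "candidate")}


def lstm_step(x, h_prev, c_prev, params):
    """Advance one time step and return the new hidden state and cell state."""
    z = x + h_prev  # list concatenation: current input followed by previous hidden state

    def affine(name):
        weights, bias = params[name]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(weights, bias)]

    f = [sigmoid(a) for a in affine("forget")]       # how much old memory to keep
    i = [sigmoid(a) for a in affine("input")]        # how much new information to admit
    o = [sigmoid(a) for a in affine("output")]       # how much memory to expose
    g = [math.tanh(a) for a in affine("candidate")]  # proposed new memory content

    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Hypothetical input: four time steps with one feature each (made-up numbers).
sequence = [[0.5], [1.2], [0.3], [2.0]]
params = init_params(input_size=1, hidden_size=3)
h, c = [0.0] * 3, [0.0] * 3
for x in sequence:
    h, c = lstm_step(x, h, c, params)
print(h)
```

The final hidden state `h` summarizes the whole sequence; in a trained model it would be passed to an output layer that produces the prediction.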

Normal Probability Plot

Figure 2: This figure shows the normal probability plot of the dataset. The data is skewed to the left, since many of the values have a high z-score in comparison to the sample percentile.
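For readers who want to see how such a plot is built, here is a minimal sketch that pairs each sorted value with the z-score a normal distribution would give at the same sample percentile. The percentile convention `(rank - 0.5) / n` is one common choice and is an assumption, as are the function name and the made-up example values; the figure above may have been generated with a different convention.

```python
from statistics import NormalDist


def normal_probability_points(values):
    """Return (sample_percentile, expected_z, observed_value) for each sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, value in enumerate(ordered, start=1):
        percentile = (rank - 0.5) / n  # assumed plotting-position convention
        points.append((percentile, std_normal.inv_cdf(percentile), value))
    return points


# Made-up z-scores, for illustration only (not taken from the dataset).
example_z_scores = [-0.4, 0.1, 0.3, 0.9, 1.5, 2.8, 4.2, 6.0]
points = normal_probability_points(example_z_scores)
for percentile, expected_z, observed in points:
    print(f"{percentile:.3f}  {expected_z:+.2f}  {observed:+.2f}")
```

If the observed values fall close to a straight line when plotted against the expected z-scores, the data is approximately normal; systematic curvature indicates skew.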

Regression Statistics

Figure 3: This figure shows the regression statistics of the dataset. The R Square is around 20 percent, meaning the regression explains only about 20% of the variance in the data.
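The R Square statistic can be reproduced by hand, which helps when interpreting it. The sketch below fits a least-squares line and computes R² as one minus the ratio of residual to total variation. The function names and the data points are hypothetical and are not taken from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variation in ys that the fitted line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 2.5, 3.0, 5.5]
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

An R² near 1 means the line accounts for most of the variation in the values, while an R² near 0 means it accounts for very little.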

Analysis: The Vitivar dataset will be used to predict the correlated genes. This dataset includes the RPKM values for 4 sample melanocytes, as well as z-scores comparing RPKM from vitiligo melanocytes and normal red blood cells for 22582 genes found in melanocytes. Figures 2 and 3 show the results of a regression analysis of the dataset. From these two figures, we can conclude that additional datasets will be needed to produce a fully representative model. Although the data is skewed, it may not need to be adjusted, since we are searching for a causal antigen. This data will be used to train the various LSTMs, which makes data collection a crucial part of the project.
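To make the z-score comparison concrete, the sketch below scores one gene's RPKM in a sample against a set of reference RPKM values and ranks genes by how far they deviate. This is an assumed formulation (sample value minus reference mean, divided by the reference standard deviation); the dataset's own z-scores may be computed differently. The gene names and numbers are placeholders.

```python
from statistics import mean, stdev


def z_score(sample_value, reference_values):
    """Standard deviations separating a sample value from the reference mean."""
    ref_sd = stdev(reference_values)
    if ref_sd == 0:
        raise ValueError("reference values have zero variance")
    return (sample_value - mean(reference_values)) / ref_sd


# Placeholder genes: (sample RPKM, reference RPKM values). All numbers are made up.
expression = {
    "GENE_A": (42.0, [10.0, 12.0, 11.0, 9.0]),
    "GENE_B": (5.1, [5.0, 4.8, 5.3, 5.2]),
    "GENE_C": (0.7, [3.0, 2.6, 3.4, 3.1]),
}

scores = {gene: z_score(sample, ref) for gene, (sample, ref) in expression.items()}

# Rank by absolute z-score so strong over- and under-expression both surface.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
for gene in ranked:
    print(f"{gene}: z = {scores[gene]:+.2f}")
```

Ranking by absolute value treats strongly under-expressed genes as candidates alongside strongly over-expressed ones.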

Discussion/Conclusions: Currently, the machine learning model is still being developed and will output a result soon. I expect that both objectives will be successfully accomplished. We expect that data collection may be difficult and the data potentially unreliable. Thus, the model must be made representative and reproducible across different datasets. An alternative to finding data online would be to obtain data in a lab setting; however, this can be difficult due to time and space limitations. Another potential limitation is that IEDB Analysis Resource can be unreliable, and findings could vary based on genes. Thus, we will use alternative machine learning algorithms and prediction tools to check that our findings are accurate.
For much of my process, there were many errors with the code and my computer. A lot of time was spent uninstalling and reinstalling Java, as well as consulting professionals and mentors. Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), R-squared (R²), accuracy (for classification tasks), and loss function values were used as metrics to evaluate the machine learning model. These metrics measure how closely the model's predictions match the given dataset.
My work is similar to past studies in that it attempts to address the side effects of current treatments. Several current studies target stress-related proteins to treat vitiligo, which is similar to what I am proposing. My research differs because, to my knowledge, no one has yet targeted antigens. The neo-antigen hypothesis for autoimmune conditions that supports my project was published in the past year, meaning my research is novel and can provide preliminary insight into how neo-antigens could play a role in the development of vitiligo. My research will improve understanding of how vitiligo develops and will shed light on its etiology.
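The evaluation metrics named above can each be computed in a few lines. The following is a minimal sketch using their standard definitions. The function names and the example predictions are hypothetical and are not results from the project's model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # mse * n is the residual sum of squares
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


def accuracy(labels_true, labels_pred):
    """Fraction of predicted class labels that match the true labels."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)


# Made-up values, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
label_accuracy = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(metrics)
print(label_accuracy)
```

MAE and RMSE are in the same units as the predicted quantity, and RMSE penalizes large errors more heavily than MAE, so reporting both shows whether a few large misses dominate the error.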

References