STEM I

In STEM class, we are guided through the process of conducting original research. We started by learning what research actually is and how to engage with scientific literature by reading and analyzing several papers. This class is a chance for each of us to explore a question or solve a problem that we are passionate about. Through this class, we gain experience and confidence in conducting independent research and writing scientific literature. In addition to scientific and engineering skills, we also improve our analytical thinking and our communication skills, both verbal and written.

STEM Quad Chart

Developing a Deep Learning Model to Predict the Health Risks for Individual Migrants

Overview

This project developed a novel deep learning model to predict individual health risks due to long-term relocation. One key challenge was finding a dataset that was large, contained long-term data, was migrant-specific, and had detailed features. Additionally, the model had to be improved for accuracy, which was difficult due to the temporal and causal nature of the problem. Overall, this project created a model that can accurately predict individual health risks, which empowers individuals to make informed decisions regarding relocation, helps healthcare professionals account for relocation in their guidance, and creates a scalable framework for more accurate disease prediction that accounts for relocation.

STEM Visual Abstract

Currently, around 40 million people relocate each year, yet none of them have a way to factor health into their decision-making, since they do not know what their individual predicted health outcomes will be. There have been several studies showing population-level trends in migrant health, but these vary significantly depending on individual factors. So, migrants need a way to predict their individual health outcomes.

Deep learning models have been shown to handle temporal and causal predictions effectively, which makes them well suited to predicting migrant health outcomes. This project addressed migrants' need for individual health predictions by using a deep learning model.

First, a dataset of migrant health data was curated from the PSID dataset (Gouskova et al., 2026). Next, a benchmark was created to evaluate models on the dataset. Finally, a new model was trained and evaluated on the dataset. The benchmark results show that current publicly available models cannot accurately predict individual health outcomes for migrants. In contrast, the new model outperforms existing models in predicting individual migrant health outcomes across multiple evaluation metrics. The results highlight the importance of incorporating individual-level and temporal factors in modeling health outcomes among migrants. In conclusion, this study developed a model that accurately predicts health outcomes for migrants, demonstrating that deep learning approaches can provide a scalable, effective framework for migrant health prediction.

Research Proposal

Problem Statement

Individuals considering a long-term relocation need a way to understand the potential health impacts, so they can make an informed decision.

Objective

The objective is to develop a model that can predict individual disease risks and health outcomes for people who grew up in one environment and relocated to another.

Background

About 40 million people relocate from one place to another each year. Current research on the health changes of these migrants shows that they generally experience improved health at first but then face the risk of worsened health as time passes (Paez-Deggeller, 2025). However, all the migrant health studies so far have been at the population level. This is a problem because population-level findings do not give individuals insight into their own potential health risks.

At the same time, there has been a recent shift toward personalization in the healthcare industry, with increasingly personalized treatments and risk diagnosis (Johnson et al., 2021). Deep learning models have been used with electronic health data to predict diseases for individuals (Amirahmadi et al., 2023). This presents an opportunity for migrant health predictions.

Methodology

First, a dataset was curated. To do this, the Panel Study of Income Dynamics (PSID) dataset was downloaded (Gouskova et al., 2026). Then it was processed and cleaned to remove missing values and to add detailed labels for the column headers.

Then, to prepare the dataset for training, the variables had to be defined. A novel continuous health score was created from several health features in the dataset. This allowed the model to predict one continuous change in risk over time rather than sparse categorical features. Next, a relocation event was defined based on the features in the dataset. Finally, the data was converted to temporal sequences to train a deep learning model.
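The three preparation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the field names (`self_rated_health`, `chronic_conditions`, `state`, `year`), the equal weighting inside `health_score`, and the rule that a relocation is a change of state between consecutive survey waves are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One survey wave for one person. All field names are hypothetical.
Wave = Dict[str, object]


def health_score(wave: Wave) -> float:
    """Combine several health features into one score in [0, 1] (higher is healthier).

    Assumed inputs: self_rated_health on a 1 (excellent) to 5 (poor) scale and
    chronic_conditions as a count, capped at 5. The 50/50 weighting is illustrative.
    """
    self_rated = (5 - int(wave["self_rated_health"])) / 4
    chronic = 1 - min(int(wave["chronic_conditions"]), 5) / 5
    return 0.5 * self_rated + 0.5 * chronic


def relocation_flags(waves: List[Wave]) -> List[int]:
    """Flag a wave with 1 when its state differs from the previous wave (assumed rule)."""
    if not waves:
        return []
    flags = [0]  # the first wave has no earlier wave to compare against
    for prev, cur in zip(waves, waves[1:]):
        flags.append(1 if cur["state"] != prev["state"] else 0)
    return flags


def to_sequence(waves: List[Wave]) -> List[Tuple[float, int]]:
    """Turn one person's waves into a time-ordered (score, relocated) sequence."""
    ordered = sorted(waves, key=lambda w: w["year"])
    scores = [health_score(w) for w in ordered]
    return list(zip(scores, relocation_flags(ordered)))
```

Each person then contributes one sequence of (health score, relocation flag) pairs, which is the kind of input a sequence model can be trained on.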

Next, a benchmark was created to evaluate several model structures on the dataset. Finally, the new model was trained and evaluated on the dataset. An LSTM model with a multi-layer, bidirectional architecture and dropout regularization was trained using a sequence-to-sequence framework designed to guide the model to understand and accurately predict health outcomes following relocation events. The model was optimized using the AdamW optimizer, and gradient clipping was applied. Training was conducted for 200 epochs using shuffled mini-batches with a masked Smooth L1 (Huber) loss. This new model was evaluated using several prediction error metrics and compared to the benchmark models using Root Mean Square Error (RMSE).
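To make the loss and the comparison metric concrete, here is a small sketch in plain Python rather than a deep learning framework, so it runs without dependencies. It assumes that the mask marks real time steps with 1 and padded steps with 0, and it uses the common Smooth L1 definition with a threshold `beta` (with `beta = 1` this coincides with the Huber loss). The function names and the default `beta` are illustrative, not taken from the project.

```python
import math
from typing import Sequence


def smooth_l1(error: float, beta: float = 1.0) -> float:
    """Quadratic for |error| < beta, linear beyond it."""
    a = abs(error)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def masked_smooth_l1(pred: Sequence[float], target: Sequence[float],
                     mask: Sequence[int], beta: float = 1.0) -> float:
    """Mean Smooth L1 loss over the positions where mask is 1."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:  # skip padded (masked-out) time steps
            total += smooth_l1(p - t, beta)
            count += 1
    return total / count if count else 0.0


def masked_rmse(pred: Sequence[float], target: Sequence[float],
                mask: Sequence[int]) -> float:
    """Root Mean Square Error over the positions where mask is 1."""
    squared = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return math.sqrt(sum(squared) / len(squared)) if squared else 0.0
```

Because the linear branch grows more slowly than a squared error, a few large errors influence the Smooth L1 loss less than they influence RMSE, which is one reason to train with the former and report the latter.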

Results

Figure 1: This figure compares the actual change in health score to the model’s predictions. Overall, the model captures the change in health score, as the points generally follow the trend of the actual values. However, the model struggles with extreme changes in health score, both positive and negative, and its predictions show variance around the actual values. Since the direction is generally the same, the model predicts close to the actual change in health score, which demonstrates its ability to predict health risk due to migration.

Figure 2: This figure shows the distribution of the model’s prediction errors. The highest frequency of error is very close to 0, with only a slight bias toward the negative. From there, the errors are distributed evenly on both sides. This shows that the model does not tend to underpredict or overpredict, which highlights its strength.
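One way to check the balance described in Figure 2 numerically is to summarize the signed errors. The sketch below is illustrative only: the function name and the sign convention (error = prediction minus actual, so negative values mean underprediction) are assumptions, not the project's code.

```python
from statistics import mean, median
from typing import Dict, Sequence


def error_summary(pred: Sequence[float], actual: Sequence[float]) -> Dict[str, float]:
    """Summarize signed prediction errors (prediction minus actual)."""
    errors = [p - a for p, a in zip(pred, actual)]
    n = len(errors)
    return {
        "mean_error": mean(errors),      # near 0 suggests little systematic bias
        "median_error": median(errors),  # sign shows which side the center leans toward
        "share_over": sum(1 for e in errors if e > 0) / n,
        "share_under": sum(1 for e in errors if e < 0) / n,
    }
```

A mean and median close to 0, together with similar shares of over- and underpredictions, would support the reading that the error distribution is balanced.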

Figure 3: This figure shows the model’s loss across epochs. The loss decreases consistently and never plateaus, which suggests that all 200 epochs were needed. The loss decreases by a factor of almost 4, a strong reduction that highlights the model’s capability to learn the data.

Figure 4: This figure compares the aggregate trajectories of the ground truth and the model predictions. The two aggregate trajectories follow almost the same path, which highlights the model’s ability to accurately predict the change in health risk over time due to migration. However, the model’s aggregate trajectory is slightly below the true mean, suggesting that the model might slightly underpredict. Additionally, the model tends not to predict extreme values, as the light blue shaded area extends farther in the positive and negative directions than the light red shaded area.
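An aggregate trajectory like the one in Figure 4 can be computed by averaging across individuals at each time step. The sketch below assumes that the line is the per-step mean and that the shaded band is one standard deviation around it. The source does not say how the band was computed, so this choice and the function name are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def aggregate_trajectory(sequences: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return (mean, standard deviation) per time step across variable-length sequences."""
    longest = max((len(s) for s in sequences), default=0)
    result = []
    for t in range(longest):
        # Only individuals observed at step t contribute to that step.
        values = [s[t] for s in sequences if len(s) > t]
        result.append((mean(values), pstdev(values)))
    return result
```

Running this once on the actual health score changes and once on the predictions gives two trajectories to compare. A narrower spread for the predictions would correspond to the model avoiding extreme values.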

Analysis

Overall, the trained LSTM model was able to accurately predict health outcomes for migrants. On the benchmark, the LSTM had the lowest error of the model structures tested, which justifies its use as the model structure for this project. Additionally, the new model’s loss decreases consistently across epochs, which suggests that 200 epochs of training were needed and shows its ability to learn the data. The model’s error distribution is balanced, showing that its predictions are not biased in either direction and that it can predict both health improvements and declines. The model’s predictions generally match the trajectory of health changes, although it struggles to predict extreme risk accurately and still shows variance.

Key Contributions

  1. A new dataset containing demographics, baseline health, health after relocation, and residential histories specifically for migrants
  2. A benchmark of various model structures on their ability to fit this type of dataset
  3. A novel continuous health risk score that helps define health risk over the years
  4. A new deep learning model that is trained on the dataset and accurately predicts health outcomes for people who live in one area and migrate to another

Discussion/Conclusion

This project created the first individualized migrant health prediction tool. It empowers informed decision-making for migrants and establishes a scalable framework for personalized migrant health predictions and preparedness. Additionally, doctors and healthcare professionals can use it to improve their suggestions for individuals based on relocation history. The project also demonstrates that the model can learn from temporal and causal data.

Future Work

  1. Map each geographic state to environmental factors
  2. Create counterfactual experiments for comparisons
  3. Test model on other datasets to evaluate performance
  4. Apply model to specific disease predictions to improve predictions based on relocations and suggest potential relocations for improved disease trajectories

References

February Fair Poster