STEM is a class where students at Mass Academy learn research and engineering skills by working on an independent research project and an assistive technology project. This page focuses on the independent research project (for the assistive technology project, see STEM II). During the independent research project, students learn to document their project, use data analysis tools, and use laboratory materials. Students write a grant proposal and a STEM thesis for this project, which is later submitted to the MSEF/ISEF competition.
My project focuses on fighting the various societal biases that large language models (LLMs) exhibit. Previous work has demonstrated that LLMs show various biases based on race, gender, ethnicity, religion, and more. This work uses a promising new interpretability-based technique called Contrastive Activation Addition (CAA) to change an LLM's behavior and reduce its measured scores on a bias benchmark. For this project, I use Meta's open-source Llama 3 model.
Large Language Models (LLMs) have recently seen increasing adoption in fields such as law, medicine, and recruiting, where decisions should be kept as unbiased as possible. Previous work has shown that these models' responses reflect various societal biases based on race, gender, occupation, and religion. Contrastive Activation Addition (CAA) is a technique that has shown promise in changing the behavior of language models, and previous work has found it to be more effective than fine-tuning, the traditional approach to altering model behavior, in various circumstances. CAA generates a steering vector that can be added to the activations of a layer during the forward pass, using far less data than traditional fine-tuning approaches. This project used CAA to reduce the effect of societal biases on the outputs of Llama 3, an LLM by Meta AI. It also aimed to identify neurons whose activations correlate with societal biases and to determine whether individual neurons correlate with multiple biases at once. Biases were measured with a numerical benchmark before and after CAA was applied, and two-sample t-tests were used to determine whether CAA had a significant effect on bias benchmark scores. CAA produced a statistically significant reduction in bias benchmark scores for racial and gender-based biases. This work provides a valuable methodology for future researchers looking to investigate internal representations of biases in language models and for AI companies that aim to reduce the societal biases present in the responses of their premier models.
Large Language Models have consistently shown poor scores in various benchmarks for societal biases, and these models are still largely treated as "black boxes" – little is known about how they operate and process data. How are societal biases represented internally in Large Language Models, and how effective is Contrastive Activation Addition in reducing these biases?
It was hypothesized that steering would have a significant effect on reducing bias scores, and that it would have the greatest effect when applied to one of the layers near the center of the model rather than layers near the beginning or end of the network. Furthermore, it was hypothesized that the neurons and circuits corresponding to different societal biases are similar and correlated: neurons that activate in response to racial bias, for example, also activate in response to gender-based bias.
In recent times, language models have grown from a topic of research to an everyday tool. Today, these
models are increasingly being used for important decisions such as processing resumes of promising
hires (Deshmukh & Raut, 2024), choosing how medicine is administered (Giordano et al., 2021), and
writing legal documents that are used in court (Khan et al., 2024). However, these models are trained
on human-made data, and they often adopt human societal biases from this data. Today's most advanced
language models still score around two times worse than humans on bias benchmarks involving societal
biases such as race or gender bias (Chen et al., 2024).
Much work has been done to date regarding the identification and measurement of biases in language
models. Previous papers have established that models show clear biases related to race (Yang et al.,
2024), gender (Kotek et al., 2023), occupation (Xue et al., 2023), and religion (Abid et al., 2021).
Previous work has identified various points of intervention for biases, from the word embeddings stage
(Papakyriakopoulos et al., 2020) to the prompt-engineering stage (Bevara et al., 2024). However, the
emerging field of mechanistic interpretability has allowed for internal techniques that have shown
promise. Researchers have managed to locate circuits corresponding to different items and concepts
in the transformer networks behind large language models. For example, one paper identified circuits
that correspond to honesty in the outputs of the model Llama-2 Chat (Zou et al., 2023) using approaches
from the field of representation engineering, which observes neuron activations in response to different
prompts in order to correlate neurons with concepts.
This field is additionally relevant as it has identified ways to steer towards or away from certain
concepts in these models. Particularly, an approach called Contrastive Activation Addition (CAA) has
shown promise in being more effective than traditional fine-tuning based approaches for adjusting
model behavior. Fine-tuning relies on additional labelled data to train models for specific tasks,
while CAA requires far less data for similar results (Panickssery et al., 2023). This helps AI companies
save on costs and energy, while also mitigating the environmental impact of storing large amounts of
additional data and training the model on it. A CAA-based approach has previously been attempted with Llama-2
Chat (Panickssery et al., 2023) and models from the GPT-2 family (Turner et al., 2023), where it has
been used to steer away from concepts like sycophancy and towards concepts like happiness.
This project uses a technique called Contrastive Activation Addition (CAA) to change the behavior
of the Llama 3-3B model and debias it. CAA works by giving the model a set of prompts whose content
demonstrates the desired behavior and averaging the model's activations for these prompts at a particular
layer of choice. The same is done at that layer for a set of prompts demonstrating the opposite, undesired
behavior. The difference between the two average activations is then used as a steering vector. To steer
the model on future prompts, this steering vector is added to the residual stream at the same layer during
the forward pass.
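As a rough illustration of this process, the sketch below computes a steering vector as the difference between mean last-token activations for two contrastive prompt sets, using Hugging Face Transformers and PyTorch. The checkpoint name, layer index, and prompt lists are illustrative placeholders, not the exact ones used in this project.

```python
# Minimal sketch of CAA steering-vector extraction (assumed setup, not the
# project's exact code). Checkpoint, layer, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # assumed 3B Llama 3 checkpoint
LAYER = 20  # layer whose activations are contrasted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompts, layer):
    """Average the last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[k] is the
        # residual stream after the k-th decoder layer.
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets: one demonstrating the desired (unbiased) behavior,
# one demonstrating the undesired (biased) behavior. Placeholders shown here.
positive_prompts = ["<example prompt showing an unbiased decision>"]
negative_prompts = ["<example prompt showing a biased decision>"]

steering_vector = (mean_activation(positive_prompts, LAYER)
                   - mean_activation(negative_prompts, LAYER))
```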
Steering vectors were generated for both racial and gender biases using the process described above.
Then, the model was scored on a benchmark that evaluated biases and focused on the model's decision
making in real-world scenarios. Bias scores were taken for a legal sentencing scenario, a hiring scenario,
and a medical administration scenario. Biases were measured before and after steering was applied,
and the steering vectors were added with a coefficient of 12. Each steering vector was tested at layers
17, 20, and 25 to see how the layer at which steering is applied affects the model's bias benchmark scores.
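The sketch below shows one way this evaluation could be wired up, continuing from the previous snippet: a forward hook on the chosen decoder layer adds the scaled steering vector to its output, and the model is then scored on each scenario. The benchmark function here is only a stub standing in for this project's custom benchmark.

```python
# Sketch of applying the steering vector during evaluation, assuming `model`,
# `tokenizer`, and `steering_vector` from the previous snippet.
COEFFICIENT = 12  # steering strength used in this project

def make_steering_hook(vector, coeff):
    def hook(module, inputs, output):
        # Llama decoder layers return a tuple; hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)  # broadcast over all tokens
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def run_bias_benchmark(model, tokenizer, scenario):
    """Stub for this project's bias benchmark; returns a dummy score."""
    return 0.0

for layer in (17, 20, 25):
    handle = model.model.layers[layer].register_forward_hook(
        make_steering_hook(steering_vector, COEFFICIENT)
    )
    try:
        for scenario in ("legal_sentencing", "hiring", "medical_administration"):
            score = run_bias_benchmark(model, tokenizer, scenario)
            print(layer, scenario, score)
    finally:
        handle.remove()  # restore the unsteered model before testing the next layer
```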
To evaluate significance, a two-sample t-test was performed on the bias benchmark scores for each scenario
without steering and with steering applied at layer 20. A two-sample t-test was used instead of an ANOVA
because this project aimed to see how specific biases were affected in specific scenarios, rather than
drawing general conclusions about benchmark scores across all scenarios at once.
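A minimal example of this test is shown below, using SciPy; the score lists are illustrative placeholders rather than the project's actual results.

```python
# Two-sample t-test comparing bias benchmark scores without steering and with
# steering at layer 20. Scores below are placeholders, not real results.
from scipy.stats import ttest_ind

baseline_scores = [0.42, 0.38, 0.51, 0.47]  # placeholder scores, no steering
steered_scores = [0.21, 0.18, 0.27, 0.24]   # placeholder scores, steering at layer 20

t_stat, p_value = ttest_ind(baseline_scores, steered_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: steering significantly changed bias scores.")
```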
These results show bias benchmark scores from Llama 3 (3 billion parameters) for different demographic groups. Racial
and gender-based biases were reduced on all benchmarks, with all results having p < 0.05. This suggests that
the null hypothesis should be rejected.
These results are valuable to AI companies and future researchers. Today's leading models demonstrate
various biases, a problem that many companies currently address with fine-tuning-based approaches. These approaches
require additional labelled data or human feedback (reinforcement learning), which costs additional time and money.
Given that these models are used for everything from medicine administration (Giordano et al., 2021) to job
hiring (Deshmukh & Raut, 2024) today, reducing biases is a major priority. This work can help provide a
cost-effective alternative solution in the form of CAA.
These results are also valuable because they uncover important details about the interpretability of models
like Llama 3. By finding bias-correlated neurons across layers, this project builds on our understanding
of how the model processes information to generate outputs. This result suggests that bias representations
are not localized to a specific part of the network, which is useful for future researchers deciding where
in the model debiasing interventions should be applied.
AI shows various biases on the basis of race and gender. Interpretability-based approaches like contrastive
activation addition (CAA) have shown promise for related tasks.
This project uses CAA together with a custom benchmark covering these high-stakes decision scenarios in order
to measure the difference that CAA can make in the biases of large language models. It also investigates the
neurons most correlated with societal biases to see if neurons that correlate with one bias correlate with others as well.
Results indicate that CAA does cause a significant reduction in biases, and that neurons correlated with
different biases are largely similar to one another. CAA was found to be most effective around layer 20.