STEM is a class where students at Mass Academy learn research and engineering skills by working on an independent research project and an assistive technology project. This page focuses on the independent research project (for the assistive technology project, see STEM II). During the independent research project, students learn to document their project, use data analysis tools, and use laboratory materials. Students write a grant proposal and a STEM thesis for this project, which is later submitted to the MSEF/ISEF competition.
My project focuses on fighting the various societal biases that large language models (LLMs) exhibit. Previous work has demonstrated that LLMs show various biases based on race, gender, ethnicity, religion, and more. This work uses a promising new interpretability-based technique called Contrastive Activation Addition (CAA) to change an LLM's behavior and reduce its measured scores on a bias benchmark. For this project, I use Meta's open-source Llama 3 model.
Large Language Models (LLMs) have recently seen increasing adoption in fields such as law, medicine, and recruiting, where decisions should be kept as unbiased as possible. Previous work has shown that these models' responses reflect various societal biases based on race, gender, occupation, and religion. Contrastive Activation Addition (CAA) is a technique that has shown promise in changing the behavior of language models, and previous work has found it to be more effective than fine-tuning, the traditional approach to altering model behavior, in various circumstances. CAA generates a steering vector that can be added to the activations of a layer during the forward pass, using far less data than traditional fine-tuning approaches. This project used CAA to reduce the effect of societal biases on the outputs of Llama 3, an LLM by Meta AI. It also aimed to identify neurons whose activations correlate with societal biases and to determine whether individual neurons correlate with multiple biases at once. Biases were measured with a numerical benchmark before and after CAA was applied, and two-sample t-tests were used to determine whether CAA had a significant effect on bias benchmark scores. CAA produced a statistically significant reduction in bias benchmark scores for racial and gender-based biases. This work provides a valuable methodology for future researchers looking to investigate internal representations of biases in language models and for AI companies that aim to reduce the societal biases present in the responses of their premier models.
Large Language Models have consistently shown poor scores in various benchmarks for societal biases, and these models are still largely treated as "black boxes" – little is known about how they operate and process data. How are societal biases represented internally in Large Language Models, and how effective is Contrastive Activation Addition in reducing these biases?
It was hypothesized that steering would have a significant effect on reducing bias scores, and that it would have the greatest effect when applied to one of the layers near the center of the model rather than layers near the beginning or end of the network. Furthermore, it was hypothesized that the neurons and circuits corresponding to different societal biases are similar and correlated: neurons that activate in response to racial bias, for example, also activate in response to gender-based bias.
In recent times, language models have grown from a topic of research to an everyday tool. Today, these
models are increasingly being used for important decisions such as processing resumes of promising
hires (Deshmukh & Raut, 2024), choosing how medicine is administered (Giordano et al., 2021), and
writing legal documents that are used in court (Khan et al., 2024). However, these models are trained
on human-made data, and they often adopt human societal biases from this data. Today's most advanced
language models still score around two times worse than humans on bias benchmarks involving societal
biases such as race or gender bias (Chen et al., 2024).
Much work has been done to date regarding the identification and measurement of biases in language
models. Previous papers have established that models show clear biases related to race (Yang et al.,
2024), gender (Kotek et al., 2023), occupation (Xue et al., 2023), and religion (Abid et al., 2021).
Previous work has identified various points of intervention for biases, from the word embeddings stage
(Papakyriakopoulos et al., 2020) to the prompt-engineering stage (Bevara et al., 2024). However, the
emerging field of mechanistic interpretability has allowed for internal techniques that have shown
promise. Researchers have managed to locate circuits corresponding to different items and concepts
in the transformer networks behind large language models. For example, one paper identified circuits
that correspond to honesty in the outputs of the model Llama-2 Chat (Zou et al., 2023) using approaches
from the field of representation engineering, which observes neuron activations in response to different
prompts in order to correlate neurons with concepts.
This field is additionally relevant as it has identified ways to steer towards or away from certain
concepts in these models. Particularly, an approach called Contrastive Activation Addition (CAA) has
shown promise in being more effective than traditional fine-tuning based approaches for adjusting
model behavior. Fine-tuning relies on additional labelled data to train models for specific tasks,
while CAA requires far less data for similar results (Panickssery et al., 2023). This helps AI companies
save on costs and energy, while also mitigating the environmental impact of storing large amounts of
additional data and training the model on it. A CAA-based approach has previously been attempted with Llama-2
Chat (Panickssery et al., 2023) and models from the GPT-2 family (Turner et al., 2023), where it has
been used to steer away from concepts like sycophancy and towards concepts like happiness.
This project uses a technique called Contrastive Activation Addition (CAA) to change the behavior
of the Llama 3-3B model and debias it. CAA works by giving the model a set of prompts whose content
demonstrates the desired behavior and averaging the model's activations for these prompts at a particular
layer of choice. The same is done at that layer for a set of prompts demonstrating the opposite, undesired
behavior. The difference between the two average activations is then used as a steering vector. To steer
the model on future prompts, this steering vector is added to the residual stream at the same layer during
the forward pass.
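As a rough illustration of this process, the sketch below computes a steering vector as the difference between mean last-token activations for two contrastive prompt sets, using Hugging Face Transformers and PyTorch. The checkpoint name, layer index, and prompt lists are illustrative placeholders, not the exact ones used in this project.

```python
# Minimal sketch of CAA steering-vector extraction (assumed setup, not the
# project's exact code). Checkpoint, layer, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # assumed 3B Llama 3 checkpoint
LAYER = 20  # layer whose activations are contrasted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompts, layer):
    """Average the last-token hidden state at `layer` over a list of prompts."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[k] is the
        # residual stream after the k-th decoder layer.
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets: one demonstrating the desired (unbiased) behavior,
# one demonstrating the undesired (biased) behavior. Placeholders shown here.
positive_prompts = ["<example prompt showing an unbiased decision>"]
negative_prompts = ["<example prompt showing a biased decision>"]

steering_vector = (mean_activation(positive_prompts, LAYER)
                   - mean_activation(negative_prompts, LAYER))
```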
Steering vectors were generated for both racial and gender biases using the process described above.
Then, the model was scored on a benchmark that evaluated biases and focused on the model's decision
making in real-world scenarios. Bias scores were taken for a legal sentencing scenario, a hiring scenario,
and a medical administration scenario. Biases were measured before and after steering was applied,
and the steering vectors were added with a coefficient of 12. Each steering vector was tested at layers
17, 20, and 25 to see how the layer at which steering is applied affects the model's bias benchmark scores.
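The sketch below shows one way this evaluation could be wired up, continuing from the previous snippet: a forward hook on the chosen decoder layer adds the scaled steering vector to its output, and the model is then scored on each scenario. The benchmark function here is only a stub standing in for this project's custom benchmark.

```python
# Sketch of applying the steering vector during evaluation, assuming `model`,
# `tokenizer`, and `steering_vector` from the previous snippet.
COEFFICIENT = 12  # steering strength used in this project

def make_steering_hook(vector, coeff):
    def hook(module, inputs, output):
        # Llama decoder layers return a tuple; hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector.to(hidden.dtype)  # broadcast over all tokens
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def run_bias_benchmark(model, tokenizer, scenario):
    """Stub for this project's bias benchmark; returns a dummy score."""
    return 0.0

for layer in (17, 20, 25):
    handle = model.model.layers[layer].register_forward_hook(
        make_steering_hook(steering_vector, COEFFICIENT)
    )
    try:
        for scenario in ("legal_sentencing", "hiring", "medical_administration"):
            score = run_bias_benchmark(model, tokenizer, scenario)
            print(layer, scenario, score)
    finally:
        handle.remove()  # restore the unsteered model before testing the next layer
```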
To evaluate significance, a two-sample t-test was performed on the bias benchmark scores for each scenario
without steering and with steering applied at layer 20. A two-sample t-test was used instead of an ANOVA
because this project aimed to see how specific biases were affected in specific scenarios, rather than
drawing general conclusions about benchmark scores across all scenarios at once.
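A minimal example of this test is shown below, using SciPy; the score lists are illustrative placeholders rather than the project's actual results.

```python
# Two-sample t-test comparing bias benchmark scores without steering and with
# steering at layer 20. Scores below are placeholders, not real results.
from scipy.stats import ttest_ind

baseline_scores = [0.42, 0.38, 0.51, 0.47]  # placeholder scores, no steering
steered_scores = [0.21, 0.18, 0.27, 0.24]   # placeholder scores, steering at layer 20

t_stat, p_value = ttest_ind(baseline_scores, steered_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: steering significantly changed bias scores.")
```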
These results show bias benchmark scores from Llama 3 (3 billion parameters) for different demographic groups. Racial
and gender-based biases were reduced on all benchmarks, with all results having p < 0.05. This suggests that
the null hypothesis should be rejected.
These results are valuable to AI companies and future researchers. Today's leading models demonstrate
various biases, a problem that many companies currently address with fine-tuning-based approaches. These approaches
require additional labelled data or human feedback (reinforcement learning), which costs additional time and money.
Given that these models are used for everything from medicine administration (Giordano et al., 2021) to job
hiring (Deshmukh & Raut, 2024) today, reducing biases is a major priority. This work can help provide a
cost-effective alternative solution in the form of CAA.
These results are also valuable because they uncover important details about the interpretability of models
like Llama 3. By finding bias-correlated neurons across layers, this project builds on our understanding
of how the model processes information to generate outputs. This result suggests that bias representations
are not localized to a specific part of the network, which is useful for future researchers deciding where
in the model debiasing interventions should be applied.
AI shows various biases on the basis of race and gender. Interpretability-based approaches like contrastive
activation addition (CAA) have shown promise for related tasks.
This project uses CAA together with a custom benchmark covering these high-stakes decision scenarios in order
to measure the difference that CAA can make in the biases of large language models. It also investigates the
neurons most correlated with societal biases to see if neurons that correlate with one bias correlate with others as well.
Results indicate that CAA does cause a significant reduction in biases, and that neurons correlated with
different biases are largely similar to one another. CAA was found to be most effective around layer 20.