This project is important for several reasons. As language models play a growing role in day-to-day life, it is a high priority to keep their decisions as unbiased and fair as possible, and understanding how these models work internally is essential for monitoring such biases and other safety concerns. This project contributes directly to both goals: it evaluates Contrastive Activation Addition (CAA) as a debiasing intervention for language models, and it examines bias-correlated neurons to study how similar they are. By improving our understanding of both topics, it makes a meaningful contribution to the field.
In its evaluations, the project found that CAA significantly reduced bias on several real-world benchmarks. The benchmarks themselves are a novel contribution, as they focus on real-world decision-making tasks. Overall, this project and its conclusions are useful to everyone from AI companies seeking to build fairer models to interpretability researchers studying how biases are represented internally in language models, and they advance the broader goal of making AI models better and fairer for everyone.