Research Proposal

This project matters for two reasons. First, as language models become a larger part of day-to-day life, it is a priority that the decisions they make be as unbiased and fair as possible. Second, understanding how these models work internally makes it possible to monitor them for bias and other safety concerns. The project addresses both: it evaluates Contrastive Activation Addition (CAA) as a debiasing intervention for language models, and it examines bias-correlated neurons to identify what they have in common.

In testing, CAA significantly reduced bias on several real-world benchmarks. The benchmarks themselves are new and focus on real-world decision-making tasks, which makes them a contribution to the field in their own right. The project and its conclusions are useful to a range of audiences, from AI companies looking to build fairer models to interpretability researchers studying how biases are represented internally in language models. Together, these results support the broader goal of making AI models better and fairer for everyone.
