STEM is a class taught by Dr. Kevin Crowthers. The course combines science, technology, engineering, and mathematics through project-based learning. In the first part of STEM, namely STEM I, we conduct research through a long-term independent science fair project, and at the end of our work we present our process and findings at the February Science Fair.
This project extends existing token-level transformer explanation methods to linguistically meaningful phrases by defining a new phrase-level surrogate model fitted through perturbation and alternating optimization. The phrase-level model achieved 58% lower prediction error than the token-level method and correctly identified compositional phrase polarity at 84% accuracy — nearly double the baseline.
Below is a quad chart that provides a high-level visual summary of the project, divided into four quadrants covering the problem, objective, methodology, and main takeaway.
Transformer-based language models are increasingly deployed in high-stakes domains such as healthcare, law, and finance, yet they function as black boxes with no transparent reasoning for their outputs. Existing explainable AI methods attempt to address this by assigning importance scores to input features, but these methods assume that token contributions are independent. This assumption is incompatible with the softmax mechanism in transformer self-attention, which couples all attention weights by forcing them to sum to one. More recent approaches address this by fitting softmax-linked surrogate models that mirror the transformer’s architecture, but they operate only at the token level, missing phrase-level meaning in explanations. This project extended existing token-level explanation methods to linguistically significant phrases by defining a new phrase-level surrogate model. Input sentences were first segmented into phrases using dependency parsing. The model then assigned each phrase two scores: a value score representing the phrase’s absolute contribution to the output, and an importance score representing its relative weight among all other phrases. These scores were fitted by masking entire phrases from the input, observing how the transformer’s output changed, and then optimizing the scores to best predict those changes. On a sentiment classification dataset, the phrase-level model achieved 58% lower prediction error than the original token-level method and correctly identified the polarity of compositional phrases such as negation at 84% accuracy, nearly double the accuracy of summing token-level scores after the fact. These results demonstrate that phrase-level explanations more faithfully capture transformer behavior while producing outputs aligned with how humans naturally read and understand language.
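The masking step described in the abstract (delete whole phrases, rerun the transformer, record how the output changes) can be sketched as follows. This is a minimal illustration, not the project's actual code: `model_fn`, `phrase_spans`, and `max_drop` are hypothetical names, and the real system operates on a full transformer rather than an arbitrary callable.

```python
import itertools

def collect_perturbations(model_fn, tokens, phrase_spans, max_drop=2):
    """Drop every subset of up to `max_drop` phrases, rerun the model on
    the perturbed input, and record (kept-phrase indices, output) pairs.
    These pairs are the targets a phrase-level surrogate is fitted to."""
    n = len(phrase_spans)
    records = []
    for k in range(max_drop + 1):
        for drop in itertools.combinations(range(n), k):
            kept = [i for i in range(n) if i not in drop]
            if not kept:  # need at least one phrase left in the input
                continue
            kept_tokens = [t for i in kept
                           for t in tokens[phrase_spans[i][0]:phrase_spans[i][1]]]
            records.append((kept, model_fn(kept_tokens)))
    return records
```

Here `model_fn` stands in for any wrapper around the transformer, e.g. a sentiment classifier's positive-class probability, and `phrase_spans` would come from the dependency-parse segmentation step.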
Existing explanations for transformer language models stop at the token level, leaving their reasoning unclear. Compositional phrases like “not bad” and “hardly convincing” are split into individual tokens with conflicting scores, producing misleading explanations in high-stakes domains.
Develop a surrogate model that accounts for the transformer’s softmax-linked architecture and explains its decisions at the phrase level, producing explanations that are both more faithful to the model’s behavior and more interpretable to human users.
The phrase-level model achieved a fidelity MSE of 0.031 ± 0.006 at 2 phrases removed, compared to 0.069 ± 0.010 for naive aggregation and 0.085 ± 0.012 for the token-level method. This 58% reduction in prediction error demonstrates that re-fitting the surrogate at the phrase level produces significantly more faithful explanations. In the AOPC deletion benchmark, the phrase-level model (0.872) outperformed all baselines with statistical significance (p < 0.05 vs. naive aggregation). The compositional polarity test revealed the starkest difference: 84% accuracy for phrase-level re-fitting versus 46% for naive aggregation and 38% for token-level, confirming that the method captures meaning that post-hoc grouping misses. Runtime averaged 2.8 ± 0.3 seconds per explanation, comparable to existing methods and approximately 4× faster than SHAP (11.2s).
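The AOPC deletion benchmark reported above follows a standard recipe: delete phrases in order of decreasing attributed importance and average the resulting drop in model output, so a ranking that finds the truly influential phrases scores higher. A minimal sketch, with `model_fn` as a hypothetical wrapper around the classifier:

```python
import numpy as np

def aopc_deletion(model_fn, phrases, scores):
    """Area Over the Perturbation Curve: remove phrases from most to
    least important (per `scores`) and average the output drop."""
    base = model_fn(phrases)
    order = np.argsort(scores)[::-1]  # most important first
    removed, drops = set(), []
    for idx in order:
        removed.add(int(idx))
        kept = [p for i, p in enumerate(phrases) if i not in removed]
        drops.append(base - model_fn(kept))
    return float(np.mean(drops))
```

On a toy additive model this behaves as expected: ranking phrases by their true contribution yields a strictly higher AOPC than the reversed ranking, which is the sense in which the 0.872 figure above separates the phrase-level method from its baselines.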