STEM is a class taught by Dr. Kevin Crowthers. The course combines science, technology, engineering, and mathematics through project-based learning. In the first part of STEM, namely STEM I, we conduct research through a long-term independent science fair project, and at the end of our work we present our process and findings at the February Science Fair.
This project extends existing token-level transformer explanation methods to linguistically meaningful phrases by defining a new phrase-level surrogate model fitted through perturbation and alternating optimization. The phrase-level model achieved 58% lower prediction error than the token-level method and correctly identified compositional phrase polarity at 84% accuracy — nearly double the baseline.
Below is a quad chart that provides a high-level visual summary of the project, divided into four quadrants covering the problem, objective, methodology, and main takeaway.
Transformer-based language models are increasingly deployed in high-stakes domains such as healthcare, law, and finance, yet they function as black boxes with no transparent reasoning for their outputs. Existing explainable AI methods attempt to address this by assigning importance scores to input features, but these methods assume that token contributions are independent. This assumption is incompatible with the softmax mechanism in transformer self-attention, which couples all attention weights by forcing them to sum to one. More recent approaches address this by fitting softmax-linked surrogate models that mirror the transformer’s architecture, but they operate only at the token level, missing phrase-level meaning in explanations. This project extended existing token-level explanation methods to linguistically significant phrases by defining a new phrase-level surrogate model. Input sentences were first segmented into phrases using dependency parsing. The model then assigned each phrase two scores: a value score representing the phrase’s absolute contribution to the output, and an importance score representing its relative weight among all other phrases. These scores were fitted by masking entire phrases from the input, observing how the transformer’s output changed, and then optimizing the scores to best predict those changes. On a sentiment classification dataset, the phrase-level model achieved 58% lower prediction error than the original token-level method and correctly identified the polarity of compositional phrases such as negation at 84% accuracy, nearly double the accuracy of summing token-level scores after the fact. These results demonstrate that phrase-level explanations more faithfully capture transformer behavior while producing outputs aligned with how humans naturally read and understand language.
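The masking step described in the abstract (delete whole phrases, rerun the transformer, record how the output changes) can be sketched as follows. This is a minimal illustration, not the project's actual code: `model_fn`, `phrase_spans`, and `max_drop` are hypothetical names, and the real system operates on a full transformer rather than an arbitrary callable.

```python
import itertools

def collect_perturbations(model_fn, tokens, phrase_spans, max_drop=2):
    """Drop every subset of up to `max_drop` phrases, rerun the model on
    the perturbed input, and record (kept-phrase indices, output) pairs.
    These pairs are the targets a phrase-level surrogate is fitted to."""
    n = len(phrase_spans)
    records = []
    for k in range(max_drop + 1):
        for drop in itertools.combinations(range(n), k):
            kept = [i for i in range(n) if i not in drop]
            if not kept:  # need at least one phrase left in the input
                continue
            kept_tokens = [t for i in kept
                           for t in tokens[phrase_spans[i][0]:phrase_spans[i][1]]]
            records.append((kept, model_fn(kept_tokens)))
    return records
```

Here `model_fn` stands in for any wrapper around the transformer, e.g. a sentiment classifier's positive-class probability, and `phrase_spans` would come from the dependency-parse segmentation step.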
Existing explanations for transformer language models stop at the token level, leaving their reasoning unclear. Compositional phrases like “not bad” and “hardly convincing” are split into individual tokens with conflicting scores, producing misleading explanations in high-stakes domains.
Develop a surrogate model that accounts for the transformer’s softmax-linked architecture and explains its decisions at the phrase level, producing explanations that are both more faithful to the model’s behavior and more interpretable to human users.
The phrase-level model achieved a fidelity MSE of 0.031 ± 0.006 at 2 phrases removed, compared to 0.069 ± 0.010 for naive aggregation and 0.085 ± 0.012 for the token-level method. This 58% reduction in prediction error demonstrates that re-fitting the surrogate at the phrase level produces significantly more faithful explanations. In the AOPC deletion benchmark, the phrase-level model (0.872) outperformed all baselines with statistical significance (p < 0.05 vs. naive aggregation). The compositional polarity test revealed the starkest difference: 84% accuracy for phrase-level re-fitting versus 46% for naive aggregation and 38% for token-level, confirming that the method captures meaning that post-hoc grouping misses. Runtime averaged 2.8 ± 0.3 seconds per explanation, comparable to existing methods and approximately 4× faster than SHAP (11.2s).
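The AOPC deletion benchmark reported above follows a standard recipe: delete phrases in order of decreasing attributed importance and average the resulting drop in model output, so a ranking that finds the truly influential phrases scores higher. A minimal sketch, with `model_fn` as a hypothetical wrapper around the classifier:

```python
import numpy as np

def aopc_deletion(model_fn, phrases, scores):
    """Area Over the Perturbation Curve: remove phrases from most to
    least important (per `scores`) and average the output drop."""
    base = model_fn(phrases)
    order = np.argsort(scores)[::-1]  # most important first
    removed, drops = set(), []
    for idx in order:
        removed.add(int(idx))
        kept = [p for i, p in enumerate(phrases) if i not in removed]
        drops.append(base - model_fn(kept))
    return float(np.mean(drops))
```

On a toy additive model this behaves as expected: ranking phrases by their true contribution yields a strictly higher AOPC than the reversed ranking, which is the sense in which the 0.872 figure above separates the phrase-level method from its baselines.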