Welcome to STEM I!

STEM is a class taught by Dr. Kevin Crowthers. This course combines science, technology, engineering, and mathematics through project-based learning. In the first part of the course, STEM I, we conduct research through a long-term independent science fair project and present our process and findings at the February Science Fair.

Using Novel Phrase-Level Explanations in a Softmax-Linked Additive Explainability Model for Transformers

Project Overview

This project extends existing token-level transformer explanation methods to linguistically meaningful phrases by defining a new phrase-level surrogate model fitted through perturbation and alternating optimization. The phrase-level model achieved 58% lower prediction error than the token-level method and identified the polarity of compositional phrases (such as negation) at 84% accuracy, nearly double the naive baseline.

Quad Chart

Below is a quad chart that provides a high-level visual summary of the project, divided into four quadrants covering the problem, objective, methodology, and main takeaway.

Abstract & Graphical Abstract

Transformer-based language models are increasingly deployed in high-stakes domains such as healthcare, law, and finance, yet they function as black boxes with no transparent reasoning for their outputs. Existing explainable AI methods attempt to address this by assigning importance scores to input features, but these methods assume that token contributions are independent. This assumption is violated by the softmax mechanism in transformer self-attention, which forces all token weights to sum to one. More recent approaches address this by fitting softmax-linked surrogate models that mirror the transformer’s architecture, but they operate only at the token level, missing phrase-level meaning in explanations. This project extended existing token-level explanation methods to linguistically significant phrases by defining a new phrase-level surrogate model. Input sentences were first segmented into phrases using dependency parsing. The model then assigned each phrase two scores: a value score representing the phrase’s absolute contribution to the output, and an importance score representing its relative weight among all other phrases. These scores were fitted by masking entire phrases from the input, observing how the transformer’s output changed, and then optimizing the scores to best predict those changes. On a sentiment classification dataset, the phrase-level model achieved 58% lower prediction error than the original token-level method and identified the polarity of compositional phrases such as negation at 84% accuracy, nearly double the accuracy obtained by summing token-level scores after the fact. These results demonstrate that phrase-level explanations more faithfully capture transformer behavior while producing outputs aligned with how humans naturally read and understand language.

Graphical Abstract
View Research Proposal →

Engineering Need & Objective

Background

Transformers & the Black Box Problem

Explainable AI (XAI) & Feature Attribution

Why Traditional Methods Fail on Transformers

Current State of the Art

Procedure

Methodology Flowchart

Main Steps

  1. Fine-tuned DistilBERT on the IMDB sentiment dataset for binary classification.
  2. Ran existing token-level SLALOM to obtain baseline scores v(t_i) and s(t_i).
  3. Segmented input sentences into phrases using the spaCy dependency parser.
  4. Generated a perturbation dataset by masking entire phrases and recording transformer output changes.
  5. Fitted phrase-level value (Vk) and importance (Sk) scores via alternating OLS/NLS optimization until convergence.
  6. Computed a naive aggregation baseline by summing token scores within phrase boundaries.
  7. Evaluated all methods via fidelity MSE, AOPC deletion, compositional polarity accuracy, and runtime.
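The phrase-masking step (steps 3–4 above) can be sketched as follows. This is a simplified illustration, not the project's actual code: the phrase spans here are hypothetical hand-picked boundaries standing in for the output of spaCy's dependency parser, and `[MASK]` follows BERT-style masking conventions.

```python
def mask_phrase(tokens, phrase_spans, k, mask_token="[MASK]"):
    """Replace every token of phrase k with the mask token."""
    start, end = phrase_spans[k]
    return [mask_token if start <= i < end else tok
            for i, tok in enumerate(tokens)]

def perturbation_dataset(tokens, phrase_spans):
    """One masked variant per phrase; the transformer's output change on each
    variant becomes a regression target when fitting the surrogate scores."""
    return [mask_phrase(tokens, phrase_spans, k)
            for k in range(len(phrase_spans))]

tokens = "the movie was not good".split()
spans = [(0, 2), (2, 3), (3, 5)]  # hypothetical phrase boundaries from the parser
variants = perturbation_dataset(tokens, spans)
```

Masking whole phrases rather than single tokens is the key design choice: it lets the fit observe the joint effect of a compositional unit like "not good" instead of the (non-additive) effects of "not" and "good" separately.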

Key Equations

Eq. 1 — Token-Level (Existing)
F(t) = Σ_i α_i · v(t_i),   where α_i = exp(s(t_i)) / Σ_j exp(s(t_j))
Eq. 2 — Phrase-Level (This Project)
F(t) = Σ_k β_k · V_k,   where β_k = exp(S_k) / Σ_j exp(S_j)
Eq. 3 — Naive Aggregation (Baseline)
V_k^naive = Σ_{i ∈ P_k} α_i · v(t_i)
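Eq. 2 is small enough to compute directly. The sketch below, written with hypothetical inputs, shows how the softmax over the importance scores S couples every phrase's weight to all the others, mirroring the softmax coupling in self-attention that independent-attribution methods ignore.

```python
import math

def surrogate_prediction(V, S):
    """Eq. 2: softmax the importance scores S into weights beta_k,
    then return the weighted sum of the value scores V."""
    exps = [math.exp(s) for s in S]
    total = sum(exps)
    betas = [e / total for e in exps]
    return sum(b * v for b, v in zip(betas, V))

# With equal importance scores, the weights are uniform and the
# prediction reduces to a plain average of the value scores.
uniform = surrogate_prediction([1.0, -2.0, 3.0], [0.0, 0.0, 0.0])

# Raising one phrase's importance score pulls the prediction
# toward that phrase's value score.
skewed = surrogate_prediction([1.0, -2.0, 3.0], [0.0, 5.0, 0.0])
```

Because the weights β_k always sum to one, removing or up-weighting one phrase necessarily redistributes weight across the rest, which is exactly the behavior the phrase-level fit has to capture.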

Results


Analysis

The phrase-level model achieved a fidelity MSE of 0.031 ± 0.006 at 2 phrases removed, compared to 0.069 ± 0.010 for naive aggregation and 0.085 ± 0.012 for the token-level method. This 58% reduction in prediction error demonstrates that re-fitting the surrogate at the phrase level produces significantly more faithful explanations. In the AOPC deletion benchmark, the phrase-level model (0.872) outperformed all baselines with statistical significance (p < 0.05 vs. naive aggregation). The compositional polarity test revealed the starkest difference: 84% accuracy for phrase-level re-fitting versus 46% for naive aggregation and 38% for token-level, confirming that the method captures meaning that post-hoc grouping misses. Runtime averaged 2.8 ± 0.3 seconds per explanation, comparable to existing methods and approximately 4× faster than SHAP (11.2s).
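The AOPC deletion benchmark cited above can be sketched as follows. This is a generic illustration of the metric with made-up probabilities, not the project's evaluation code: AOPC is the average drop in the model's predicted probability as the highest-ranked phrases are deleted one at a time, so a more faithful ranking produces larger drops and a higher score.

```python
def aopc(original_prob, probs_after_deletion):
    """Area Over the Perturbation Curve: mean drop in the model's output
    probability after deleting the top-1, top-2, ... ranked units.
    Higher values indicate a more faithful importance ranking."""
    drops = [original_prob - p for p in probs_after_deletion]
    return sum(drops) / len(drops)

# Hypothetical run: the model's positive-class probability starts at 0.95
# and falls as the explanation's top-ranked phrases are removed in order.
score = aopc(0.95, [0.60, 0.40, 0.20])
```

A ranking that deletes unimportant phrases first would leave the probabilities near 0.95, yielding a score near zero, which is why AOPC discriminates between the phrase-level and naive-aggregation rankings.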

Conclusion & Future Work

February Fair Poster

References