In this class, taught by Dr. Crowthers, we focus on scientific research and engineering. During the first part of the year, we conduct independent research projects that involve reviewing literature, making conjectures, developing methodology, designing experiments, and communicating findings. Our final projects are presented at a school-wide science fair, with the possibility of advancing to regional, state, and international fairs.
Armaan Priyadarshan
Advisor: Kevin Crowthers, Ph.D.
The overall aim of this project is to investigate the impact of data preparation techniques and dataset quality on the performance of Large Language Models (LLMs). The effect of dataset quality on the resulting model will be evaluated, with the expectation that higher-quality data correlates directly with stronger model performance.
LLMs and their training datasets are growing in size, and there is a need for automated methods of ensuring data quality.
Data preparation techniques for LLM datasets will be developed and evaluated by measuring their impact on existing datasets and the resulting language models.
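To illustrate what a preparation step can look like, the following is a minimal Python sketch of two generic, commonly used techniques (exact deduplication and minimum-length filtering). The function name, word threshold, and sample texts are illustrative placeholders, not the specific techniques developed in this project.

```python
# Minimal sketch of two common preparation steps for a text dataset:
# exact deduplication and minimum-length filtering. Illustrative only;
# not necessarily the techniques developed in this project.
import hashlib

def prepare(texts, min_words=20):
    """Drop exact duplicates and documents shorter than `min_words` words."""
    seen = set()
    cleaned = []
    for text in texts:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        if len(text.split()) < min_words:
            continue  # too short to be a useful training example
        seen.add(digest)
        cleaned.append(text)
    return cleaned

if __name__ == "__main__":
    raw = ["short text", "A longer document " * 10, "A longer document " * 10]
    print(len(prepare(raw)))  # 1: one duplicate and one short document removed
```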
A violin plot of completeness scores of the first 100 training examples in the C4 dataset as evaluated by the BERT model and tokenizer for sequence classification
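For context on how such scores can be produced, the sketch below shows one way to score the first 100 C4 training examples with a BERT sequence-classification model and plot the results as a violin plot. The checkpoint name, the two-label head, and the use of the positive-class probability as the completeness score are assumptions for illustration; the project's actual scoring setup may differ.

```python
# Sketch: scoring the first 100 C4 training examples with a BERT
# sequence-classification model, then plotting a violin plot.
# The checkpoint below is a placeholder; a classifier fine-tuned to
# predict completeness would be used in practice.
import torch
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Stream the first 100 training examples of C4 to avoid downloading the full set
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
texts = [example["text"] for _, example in zip(range(100), dataset)]

scores = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores.append(probs[0, 1].item())  # treat P(positive class) as the score

plt.violinplot(scores)
plt.ylabel("Completeness score")
plt.savefig("completeness_violin.png")
```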
A decision matrix for determining which risk assessment variable to target in the data cleaning algorithm
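As a generic illustration of how a decision matrix selects a target, the sketch below weights criteria and scores candidate variables. The criteria, weights, ratings, and the second candidate are hypothetical placeholders, not the values from the project's matrix.

```python
# Sketch of a weighted decision matrix. All criteria, weights, and ratings
# here are hypothetical placeholders used only to show the mechanics.
CRITERIA = {"measurability": 0.4, "expected impact": 0.4, "cost to compute": 0.2}

# Ratings on a 1-5 scale for each candidate risk assessment variable (hypothetical)
CANDIDATES = {
    "completeness": {"measurability": 4, "expected impact": 5, "cost to compute": 3},
    "duplication":  {"measurability": 5, "expected impact": 3, "cost to compute": 4},
}

def weighted_score(ratings):
    return sum(CRITERIA[c] * ratings[c] for c in CRITERIA)

for name, ratings in CANDIDATES.items():
    print(f"{name}: {weighted_score(ratings):.2f}")
print("Target variable:", max(CANDIDATES, key=lambda n: weighted_score(CANDIDATES[n])))
```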
The difference in the reported perplexity of the model trained on raw data versus the model trained on cleaned data
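A minimal sketch of how perplexity can be computed for the two checkpoints on a shared held-out text set follows; the checkpoint directories and evaluation texts are hypothetical placeholders.

```python
# Sketch: perplexity of a causal LM on held-out text.
# "./model_raw" and "./model_cleaned" are hypothetical checkpoint directories.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def perplexity(model_path, texts):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model.eval()
    total_nll, total_targets = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            out = model(**enc, labels=enc["input_ids"])
            n_targets = enc["input_ids"].size(1) - 1  # loss averages over shifted tokens
            total_nll += out.loss.item() * n_targets
            total_targets += n_targets
    return math.exp(total_nll / total_targets)

eval_texts = ["Placeholder held-out evaluation text."]  # hypothetical evaluation set
print("raw:", perplexity("./model_raw", eval_texts))
print("cleaned:", perplexity("./model_cleaned", eval_texts))
```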
The difference in accuracy, F1, recall, and precision of the model trained on raw data versus the model trained on cleaned data
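The sketch below shows how these four metrics can be computed with scikit-learn for the two models on a shared evaluation set; the labels and predictions are hypothetical placeholders.

```python
# Sketch: comparing downstream classification metrics for the two models.
# The labels and predictions below are hypothetical placeholders.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

labels = [1, 0, 1, 1, 0]       # placeholder ground-truth labels
preds_raw = [1, 0, 0, 1, 1]    # placeholder predictions, raw-data model
preds_clean = [1, 0, 1, 1, 0]  # placeholder predictions, cleaned-data model

for name, preds in [("raw", preds_raw), ("cleaned", preds_clean)]:
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    print(f"{name}: accuracy={acc:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")
```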