The Optimization of Large Language Model (LLM) Performance with Data Preparation Techniques

Armaan Priyadarshan

Advisor: Kevin Crowthers, Ph.D.

Description

This project addresses the challenge of ensuring the quality of large language model training datasets, which can reach terabytes or even petabytes in size. At that scale, manually checking for errors becomes impractical. We developed a data cleaning technique tailored to language model datasets, tested it on the WikiText dataset, and found that it significantly improved model performance, underscoring the importance of such techniques for maintaining data quality as datasets continue to scale up.
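The page does not detail the cleaning technique itself, so as an illustration only, here is a minimal sketch of the kind of preprocessing such a pipeline might apply to raw WikiText-style lines. The function name, length threshold, and exact steps are hypothetical, not the project's actual method:

```python
import re
import hashlib

def clean_corpus(lines):
    """Hypothetical minimal cleaning pass: normalize whitespace,
    drop near-empty fragments, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        if len(text) < 20:  # skip headings/fragments too short to train on (threshold is illustrative)
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()  # cheap exact-duplicate check
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Real pipelines at terabyte scale typically replace the in-memory `set` with approximate or distributed deduplication (e.g. MinHash), but the filter-normalize-dedupe structure is the same.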

Grant Proposal

Dataset Analysis

Preliminary dataset analysis conducted in VSCode

Model Training Notebook

Model training in Google Colab

Project Notes