STEM is a course taught by Dr. Crowthers in which we work on an independent research project for the first two-thirds of the year. Along the way, we develop skills such as writing a grant proposal, collecting preliminary data, and conducting background research.
In this project, I researched tools that can be combined with Automatic Speech Recognition (ASR) to build a voice assistant that helps people with disabilities navigate their computers.
Automatic Speech Recognition (ASR) is a subfield within Artificial Intelligence (AI) that is commonly used for speech recognition services such as automatic captions, translations, and virtual assistants. Virtual assistants increase convenience when using devices such as phones and computers, as they can open applications, send text messages, and answer questions. However, modern assistants can be improved to help people whose disabilities hinder them from interacting with their computer. To address this problem, a voice assistant that can interact with apps and websites was created to allow voice-controlled navigation of the computer. Using various Python libraries, website automation services, and app automation services, a voice assistant was built to improve the navigation and interactivity of computers. Google’s speech recognition service was used as the automatic speech recognition model to transcribe commands, and automation services were then used to simulate interaction with an app or website based on those commands. Data was collected on the speech recognition model to determine its transcription accuracy under various background noise levels. With this data, a 1-proportion Z-interval was used to build a confidence interval estimating the assistant’s accuracy under those noisy conditions. The results show that the assistant is usable under background noise levels up to 60 decibels. This application will allow people with disabilities to navigate their computer through virtual assistance, which is crucial as more opportunities for learning, entertainment, and work become available through technology every day.
Click here to see my project proposal
Human-computer interfaces are hardware-based and are often very hard to access for people with disabilities, since those disabilities can make physical interaction with technology a challenge. Voice assistants address this problem, as they can perform tasks such as setting timers, sending text messages, and looking up answers to questions. However, these assistants cannot perform more specific tasks such as clicking buttons on websites.
The goal of this project is to program a cross-platform assistant that can interact with website elements and work with files safely. The audience can be general; however, its aim is to help those with disabilities interact with and navigate their computer.
Artificial Intelligence (AI) is used everywhere in daily life, in the medical field, and in various other settings (Yip et al., 2023; Tinao & Jamisola, 2023). Automatic Speech Recognition (ASR) is a subfield within Artificial Intelligence that is commonly used for transcription services, virtual assistants, and translation devices. Technology is always evolving, and the way we interact with it is changing as well. Voice is one of the easiest and most effective ways of communicating with technology, which led to the concept of virtual assistants, an auditory way to interact with devices. We already rely on virtual assistants to turn the lights in a room on and off, stream music, or act as search engines (Subhash et al., 2020).
The voice assistant could be improved by using a more accurate Automatic Speech Recognition model, as the current Google Speech-to-Text model struggles to recognize speech in noisy environments. One approach is to train the ASR model through spectrum-matched training, which shows higher accuracy in noisy environments (Prodeus & Kukharicheva, 2016). Another step would be greater interactability with local applications. Cross-platform file search could also be implemented, as the current program has only been tested on the Windows operating system. Finally, an intent detection system could trigger the assistant when it is needed by detecting whether the user is addressing it, based on the volume of the request relative to the surrounding sentences in the conversation (Barton, 2015).
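As a rough illustration of that last idea, the sketch below flags an utterance as addressed to the assistant when it is noticeably louder than the recent conversation. This is not part of the current program; the 1.5x threshold and function names are hypothetical.

```python
# Illustrative sketch (not part of the current program) of the relative-volume idea
# from Barton (2015): treat an utterance as a request to the assistant when it is
# noticeably louder than the surrounding conversation. The 1.5x threshold is made up.
from collections import deque

import numpy as np

recent_levels = deque(maxlen=10)  # rolling window of loudness for prior utterances

def rms(samples: np.ndarray) -> float:
    """Root-mean-square loudness of an utterance's audio samples."""
    return float(np.sqrt(np.mean(np.square(samples.astype(np.float64)))))

def is_addressed_to_assistant(samples: np.ndarray, threshold: float = 1.5) -> bool:
    """True if this utterance is ~1.5x louder than the recent conversational average."""
    level = rms(samples)
    baseline = sum(recent_levels) / len(recent_levels) if recent_levels else level
    recent_levels.append(level)
    return level > threshold * baseline
```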
The assistant was programmed in Python because the language has a large number of libraries. Google’s speech recognition service was used to transcribe voice commands, and Selenium was used to simulate website interaction. The user interface (UI) of the assistant was built with the Tkinter library, and the program was written in Visual Studio Code. Audacity was used to vary the background noise level, and text-to-speech tools (NaturalReaders, 2019) were used to generate the test voices.
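The sketch below shows how these tools fit together in a minimal listen-transcribe-act loop, assuming the SpeechRecognition and Selenium libraries named above. The specific commands, element selectors, and URL pattern are illustrative, not the exact ones used in the project.

```python
# Minimal sketch of the assistant's core loop: transcribe a spoken command with
# Google's recognizer, then simulate the matching browser action with Selenium.
import speech_recognition as sr
from selenium import webdriver
from selenium.webdriver.common.by import By

recognizer = sr.Recognizer()
driver = webdriver.Chrome()  # assumes a local ChromeDriver is available

def listen_for_command() -> str:
    """Record one phrase from the microphone and transcribe it with Google's ASR."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # helps in noisy rooms
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return ""  # speech was not understood

def handle(command: str) -> None:
    """Map a transcribed command to a browser action via Selenium."""
    if command.startswith("open "):
        site = command.removeprefix("open ").replace(" ", "")
        driver.get(f"https://{site}.com")
    elif command.startswith("click "):
        label = command.removeprefix("click ")
        driver.find_element(By.PARTIAL_LINK_TEXT, label).click()

while True:
    spoken = listen_for_command()
    if spoken == "stop listening":
        break
    if spoken:
        handle(spoken)
```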
Data on detection proportion was collected to measure what proportion of voices were detected at various levels of background noise. Audacity was used to simulate the background noise, and the voices were generated using text-to-speech tools.
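A measurement like this could be scripted as sketched below, assuming each test clip was exported from Audacity as a WAV file. The folder layout and file names are hypothetical.

```python
# Rough sketch of the detection-proportion measurement: run each noisy clip through
# Google's recognizer and count how many produce any transcription at all.
import glob

import speech_recognition as sr

recognizer = sr.Recognizer()

def was_detected(wav_path: str) -> bool:
    """Return True if the recognizer produces a transcription for the clip."""
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return bool(recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        return False  # recognizer could not find speech in the clip

clips = glob.glob("clips_40db/*.wav")  # hypothetical folder of 40 dB test clips
detected = sum(was_detected(path) for path in clips)
print(f"Detected {detected} of {len(clips)} clips ({detected / len(clips):.0%})")
```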
Data on Word Error Rate was collected to measure the error rate over a large piece of text at various levels of background noise. Audacity was used to generate the background white noise, and the voices were generated using text-to-speech tools.
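The Word Error Rate check could look something like the sketch below. The jiwer library is an assumption (it is not part of the project's stated toolchain), and the reference sentence and file name are illustrative.

```python
# Hedged sketch of the Word Error Rate measurement: transcribe a noisy clip of a
# known passage and compare the transcription to the reference text.
import speech_recognition as sr
from jiwer import wer  # assumption: pip install jiwer

REFERENCE = "open the browser and search for the weather forecast"  # script read by the TTS voice

recognizer = sr.Recognizer()
with sr.AudioFile("passage_50db_noise.wav") as source:  # hypothetical noisy clip
    audio = recognizer.record(source)

try:
    hypothesis = recognizer.recognize_google(audio)
except sr.UnknownValueError:
    hypothesis = ""  # nothing recognized

# An empty transcription counts as a total miss (100% error).
error_rate = wer(REFERENCE.lower(), hypothesis.lower()) if hypothesis else 1.0
print(f"Word Error Rate: {error_rate:.1%}")
```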
Proportion of voices detected under various background noise levels. The detection rate decreases as the background noise gets louder; however, this shows that the model can be used at low background noise levels.
These are the confidence intervals at a 95% confidence level. This means that for a given background noise level, we are 95% confident that the true accuracy lies between the two percentages (for 40 dB, the interval ranged from 81.1% to 100%).
The Word Error Rate of the speech recognition model. Higher white noise levels increase the error rate, but the acoustic model suppresses background noise reasonably well at these levels.
The first graph shows the proportions of voices that were properly detected by the speech recognition model. The proportion of voices detected decreased at high background noise levels, while it was nearly perfect at low background noise levels.
Using the proportions gathered from Graph 1, a 1-proportion Z-interval was computed to estimate the range of the accuracy rate at a 95% confidence level. The results (Figure 3) indicate that at low background noise levels, the proportion of voices recognized ranges from 81% to 100%.
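For reference, the interval calculation follows the standard 1-proportion Z-interval formula, as sketched below. The sample counts are placeholders for illustration, not the project's actual figures; only the 40 dB interval of roughly 81% to 100% is reported above.

```python
# Sketch of the 1-proportion Z-interval: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n),
# with z = 1.96 for a 95% confidence level, clamped to the [0, 1] range.
from math import sqrt

def one_prop_z_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Confidence interval for a proportion of successes out of n trials."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

low, high = one_prop_z_interval(successes=23, n=25)  # hypothetical 40 dB counts
print(f"95% confidence interval: {low:.1%} to {high:.1%}")
```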
The second graph shows the Word Error Rate under different white noise levels. At low noise levels, the acoustic model suppresses the background noise enough that the error rate stays roughly constant across multiple amplitude levels.
Together, these results show that the assistant is highly accurate at low background noise levels and suggest that Google's ASR model provides acceptable accuracy for a Minimum Viable Product.
This project aimed to create a voice assistant to help people with disabilities that limit their movement. The proportion of voices that were properly recognized was measured and used to create confidence intervals. These intervals tell us that, at a 95% confidence level, the accuracy rate falls within a given range. At low background noise levels, which model the environment in which someone would likely use the assistant, the interval is high; for example, at 40 dB the interval ranges from 81% to 100% (Table 1). The objectives of this project were met, as the application can be used under various background noise levels. Word Error Rate (WER) was also measured (Graph 2) under various white noise levels. The WER increased as the background noise increased; however, the acoustic model still removed much of the noise, even at high amplitudes.
A potential limitation of the data is the small sample size, as more samples would yield more reliable intervals. I addressed this by also examining Word Error Rate to determine how much the error rate rises relative to the background noise. Challenges I faced throughout this project mostly involved debugging, as implementing a feature was sometimes difficult. Consulting the documentation for the libraries I used sometimes helped, but it was a lengthy process of trial and error that at times resulted in no progress at all.
Much like past assistants (Appalaraju et al., 2024), this assistant can be used to search for terms. However, many assistants made in the past (Appalaraju et al., 2024; Jain et al., 2021) do not include browsing tools. This project takes inspiration from assistants that have used similar tools to facilitate website interaction (Bajpai et al., 2020).
Future studies could build upon current speech recognition capabilities, as well as noise reduction techniques, to improve the accuracy of ASR models.