STEM I

STEM I: Independent Research Project

STEM I is a course taught by Dr. Crowthers in which we work on an independent research project for the first two-thirds of the year. Along the way, we develop skills such as writing a grant proposal, collecting preliminary data, and conducting research.

Utilizing Automatic Speech Recognition to Create a Desktop Assistant

In this project, I researched tools that can be combined with Automatic Speech Recognition (ASR) to build a voice assistant that helps people with disabilities navigate their computers.

Abstract

Automatic Speech Recognition (ASR) is a subfield of Artificial Intelligence (AI) that is commonly used for speech recognition services such as automatic captions, translations, and virtual assistants. Virtual assistants make devices such as phones and computers more convenient to use, as they can open applications, send text messages, and answer questions. However, modern assistants can still be improved to help people whose disabilities hinder them from interacting with their computers. To address this problem, a voice assistant that can interact with apps and websites was created to allow voice-controlled navigation of the computer. Using various Python libraries, website automation services, and app automation services, a voice assistant was created to improve the navigation and interactivity of computers. Google's speech recognition service was used as the automatic speech recognition model to transcribe commands. Automation services then used these commands to simulate interaction with an app or website. Data was collected on the speech recognition model to determine its transcription accuracy under various background noise levels. From this data, a 1-proportion Z-interval was used to estimate a confidence interval for the assistant's accuracy under those noisy conditions. The results show that the assistant is usable under background noise levels of up to 60 decibels. This application will allow people with disabilities to navigate their computers through virtual assistance, which is crucial as more opportunities for learning, entertainment, and work become available through technology every day.

Graphical Abstract

This is my graphical abstract explaining my project.

Click here to see my project proposal

Engineering Need

Human-computer interfaces are mostly hardware-based and often very hard to access for those who are disabled. Physical disabilities can make using this hardware a challenge. Voice assistants address part of this problem, as they can perform tasks such as setting timers, sending text messages, and looking up questions. However, these assistants can't perform more specific tasks such as clicking on buttons in websites.

Engineering Objective

The goal of this project is to program a cross-platform assistant that can interact with website elements and safely work with files. The audience for this can be general; however, its aim is to help those with disabilities navigate and interact with their computers.

Background

Artificial Intelligence (AI) is used everywhere in daily life, the medical field, and various other settings (Yip et al., 2023; Tinao & Jamisola, 2023). Automatic Speech Recognition (ASR) is a subfield within Artificial Intelligence that is commonly used for transcription services, virtual assistants, and translation devices. Technology is always evolving, and the way we interact with it is changing as well. Voice is one of the easiest and most natural ways of communicating with technology, which led to the concept of virtual assistants, an auditory way to interact with devices. We already rely on virtual assistants to turn the lights in a room on and off, stream music, or search for information (Subhash et al., 2020).

How Automatic Speech Recognition Works

ASR is the driving principle behind speech recognition. It complements another component of speech recognition, Natural Language Processing (NLP), which allows virtual assistants not only to comprehend the meaning of what a user has stated but also to infer implicit nuances (Bajpai et al., 2024). Natural Language Processing builds a better contextual understanding of user queries through semantic analysis. Automatic Speech Recognition systems first record speech and save it as a file; the file is then cleaned of background noise and analyzed sequentially. Probability tests are applied to recognize the words that make up the input, and the system finally produces an output in the form of text (Subhash et al., 2020).
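As a rough illustration of this record, clean, and transcribe pipeline, the sketch below uses the speech_recognition Python library (the same library described later in the Tools section) with Google's recognizer. The ambient-noise adjustment and error handling shown here are illustrative assumptions, not the project's exact code.

```python
# A minimal sketch of the record -> clean -> transcribe pipeline using the
# speech_recognition library; the specific settings are illustrative.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Sample the ambient noise so quieter background sound is filtered out.
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Listening...")
    audio = recognizer.listen(source)

try:
    # Google's recognizer scores candidate transcriptions and returns the
    # most probable text, mirroring the probability step described above.
    text = recognizer.recognize_google(audio)
    print("Transcribed:", text)
except sr.UnknownValueError:
    print("Speech was not understood.")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```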

Past Use of Automatic Speech Recognition

Virtual voice assistants are a common application of Automatic Speech Recognition. Most people have used Siri, Alexa, Cortana, or Google Assistant to search for information, send text messages, or play music. A valuable application of voice assistants is helping those with restricted movement who find it difficult to use technology: verbal communication with a virtual assistant can accomplish tasks without the user having to touch the device. Voice assistants also make our lives more convenient by reducing the time it takes to complete tasks. Today, voice assistants are integrated into many devices in our lives, such as cell phones and computers, and are available to the general public. Some assistants are hardware-based and made to do one thing, like Alexa's wall clock, and others are software-based, like the assistants found in a phone (Singh et al., 2022). While they are great for retrieving information, there is still room for improvement when it comes to interactivity. Common voice assistants like Siri often include simple app and web interaction features that can retrieve information, such as searches, but they give users little ability to interact directly with websites. This kind of feature could be significantly helpful to those who have limited mobility.

Tools

The goal of this project was to fix weaknesses that are found in voice assistants. These weaknesses were addressed using the Python programming language, which has a large ecosystem of resources and libraries that were utilized for this project. The software can be used by anyone; however, it is meant to aid those with restricted mobility due to diseases and accidents. The program is split into two main components: speech recognition and local interaction. The first component uses the speech_recognition Python library, which lets the user speak and then returns the content in text form. It applies the same concept of Automatic Speech Recognition and helps identify commands. The second component is more complicated, as there are many forms of local interaction. The program uses the Python libraries webbrowser and PyAutoGUI for tab control and keybind access, and Selenium WebDriver to interact with web elements. At times, the program speaks back to the user, which is done using the pyttsx3 library. The input can be a standard user microphone, and the output can be a standard speaker.
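A hedged sketch of how the two components might fit together is shown below: a transcribed command is routed to whichever automation library handles it. The command phrases, URLs, and element locator are hypothetical placeholders, not the project's actual command set.

```python
# A simplified command dispatcher; phrases, URLs, and locators are examples.
import webbrowser
import pyautogui
import pyttsx3
from selenium import webdriver
from selenium.webdriver.common.by import By

speaker = pyttsx3.init()

def respond(message: str) -> None:
    """Speak a short confirmation back to the user."""
    speaker.say(message)
    speaker.runAndWait()

def handle_command(command: str) -> None:
    command = command.lower()
    if command.startswith("open website"):
        # webbrowser opens a page in the default browser.
        webbrowser.open("https://www.example.com")  # placeholder URL
        respond("Opening the website.")
    elif command.startswith("new tab"):
        # PyAutoGUI simulates the Ctrl+T keyboard shortcut (Windows layout).
        pyautogui.hotkey("ctrl", "t")
        respond("Opened a new tab.")
    elif command.startswith("click search"):
        # Selenium drives its own browser session and clicks page elements.
        driver = webdriver.Chrome()
        driver.get("https://www.example.com")  # placeholder URL
        driver.find_element(By.NAME, "q").click()  # hypothetical locator
        respond("Clicked the search box.")
    else:
        respond("Sorry, I did not recognize that command.")
```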

Future Steps

The voice assistant could be improved by creating a more accurate Automatic Speech Recognition model, as the current Google speech-to-text model struggles to recognize speech in noisy environments. One approach is to train the ASR model through spectrum-matched training, which shows higher accuracy in noisy environments (Prodeus & Kukharicheva, 2016). Another step is greater interactivity with local applications. Cross-platform file searches could also be implemented, as the current program has only been tested on a Windows operating system. Another feature that could be added is an intention detection system that triggers the assistant when it is needed, for example by checking whether a request is spoken noticeably louder than the surrounding sentences in the conversation (Barton, 2015); a rough sketch of this idea is shown below.
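The sketch below is one possible, simplified interpretation of that volume-based intention detection: a phrase is treated as directed at the assistant when its loudness clearly exceeds a rolling average of recent speech. The window size and the 1.5x threshold are made-up values, not figures from the cited work.

```python
# Flag a phrase as intended for the assistant when it is noticeably louder
# than the running average of the surrounding conversation.
from collections import deque

import numpy as np

recent_levels = deque(maxlen=20)  # rolling window of recent phrase volumes

def rms_level(samples: np.ndarray) -> float:
    """Root-mean-square amplitude of one phrase of audio samples."""
    return float(np.sqrt(np.mean(np.square(samples.astype(np.float64)))))

def is_intended_for_assistant(samples: np.ndarray, ratio: float = 1.5) -> bool:
    """Return True if this phrase is clearly louder than the recent average."""
    level = rms_level(samples)
    baseline = np.mean(recent_levels) if recent_levels else level
    recent_levels.append(level)
    return level > ratio * baseline
```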

Infographic

This is the Infographic of my background research.

Procedure

The assistant was programmed in Python because the language has a large number of libraries. Google's speech recognition service was used to transcribe voice commands, and Selenium was used to simulate website interaction. The user interface (UI) of the assistant was programmed using the Tkinter library, and the program was written in Visual Studio Code. Audacity was used to vary the background noise level, and text-to-speech tools (NaturalReaders, 2019) were used to generate the test voices.
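As a minimal sketch of the Tkinter portion, the window below has a button that captures one spoken command and a label that displays the transcript. The layout and widget choices are illustrative assumptions, not the project's actual interface.

```python
# A small Tkinter window wired to one round of speech recognition.
import tkinter as tk

import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_once() -> None:
    """Capture one phrase from the microphone and show its transcript."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    try:
        transcript.set(recognizer.recognize_google(audio))
    except (sr.UnknownValueError, sr.RequestError):
        transcript.set("Could not transcribe the command.")

root = tk.Tk()
root.title("Voice Assistant")
transcript = tk.StringVar(value="Press Listen and speak a command.")
tk.Button(root, text="Listen", command=listen_once).pack(padx=20, pady=10)
tk.Label(root, textvariable=transcript, wraplength=300).pack(padx=20, pady=10)
root.mainloop()
```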

Data for proportion accuracy was collected to measure what proportion of voices was detected at various levels of background noise. Audacity was used to simulate the background noise, and the voices were generated using text-to-speech tools.

Data for Word Error Rate was collected to measure the error rate within a large piece of text at various levels of background noise. Audacity was used to generate the background white noise, and the voices were generated using text-to-speech tools.
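For reference, Word Error Rate can be computed as the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below shows one way to calculate it; the sample sentences are made up for illustration.

```python
# Word Error Rate = (substitutions + insertions + deletions) / reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("open a new tab", "open new tap"))  # 0.5
```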

Infographic

This is my procedure for data collection.

Figure 1

This is my data using proportions as a measurement of transcription accuracy.

Proportion of voices detected under various background noise levels. The detection rate decreases as the background noise gets louder, but the results show that the model can be used under low background noise levels.

Figure 2

This is my table of confidence intervals.

These are the confidence intervals at a 95% confidence level. This means that, for a given background noise level, we are 95% confident that the true accuracy falls between the two percentages shown (for 40 dB, the interval ranged from 81.1% to 100%).

Figure 3

This is my data using Word Error Rate as a transcription metric.

The Word Error Rate of the speech recognition model. Higher white noise levels increase the error rate of the speech recognition model, but the acoustic model filters out background noise reasonably well at these background noise levels.

Analysis

The first graph shows the proportion of voices that were properly detected by the speech recognition model. The proportion of voices detected decreased at high levels of background noise, while it was nearly perfect at low levels of background noise.

Using the proportions gathered from Figure 1, a 1-proportion Z-interval was conducted to predict the range of the accuracy rate at a 95% confidence level. The results (Figure 2) tell us that at low background noise levels, the proportion of voices that were recognized ranges from 81% to 100%.
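For context, a 1-proportion Z-interval is p̂ ± z*·√(p̂(1 − p̂)/n), where p̂ is the observed detection proportion, n is the number of trials, and z* ≈ 1.96 at a 95% confidence level. The sketch below computes such an interval; the counts used are placeholder values, not the project's actual sample sizes.

```python
# 1-proportion Z-interval: p-hat +/- z* * sqrt(p-hat * (1 - p-hat) / n).
import math

def one_proportion_z_interval(successes: int, n: int, z_star: float = 1.96):
    p_hat = successes / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1] since a proportion cannot fall outside that range.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

low, high = one_proportion_z_interval(successes=19, n=20)  # hypothetical counts
print(f"95% CI: {low:.1%} to {high:.1%}")
```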

The second graph (Figure 3) shows the Word Error Rate under different white noise levels. At low noise levels, the acoustic model suppresses the background noise consistently enough that the error rate stays roughly the same across multiple amplitude levels.

Together, these results show that the assistant has high accuracy at low background noise levels and suggest that Google's ASR model provides decent accuracy for a Minimum Viable Product.

Discussion

This project aimed to create a voice assistant to help people with disabilities that limit their movement. The proportion of voices that were properly recognized was measured and used to create confidence intervals. At a 95% confidence level, these intervals predict the range in which the true accuracy rate falls. They show that at low background noise levels, which model the environments in which someone would actually use the assistant, the accuracy is high; at 40 dB, for example, the interval ranges from 81% to 100% (Figure 2). The objectives of this project were met, as the application can be used under various background noise levels. Word Error Rate (WER) was also measured under various white noise levels (Figure 3). The WER increased as the background noise increased; however, the acoustic model still filtered much of the noise out, even at high amplitudes.

A potential limitation of the data is the low sample size, as more samples would produce more precise intervals. I addressed this by also measuring Word Error Rate to find out how much the error rate rises relative to the background noise. Challenges I faced throughout this project mostly involved debugging, as it was sometimes difficult to implement a feature. Consulting the documentation for the libraries I used sometimes helped, but it was a lengthy process of trial and error that at times resulted in no progress at all.

Much like past assistants (Appalaraju et al., 2024), this assistant can be used to search for terms. However, many assistants made in the past (Appalaraju et al., 2024; Jain et al., 2021) do not include browsing tools. This project takes inspiration from assistants that have used similar tools to facilitate website interaction (Bajpai et al., 2020).

Future Research

Future studies could build upon current speech recognition capabilities, as well as noise reduction techniques, to improve the accuracy of ASR models.

References

February Fair Poster