What Is ASR?
Updated: October 23, 2023
When you think of artificial intelligence, what do you think of? You might think of self-driving cars or the facial recognition software you use to unlock your device. No matter your familiarity level, artificial intelligence has become increasingly prevalent in our everyday lives, including speech recognition. So, what is ASR?
Artificial intelligence, commonly known as AI, is all around us. It uses machine learning to perform tasks and solve problems the way a human would. AI systems help make our lives easier by making faster decisions, helping with repetitive tasks, and taking calculated risks.
So, where does ASR come into play? If artificial intelligence is the tree, then ASR is one of its branches. AI is the broad field, and ASR is a subset of it.
In this post, we’ll go over what ASR is and how it’s used, particularly when it comes to captioning your content. Let’s dive in!
What is ASR? A Broad Overview:
ASR, or automatic speech recognition, is the process of a computer transcribing audio into text.
It was once expensive to access ASR software, but thanks to technological advancements, it’s become more affordable and accessible than ever before. You can find ASR technology in many of the apps we use today, like Zoom and TikTok. These applications use your voice to create captions that overlay your videos.
Another example of ASR is automated customer service over the telephone. Think of when you call your bank; you usually have to go through a series of questions with an automated rep before you speak to a human.
These different examples showcase the two main types of ASR: directed dialogue and natural language processing (NLP).
Directed dialogue is the simpler version of ASR. The speech recognition system accepts only a limited number of words as responses. In the bank call example, the automated rep might give you a list of requests such as hours of operation, updating account information, or speaking to a customer service agent. As the caller, you’ll only be able to choose from the given options. The ASR isn’t advanced enough to take other, more complicated requests.
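To make the idea concrete, here is a minimal Python sketch of directed dialogue. It assumes the audio has already been transcribed to text, and the menu keywords and action names are hypothetical, chosen only to mirror the bank example above. It is an illustration of the concept, not how any particular phone system is built.

```python
# Hypothetical menu for an automated phone line.
# Keys are the only words the system listens for; values are made-up action names.
MENU = {
    "hours": "hours_of_operation",
    "update": "update_account_information",
    "agent": "speak_to_agent",
}


def route_response(transcribed_text):
    """Return the menu action for a caller's reply, or None if no option matches."""
    for word in transcribed_text.lower().split():
        cleaned = word.strip(".,!?")  # drop punctuation attached to the word
        if cleaned in MENU:
            return MENU[cleaned]
    # Anything outside the fixed menu is simply not understood.
    return None


print(route_response("I'd like to speak to an agent, please"))
print(route_response("Can I dispute a charge?"))
```

The second request falls outside the fixed menu, so the function returns `None`, which is the limitation described above: the caller can only choose from the options the system already knows.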
NLP, on the other hand, is the more sophisticated version of ASR that allows the user to have more open-ended conversations – similar to how humans communicate.
On average, an NLP ASR system has a vocabulary of 60,000 or more words. It would be extremely inefficient for the system to process every single word, so it selects specific keywords and uses them to give context to longer requests.
An example of this is Apple’s voice-controlled digital assistant, Siri. If you ask Siri, “what’s the weather today?”, it’ll likely select “weather” as the main keyword and proceed to share the day’s forecast. This allows the system to process requests more efficiently.
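The keyword idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the keyword table and intent names below are invented for the example, and this is not a description of how Siri or any real assistant is implemented.

```python
# Hypothetical keyword-to-intent table; a real assistant is far more involved.
INTENT_KEYWORDS = {
    "weather": "get_forecast",
    "timer": "set_timer",
    "call": "place_call",
}


def detect_intent(request):
    """Return (keyword, intent) for the first known keyword, else (None, None)."""
    # Lowercase the request and strip punctuation and quotes from each word.
    tokens = [t.strip(".,!?\"'“”’") for t in request.lower().split()]
    for token in tokens:
        if token in INTENT_KEYWORDS:
            return token, INTENT_KEYWORDS[token]
    return None, None


keyword, intent = detect_intent("What's the weather today?")
print(keyword, intent)
```

Rather than interpreting every word, the sketch scans for one word it recognizes and acts on that, which is the efficiency trade-off the paragraph above describes.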
Humans and Technology: The Best of Both Worlds
Some of the best ASR systems can achieve an accuracy rate of 80%. However, this is only possible if audio conditions align perfectly – which is easier said than done. As audio conditions worsen, the accuracy rate quickly diminishes.
An 80% accuracy rate might be sufficient for personal assistants, like Siri, but when it comes to professional captioning and transcription, ASR alone doesn’t measure up. Humans are still needed in the captioning process.
Relying solely on ASR for captioning just doesn’t cut it. Captioning is a complex process that often involves multiple speakers, accents, and non-speech elements, all of which can prevent ASR software from accurately picking up what’s being said in the audio. Perfect audio conditions would exclude these elements, and that is highly unlikely in practice.
In normal circumstances, a transcript produced by ASR alone will contain a number of errors. These are the most common sources of ASR errors:
- Speaker labels
- Punctuation, grammar, and numbers
- Non-speech elements
- [INAUDIBLE] tags
- Multiple speakers or overlapping speech
- Background noise or poor audio quality
- False starts
- Acoustic error
One area where ASR frequently falls short is small “function” words, which are important in conveying meaning in speech. Compare the sentences “I can’t go with you” and “I can go with you.” One small error can drastically change the meaning of a conversation. With humans, the chances of these common mistakes decrease substantially, since we are able to use nuance and context clues, something the technology isn’t yet advanced enough to do.
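One common way to put a number on transcript errors is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the ASR output into the correct transcript, divided by the number of words in the correct transcript. This post doesn’t say how its accuracy figures are calculated, so treating accuracy as 1 minus WER is an assumption made here for illustration. A minimal Python sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                       # word missing from the hypothesis
                curr[j - 1] + 1,                   # extra word in the hypothesis
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("I can't go with you", "I can go with you")
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")
```

Here a single wrong word out of five gives a 20% word error rate, yet the meaning of the sentence is reversed. A score like this counts every word equally, so it can’t tell you whether the one mistake was harmless or critical.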
With ASR and humans, you get the best of both worlds. In the next section, we’ll cover 3Play Media’s approach to captioning and how leveraging humans and technology creates a recipe for greatness!
Captioning The 3Play Way
At 3Play Media, technology plays an important role in the captioning process. Our patent-pending 3-step process combines ASR technology and professional human editors to streamline how we caption your content.
ASR is the first step of 3Play’s captioning process. Once a file is uploaded into our account system, the ASR goes through the file and creates a rough draft.
3Play’s ASR engine outperforms most software on the market, including Google, IBM Watson, Rev’s Temi, and Trint. Our software averaged a 90.91% accuracy rate, while the others averaged between 80% and 89%.
After the ASR pass, the second step is human editing: a transcriber reviews the transcript and cleans up the draft where needed.
Finally, a quality assurance (QA) manager reviews the transcript a final time to ensure the highest level of accuracy.
We guarantee at least a 99% accuracy rate on all of your files because we understand how critical accuracy is to the captioning process. Not only does it ensure that your content is accessible to d/Deaf and hard of hearing viewers, but it also ensures that your organization is in compliance with major accessibility laws.
As a company that is constantly evolving and innovating our features and services, we always want to use the best technology. Every year we publish the “State of Automatic Speech Recognition” report to test the most popular ASR technologies on the market and how our technology compares. Check out the full report below to uncover the current state of ASR in regard to captioning accuracy!