How do 3Play’s Live Captions Compare to Zoom’s Built-in Captions?
Updated: July 9, 2021
Artificial intelligence-based automatic speech recognition (ASR) is one step in 3Play Media’s innovative transcription process, and it also powers our live captioning solution. As a result, we’re deeply invested in following trends in the ASR industry to make sure our solutions are powered by the technology that delivers the most accurate results possible. Each year, 3Play Media releases a report on the state of automatic speech recognition, testing many of the leading speech recognition technologies on the market to confirm we’re powering our solutions with top-of-the-line technology.
In November 2020, Zoom announced a collaboration with Otter.ai to provide live transcription and captioning for Zoom meetings. Users with both Zoom Pro and Otter for Teams receive this feature at no additional per-minute cost.
With all the attention surrounding this announcement, we felt compelled to investigate the accuracy of Otter’s real-time ASR. How does it stack up against 3Play’s live captioning solution, and what does any difference in accuracy mean for caption quality and understandability?
Download the 2020 Annual State of ASR Report
Our Investigation
In order to test both Otter and our own ASR provider, Speechmatics v2 Real-Time, we collected video content that was representative of the type of content our customers ask us to transcribe. We sourced this content from a diverse set of domains to make sure we were covering as many customer use cases as possible.
The content fell across the categories of education, health, sports, news, entertainment, and corporate video. In total, we used a little over four hours of content which contained over 30,000 spoken words.
We used the audio from these files to generate Otter-powered transcripts. Then, we used the same audio with 3Play Media’s live auto-captioning solution to generate Speechmatics-powered transcripts.
We used 3Play Media’s 99% accurate transcription process to generate “truth” transcripts and ran these through an additional step of human review to ensure extremely high quality. Then, we used these transcripts to score the accuracy of both the Otter auto-captions and the 3Play Media auto-captions.
Results
We measure accuracy using a standard metric called word error rate (WER). WER is a percentage calculated by dividing the number of errors in a transcript by the number of words in the “truth” transcript. In other words, with a WER of 10%, you would expect to see one error for every 10 words spoken.
The errors encountered fall into three categories. Substitution errors are incorrectly recognized words, where the correct word was “substituted” with an incorrect one. Insertion errors are extra words recognized by the ASR that aren’t actually present in the speech. Finally, deletion errors are words that were missed or omitted completely by the ASR.
This error count does not include errors in punctuation or formatting. An error rate that includes punctuation and formatting errors is called formatted error rate, or FER.
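To make the metric concrete, here is a minimal sketch of how WER can be computed by aligning a hypothesis transcript against a “truth” transcript with edit-distance dynamic programming. This is a generic illustration, not 3Play’s internal scoring tooling:

```python
# Minimal WER sketch: align a hypothesis transcript against a "truth"
# transcript and count substitutions, insertions, and deletions.
# Illustrative only -- not 3Play's actual scoring pipeline.

def tokenize(text: str) -> list:
    # Lowercase and strip surrounding punctuation, since WER (unlike FER)
    # ignores punctuation and formatting.
    return [w.strip(".,!?;:") for w in text.lower().split()]

def wer(truth: str, hypothesis: str) -> dict:
    ref = tokenize(truth)
    hyp = tokenize(hypothesis)

    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all remaining reference words deleted
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all remaining hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub_cost,  # match / substitution
                          d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1)             # insertion

    # Walk the table backwards to split the distance into S, I, D counts.
    i, j, subs, ins, dels = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return {"wer": (subs + ins + dels) / max(len(ref), 1),
            "substitutions": subs, "insertions": ins, "deletions": dels}

# One of the real error examples from later in this post:
print(wer("The more honest you can be, the more helpful it is.",
          "for honest you can be more helpful it is."))
```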
| ASR Engine | % Error | % Substitution | % Insertion | % Deletion |
| --- | --- | --- | --- | --- |
| Speechmatics v2 Real-Time | 16.33 | 6.90 | 5.15 | 4.28 |
| Otter.ai | 22.07 | 8.51 | 4.37 | 9.19 |
3Play’s Speechmatics-powered captioning solution outperformed Zoom’s Otter-powered solution with a 26% lower word error rate.
Function Words
Function words are words that perform a grammatical function rather than introducing meaning into a sentence. Examples include words like “the,” “do,” “and,” and “can.” These words fill very important roles and can change the meaning of a sentence. They’re also often misrecognized by ASR because they are frequently shortened or reduced in speech.
For example, the words “can” and “can’t” sound very similar, but mean completely opposite things. After analyzing substitution errors from both engines, we found that Otter.ai was twice as likely as Speechmatics v2 Real-Time to mix up “can” and “can’t”.
The table below shows the rate at which each of these function words was substituted for some other, incorrect word, for each vendor.
| Word | Speechmatics v2 Real-Time | Otter.ai |
| --- | --- | --- |
| the | 2.74% | 3.27% |
| a | 3.76% | 4.01% |
| do or don’t | 3.49% | 4.20% |
| can or can’t | 2.04% | 2.72% |
Speechmatics v2 Real-Time performed better for all function words evaluated.
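For readers curious how a table like this might be produced, here is a hedged sketch: given word-level alignment pairs (such as those recoverable from the WER backtrace above), count how often each target function word in the “truth” transcript was replaced by some other word. The alignment format and function names here are assumptions for illustration, and unlike the published table, this sketch doesn’t group related pairs like “do or don’t”:

```python
from collections import Counter

# Hypothetical input: (ref_word, hyp_word) alignment pairs, where
# hyp_word is None for a deletion. Illustrative only.
def substitution_rates(alignments, targets):
    seen = Counter()         # occurrences of each target word in "truth"
    substituted = Counter()  # times it was replaced by a different word
    for ref_word, hyp_word in alignments:
        if ref_word in targets:
            seen[ref_word] += 1
            if hyp_word is not None and hyp_word != ref_word:
                substituted[ref_word] += 1
    return {w: substituted[w] / seen[w] for w in targets if seen[w]}

pairs = [("can", "can't"), ("can", "can"), ("the", "a"), ("the", "the")]
print(substitution_rates(pairs, {"the", "a", "do", "don't", "can", "can't"}))
# {'the': 0.5, 'can': 0.5}  (dict order may vary)
```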
Discover How ASR Engines Impact Caption Quality ➡️
What does this mean for you?
For a 12-word sentence, a 16% error rate results in an average of 1.92 errors, while a 22% error rate averages 2.64 errors.
Put another way, a 16% error rate means users will see an error about once every 6.25 words, while a 22% error rate means they will see one about every 4.55 words.
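Those figures are simple expected-value arithmetic, assuming errors are spread evenly through speech:

```python
# Expected errors per sentence and average spacing between errors,
# assuming errors are distributed evenly across spoken words.
sentence_words = 12
for rate in (0.16, 0.22):
    expected = rate * sentence_words  # errors per 12-word sentence
    spacing = 1 / rate                # average words between errors
    print(f"WER {rate:.0%}: {expected:.2f} errors per {sentence_words}-word "
          f"sentence, one error about every {spacing:.2f} words")
```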
Error Examples
The level of disruption to understandability caused by an error can vary greatly. The errors below come from a mixture of both speech recognition engines. For anyone weighing the importance of accuracy when choosing a live captioning vendor, these examples demonstrate how much impact a single error can have on captions.
Error
- “… the difference between deductive and inductive influences.”
Correction
- “… the difference between deductive and inductive inferences.”
Error
- “for honest you can be more helpful it is.”
Correction
- “The more honest you can be, the more helpful it is.”
Error
- “… and we lose.”
Correction
- “… and then we lose.”
Error
- “Barbie queuers Morgan’s borders…”
Correction
- “Barbequers, smorgasborders…”
Error
- “The privilege only extends to fax.”
Correction
- “The privilege only extends to facts.”
Error
- “… what size window guard you need.”
Correction
- “… what size window guards you need.”
Everything You Need to Know About ASR Technologies ➡️
Deletions
One place we found that 3Play’s solution particularly stood out was the rate of deletion errors. Otter.ai had over twice as many deletion errors as Speechmatics v2 Real-Time, “deleting” or omitting almost one in every 10 spoken words.
This error type in particular can really impact participants who rely on captions as an accommodation. When words are omitted from the captions, users might miss not only the content but also the fact that anything was said at all. Additionally, if you use real-time transcription to generate meeting notes and transcripts for later reference, important information could be missing from the resulting transcript and be forgotten.
Achieving the highest accuracy
At 3Play Media, we believe that accuracy is crucial. Live captions can only create engagement, equal access, or improved understanding if they are sufficiently accurate. We are committed to seeking out the highest quality technology to ensure that our customers are getting the greatest benefit they can from our captioning.
No matter what method you are using to live caption your videos, following some best practices can help you optimize the resulting accuracy.
This blog post was written by Tessa Kettelberger, Research and Development Engineer at 3Play Media.