Artificial Intelligence Is Good, but Is It Good Enough for Captions?
Updated: August 28, 2019
“Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs, and perform human-like tasks.” In other words, rather than being programmed with a single fixed rule for every output, machines are taught to recognize patterns and then use those patterns to make decisions somewhat like a human might.
AI has many applications, and as it continues to develop alongside the ever-growing volume of video that requires captioning, we are often asked: when will there be a viable, fully automated solution for closed captioning?
In this post, we will take a look at the implications of artificial intelligence and machine learning for automatic speech recognition, and how this may be used for automatic captioning.
The Current State of AI
In recent years we’ve seen the rise of digital assistants like Siri and Alexa, chatbots, personalized search results (that are sometimes so good it’s creepy!), and an increased reliance on speech recognition to dictate text messages, emails, and so on. But what does the current state of AI mean for automatic speech recognition (ASR)?
Understanding AI, ML, and ASR
Before we discuss the relationship between artificial intelligence, machine learning, and automatic speech recognition, let’s take a look at what each of these terms means:
- Artificial Intelligence (AI): refers to intelligence demonstrated by machines
- Machine Learning (ML): allows machines to learn outputs from previous experience
- Automatic Speech Recognition (ASR): converts spoken words into written text
If we put these all together, it’s easier to understand that AI is the overarching discipline that refers to making machines “smart.” Machine Learning refers to systems that can learn by themselves from experience. ML is not the same as AI, but rather is a subset of AI. Most AI work now involves ML because intelligent behavior requires considerable knowledge, and learning is the easiest way to get that knowledge.
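To make the distinction concrete, here is a minimal sketch of the “learning from experience” idea, written in Python with scikit-learn (our choice for illustration; this post doesn’t prescribe any particular tool). Rather than hand-coding a rule for every phrase, we hand the model labeled examples and let it infer the pattern:

```python
# A minimal sketch of "learning from experience": instead of writing an
# explicit rule for every phrase, we give the model labeled examples and
# let it infer one. scikit-learn is our choice for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# "Experience": past utterances labeled with the intent they expressed.
examples = ["play some music", "turn the music on",
            "what is the weather", "will it rain today",
            "set a timer", "start a ten minute timer"]
labels = ["music", "music", "weather", "weather", "timer", "timer"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(examples, labels)  # learn word patterns from the examples

# A phrase the model has never seen; it generalizes from experience.
print(model.predict(["is it going to rain"]))  # -> ['weather']
```

The model was never given a rule for “is it going to rain,” yet it classifies the phrase correctly by generalizing from the examples it saw: exactly the experience-driven behavior described above.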
One area of high achievement is ASR: automatically transcribing voice recordings into the written word. In the rest of this post, we will look at the applications in which ASR has been used (and how well it performs), including, of course, captioning.
Applications of ASR
For ASR to work, the system must effectively anticipate everything a speaker might say. We have seen successes in specific applications where the input space is very narrowly constrained. For example, digital assistants like Siri and Alexa work sufficiently well because the vocabulary is limited to specific tasks and commands. In other applications, larger vocabularies have posed a challenge.
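To see why a narrow input space helps, consider this minimal sketch: with only a handful of valid commands, even a noisy transcript can be snapped to the closest one. The command list and similarity cutoff below are hypothetical, chosen purely for illustration:

```python
# Why a constrained vocabulary is easier: with only a few valid commands,
# a noisy ASR hypothesis can be "snapped" to the closest one.
# The command list and cutoff are hypothetical, for illustration only.
import difflib

COMMANDS = ["turn on the lights", "turn off the lights",
            "set a timer", "play music", "stop"]

def interpret(asr_hypothesis, cutoff=0.6):
    """Map a raw ASR transcript to the closest known command, if any."""
    matches = difflib.get_close_matches(
        asr_hypothesis.lower(), COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(interpret("turn of the light"))    # -> 'turn off the lights'
print(interpret("she sells seashells"))  # -> None (no valid command nearby)
```

Open-ended speech offers no such short list to fall back on, which is exactly where larger vocabularies begin to hurt.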
ASR for Captioning
Before we talk about captioning, think for a moment about self-driving vehicles. The notable failures in automated driving have been due precisely to unexpected visual input, which is also what happened with Google’s infamous image-tagging error. If you think about programming a car to predict every possible situation it might encounter on the road and to adjust to each of them, it feels nearly impossible. The same is true for ASR: as the vocabulary grows, the task gets more complicated. While there have been many recent improvements in ASR, the captioning task is far more complicated than the tasks where those improvements have been made.
In 2015 the National Association of the Deaf (NAD) sued Harvard University for allegedly failing to caption public online video content and for providing inaccurate closed captions where they did exist. Specifically, the lawsuit noted that “Much of Harvard’s online content is either not captioned or is inaccurately or unintelligibly captioned.” These “inaccurate” and “unintelligible” captions were produced by ASR.
We recently saw an example from a NASA launch of why automatic captions just don’t cut it. The images below show the incorrect automatic captions alongside what was actually said. While we agree the sight is phenomenal, inaccurate captions certainly are not!
Why Are ASR Capabilities so Different for Captioning?
When relying solely on ASR technology, accuracy rates are pretty abysmal. But why? Captioning is much more complicated than many other applications of ASR: it typically involves long-form content, speakers who are unknown to the system, and a requirement that every spoken word be transcribed.
When it comes to captioning, some of the most common causes of ASR errors include the following (one of them is sketched in code after the list):
- Speaker labels
- Punctuation, grammar, and numbers
- Non-speech elements
- [INAUDIBLE] tags
- Multiple speakers or overlapping speech
- Background noise or poor audio quality
- False starts
- Acoustic errors
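Several of these, such as [INAUDIBLE] tags, stem from the recognizer’s own uncertainty: when per-word confidence drops too low, a captioning pipeline may emit a tag rather than guess. Here is a minimal sketch of that idea; the words, confidence scores, and threshold are all invented for illustration:

```python
# How [INAUDIBLE] tags can arise: when the recognizer's per-word confidence
# falls below a threshold, substitute a tag rather than guess.
# The words, scores, and threshold below are invented for illustration.
def tag_low_confidence(words, threshold=0.5):
    """words: list of (token, confidence) pairs from an ASR engine."""
    return " ".join(tok if conf >= threshold else "[INAUDIBLE]"
                    for tok, conf in words)

asr_output = [("the", 0.98), ("crew", 0.95), ("reported", 0.91),
              ("nominal", 0.32), ("telemetry", 0.27)]
print(tag_low_confidence(asr_output))
# -> the crew reported [INAUDIBLE] [INAUDIBLE]
```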
ASR technology is also prone to fail on small “function” words, which are important in conveying meaning in speech. For example, take a look at the two sentences below:
“I didn’t want to do that exercise.” vs. “I did want to do that exercise.”
This is a very typical ASR error, yet the meaning is completely reversed. It is rare for a human, especially a trained editor, to make such an error, because a human uses context to “fill in” the correct meaning in spite of whatever noise may have caused the ASR failure. Many of the above challenges are not being addressed by current technology or by current research.
Under perfect audio conditions, we have seen ASR technology produce captions that are roughly 80% accurate, at best. Perfect audio conditions, however, are rare.
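Accuracy figures like this are typically derived from word error rate (WER): the number of substituted, inserted, and deleted words divided by the number of words in the reference transcript. Below is a minimal sketch of the standard calculation (the textbook metric, not any vendor’s internal scoring), applied to the function-word example above:

```python
# Word error rate (WER): substitutions + insertions + deletions, divided
# by the number of words in the reference. This is the standard metric,
# not a description of any vendor's internal scoring.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "i didn't want to do that exercise"
hyp = "i did want to do that exercise"
print(f"WER: {wer(ref, hyp):.0%}")  # -> WER: 14%
```

By this measure the hypothesis is about 86% “accurate,” comfortably above the roughly 80% figure, and it still says the opposite of what the speaker meant. Raw accuracy numbers understate the problem.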
We Still Need Humans
As the saying goes, “If you want something done right, do it yourself.” (Or at the very least, have a human do it. 🤷‍♀️) All jokes aside, when it comes to purely automated captioning, there is nothing on the horizon that addresses all of the challenges posed by the captioning task. Technology simply doesn’t have the human capacity to understand nuance or discern unclear words from context. Therefore, we should be careful about generalizing from the success of automated approaches to any particular business problem, such as captioning.
ASR can play an important part in captioning when used in conjunction with human editors. At 3Play Media, we use a 3-step process to provide a high-quality, yet cost-effective captioning solution.
- First, your video will go through ASR technology to produce a rough draft (a simplified sketch of this step appears after the list).
- Next, a human editor will clean up the rough draft using our proprietary software.
- Finally, a quality assurance manager will conduct a final review to ensure 99% accuracy.
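Our proprietary software isn’t public, but to make step 1 concrete, here is a minimal sketch of turning hypothetical ASR output (timed segments) into an SRT rough draft for an editor to correct. The segment data, including its errors, is invented for illustration:

```python
# A minimal sketch of step 1: formatting hypothetical ASR output (timed
# segments) as an SRT rough draft for a human editor to correct.
# The segment data is invented; 3Play's actual pipeline is not shown here.
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamps SRT requires."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

# Hypothetical ASR output, errors and all; the editor fixes these in step 2.
draft = to_srt([
    (0.0, 2.4, "welcome back to the broadcast"),
    (2.4, 5.1, "i did want to do that exercise"),  # invented ASR error
])
print(draft)
```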
At 3Play, we feel confident that human editors will remain a necessary component of producing high-quality captions for the foreseeable future.
—
Get started with the highest quality captions in the industry or learn more about 3Play Media’s captioning solution today.