Pros and Cons of Synthesized Speech for Audio Description

May 9, 2017 BY ELISA LEWIS
Updated: September 13, 2024

Traditional audio description providers use humans for the entire workflow. In most cases, description writers create the description transcript, and human voice actors then deliver the description. However, this method is costly and time-consuming. One way to streamline the audio description process and bring down costs is to leverage current technology, and at 3Play Media we're doing exactly that. We have built several customizable features that make synthesized speech sound more humanlike, so we can take advantage of what this technology offers without sacrificing quality. These customizable settings give our customers the benefits of synthesized speech while counteracting some of its drawbacks, providing the best of both worlds.

Pros of Using Synthesized Speech for Audio Description

Despite some shortcomings, synthesized speech offers many advantages for audio description: production time is cut significantly, costs are much lower, and the customer is in complete control.

Shorter Production/Turnaround Time

With traditional methods, audio description can take up to several weeks to produce. When human voice actors deliver the description, recording it (often in a studio) and then editing the audio can be very time-consuming. The editing process is especially time-intensive for longer content.

In our system, audio description can be completed much faster, because synthesized speech lets us skip voice recording and audio editing entirely. If a transcript file already exists for your content, audio description can be completed in five business days from the time the order is placed. If there is no existing transcript, audio description will be completed five business days after the transcript deadline.

Lower Cost

Audio description is typically very costly. Traditional methods use humans for the entire workflow: in most cases, description writers create the description transcript, then human voice actors deliver the description within the specified timecodes. This method requires many people, close attention to detail, expensive equipment, and a high level of skill.

3Play Media is taking a new approach to audio description, using a combination of humans and technology at every step of the description process. By employing technology, we're able to make writing and timecoding descriptions easier, faster, and more cost-effective.

We use certified human describers to write high-quality descriptions, then use synthesized speech to voice them. Technology also addresses one of the hidden costs of audio description: the resource-intensive process of publishing it. We've developed a plugin that removes the need to produce a second version of your video with audio description. By combining human editing with advanced technology, we can significantly decrease the cost of audio description without sacrificing quality.

User Control

With synthesized speech, the user is in complete control, because the synthesized voice reads the exact description written by the human describer. If you decide you would like to tweak the description, you can do so in the 3Play Media account system. Simply log in to your 3Play account, preview your audio description file, select "edit" to make your changes, then submit it for finalization. The finalized file will appear in your account once it has finished processing.

When voice actors deliver the description, even changing a single word can be difficult and costly. You would need to find the voice actor who recorded the original description and pay them again for the additional time. Then you would have to wait for the revised file, which can delay delivery significantly.

In addition to adjusting or changing content, synthesized speech allows you to control the rate of speech and the voice that is used, as you can select the voice and dialect from several different options.

Familiarity

The sound of synthesized speech is familiar to most blind users, who are accustomed to the more mechanized voices of screen readers. This familiarity makes synthesized speech a natural fit for audio description. Blind listeners are not only used to the mechanized voice; many can also follow very fast-paced audio.

Cons of Using Synthesized Speech for Audio Description

Lack of Tone and Emotion

Synthesized speech has traditionally been thought to lack tone and emotion. Over the years, however, it has improved greatly, sounding less robotic and increasingly humanlike. Beyond these advances in the technology itself, 3Play has taken several measures to improve emotion and tone. By doing so, we've made synthesized speech a true solution for audio description.

We have created speaker customization features that let you select the voice and rate that best fit your needs. On the Audio Description Settings page of the 3Play Media account system, you can choose from several speakers with various dialects, as well as several speaking rates, and you can sample each option before proceeding. This lets you decide whether you prefer a speaker that is easily distinguishable from the content's main speaker or dialogue, or a voice that flows naturally within the original audio. We can continue to improve tone and emotion in synthesized speech with features like "say as 'concerned'" or "say as 'happy.'"

No Subjective Judgments

With synthesized speech, there is no room for the subjective judgments a voice actor might make. For example, a voice actor may substitute a word they consider preferable to the one the describer originally chose. A synthesized voice, by contrast, reads the words exactly as they appear in the description and cannot adapt to unexpected situations that may arise. At 3Play Media, we want to ensure our customers are satisfied with the description of their files. That's why we provide an edit tool that lets users edit their own files once our describers have completed them. This way, even with synthesized speech, you have the freedom to make your own judgment calls.

Pronunciation

Synthesized speech can have trouble with pronunciation; after all, it is a machine speaking rather than a human. At 3Play, we have addressed this issue with phonetic pronunciations, which can be built into the synthesized speech to improve how words are spoken. As the technology continues to develop, pronunciation will continue to improve as well.
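This article does not describe how 3Play builds phonetic pronunciations into its voices. As a general illustration only, many text-to-speech engines accept the W3C Speech Synthesis Markup Language (SSML), where a `<phoneme>` element supplies an IPA pronunciation for a word and a `<prosody rate="...">` element sets the speaking rate. The sketch below assumes an SSML-capable engine; the `PRONUNCIATIONS` lexicon and the `to_ssml` function are hypothetical names for illustration, not part of any 3Play product.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation lexicon: word -> IPA transcription.
PRONUNCIATIONS = {"Worcester": "ˈwʊstər"}


def to_ssml(description: str, rate: str = "medium", lexicon: dict = PRONUNCIATIONS) -> str:
    """Wrap a description in SSML, adding <phoneme> tags for lexicon words.

    A sketch assuming an SSML-capable TTS engine; not a vendor API.
    """
    if not lexicon:
        body = escape(description)
    else:
        # Match whole words only, longest entries first so longer names win.
        words = sorted(lexicon, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
        pieces, last = [], 0
        for match in pattern.finditer(description):
            # Escape plain text between matches so "&" or "<" cannot break the XML.
            pieces.append(escape(description[last:match.start()]))
            word = match.group(0)
            pieces.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[word])}>{escape(word)}</phoneme>'
            )
            last = match.end()
        pieces.append(escape(description[last:]))
        body = "".join(pieces)
    # The prosody rate mirrors a user-selected speaking rate setting.
    return f"<speak><prosody rate={quoteattr(rate)}>{body}</prosody></speak>"


if __name__ == "__main__":
    print(to_ssml("A sign reads Worcester & Co.", rate="fast"))
```

Running the example prints `<speak><prosody rate="fast">A sign reads <phoneme alphabet="ipa" ph="ˈwʊstər">Worcester</phoneme> &amp; Co.</prosody></speak>`. Matching against the raw description in a single pass, and escaping only the text between matches, keeps one lexicon entry from accidentally matching inside markup inserted for another.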

The benefits of using synthesized speech, combined with the features available in the 3Play account system, allow us to provide a high-quality, cost-effective, and fast solution for video accessibility.


Originally published May 9, 2017, updated September 22, 2017.
