Measuring Captioning Accuracy: Why WER and NER Analyses Differ

February 28, 2024 BY REBECCA KLEIN



When it comes to measuring captioning accuracy, there’s no shortage of errors that need to be considered: punctuation, grammar, speaker identification, capitalization, and word errors, to name a few. But what does it mean when a captioning vendor says their captions are 99% accurate? 

It turns out that “99% accuracy” can mean very different things depending on the model a vendor uses to measure it. Vendors use different measurement models, which causes confusion when percentages are marketed to describe the accuracy of closed captions. In this post, we’ll discuss the NER model and how it differs from two commonly used measurements, Word Error Rate (WER) and Formatted Error Rate (FER).

The NER Model

The NER model, which originated in Europe and is often used in Canada, differs from the accuracy measurement rates commonly used in the United States. In the U.S., all errors—including spelling, punctuation, grammar, speaker identifications, word substitutions, omissions, and more—are considered to obtain a percentage that measures the average accuracy of the closed captions on a piece of media. 

In contrast, NER scoring emphasizes meaning and how accurately ideas are captured in captions, which makes it a highly subjective measurement. The FCC closed captioning guidelines, by comparison, state, “In order to be accurate, captions must match the spoken words in the dialogue, in their original language (English or Spanish), to the fullest extent possible and include full lyrics when provided on the audio track.” More specifically, the guidelines require captions to include all words spoken in the order spoken (i.e., no paraphrasing). Given these legal requirements for live and recorded captioning, the subjectivity of NER scoring makes it an inherently risky method.

How are NER Scores Calculated?

When using the NER model, vendors grade each caption error by its severity, that is, by how much it affects understanding. In many cases, vendors decide for themselves what constitutes a critical error. This subjectivity means that the same caption file could receive different NER results depending on who scores it, which creates significant liability for customers.

The NER Calculation
NER Score = (Words – NER Deductions) / Words * 100

NER scores inflate quickly for two reasons. First, the numerator, which represents the number of correct words, starts at the total count of captioned words and is reduced only by fractional deductions, even if a whole sentence is paraphrased or several consecutive words are wrong. Second, the denominator is the total number of words captioned, not the total number of words that should have been captioned based on the verbatim dialogue.
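To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name and the sample figures (1,200 words spoken, 1,000 words captioned, ten paraphrased sentences graded at 0.25 each) are hypothetical and chosen only for illustration.

```python
def ner_score(words_captioned: int, deductions: float) -> float:
    """NER Score = (Words - NER Deductions) / Words * 100."""
    if words_captioned <= 0:
        raise ValueError("words_captioned must be positive")
    return (words_captioned - deductions) / words_captioned * 100


# Hypothetical figures: 1,200 words were spoken, but only 1,000 were
# captioned after paraphrasing. Ten paraphrased sentences were each
# graded as a 0.25 deduction.
words_spoken = 1200  # never enters the formula
score = ner_score(words_captioned=1000, deductions=10 * 0.25)
print(f"NER score: {score:.2f}")  # 99.75
```

Note that the 200 spoken words missing from the captions never appear in the calculation, and ten paraphrased sentences move the score by only a quarter of a point.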




Types of NER Errors

NER errors are categorized under two main types, each with corresponding deduction values of either 0.0, 0.25, 0.5, or 1.0 (a full point deduction). In this way, the NER model functions more as a score than a percentage. Caption scoring begins at 100 and is graded according to the number of errors and their assigned score deductions. 

Of note, the NER marking of “Correct Edition” indicates that paraphrased captions capture the full meaning of the spoken content. However, a passage marked Correct Edition can sharply lower WER-based accuracy while incurring no deduction in the NER score. At 3Play Media, we see many examples of this difference, which is consequential for accessibility and legal compliance with FCC standards and other legislation.

NER vs. WER: Different Measurements Provide Different Results
In conducting market research, we scored the captions for a Canadian government meeting using the NER method and received a score of 99.00 (or “very good”) because the captioner used a high degree of paraphrasing that was “mostly successful.” However, when we scored the same meeting using the WER method, we received an accuracy rating of 93.2%, which is not compliant with FCC standards because of how many captions were paraphrased rather than verbatim. We plan to conduct further research to analyze the measurement challenges of NER vs. WER.

Edition errors represent the loss of an idea unit or piece of information. They include:

  • Critical Error (False Information): An editing or paraphrasing error provides false but plausible information (-1.0)
  • Major Error (Loss of Main Point): Inaccurate captions lose the main point of an idea (-0.5)
  • Minor Error (Loss of Detail): Inaccurate captions keep the main point but lose a detail (-0.25)
  • Correct Edition: Paraphrase captures the full meaning (0.0)

Recognition errors represent misrecognition of the spoken content. They include:

  • Critical Error (False Information): A wrong word, phrase, or punctuation error provides false but plausible information (-1.0)
  • Major Error (Nonsense Error): A wrong word, phrase, or punctuation error affects comprehension of an idea (-0.5)
  • Minor Error (Benign Error): A wrong word, phrase, or punctuation error affects readability but not comprehension (-0.25)
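The deduction scheme above can be tallied with a short Python sketch. The severity labels used as dictionary keys, the function name, and the sample grading are all hypothetical; only the deduction values come from the lists above.

```python
from collections import Counter

# Deduction values from the edition and recognition error lists above.
# The key names are hypothetical labels chosen for this sketch.
DEDUCTIONS = {
    "critical": 1.0,
    "major": 0.5,
    "minor": 0.25,
    "correct_edition": 0.0,
}


def total_deductions(graded_errors):
    """Sum the deductions for a list of severity labels."""
    counts = Counter(graded_errors)
    unknown = set(counts) - set(DEDUCTIONS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(DEDUCTIONS[label] * n for label, n in counts.items())


# Hypothetical grading of a 600-word caption file: 30 paraphrases marked
# Correct Edition, 4 minor errors, 2 major errors, and 1 critical error.
graded = ["correct_edition"] * 30 + ["minor"] * 4 + ["major"] * 2 + ["critical"]
deductions = total_deductions(graded)
score = (600 - deductions) / 600 * 100
print(deductions)        # 3.0
print(f"{score:.2f}")    # 99.50
```

In this made-up example, the 30 paraphrased passages marked Correct Edition contribute nothing to the total, so the score depends entirely on the seven errors the scorer chose to grade as minor, major, or critical.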

Word Error Rate (WER) and Formatted Error Rate (FER)

More commonly, captioning accuracy for recorded content is measured in two parts: Word Error Rate (WER) and Formatted Error Rate (FER). WER is the standard measure of transcription accuracy and compares the number of incorrect words with the total number of words. FER, in contrast, is the percentage of errors once formatting elements such as punctuation, grammar, speaker identification, non-speech elements, capitalization, and other notations are also taken into account.
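As a rough illustration, the Python sketch below computes a word-level error rate using the conventional edit-distance formulation (substitutions, deletions, and insertions divided by the number of reference words). That formulation is an assumption here, since this post does not spell out a formula. Treating FER as the same computation with capitalization and punctuation preserved is also a simplification for illustration; it ignores speaker identification and non-speech elements. All function names and sample sentences are hypothetical.

```python
import string


def tokenize(text, keep_formatting=False):
    """Split text into words; optionally strip case and punctuation."""
    if not keep_formatting:
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference, caption, keep_formatting=False):
    """Errors divided by the number of words in the verbatim reference."""
    ref = tokenize(reference, keep_formatting)
    hyp = tokenize(caption, keep_formatting)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


spoken = "The committee will meet on Tuesday to review the budget."
unformatted = "the committee will meet on tuesday to review the budget"
paraphrase = "The committee meets Tuesday on the budget."

print(error_rate(spoken, unformatted))                        # 0.0 (words only)
print(error_rate(spoken, unformatted, keep_formatting=True))  # 0.3 (formatting counted)
print(error_rate(spoken, paraphrase))                         # 0.5 (paraphrase)
```

Because the denominator is the verbatim reference, the paraphrased caption is penalized for every dropped or changed word, even though it arguably preserves the meaning.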

For closed captioning of recorded content, the FCC requires at least 99% accuracy with all of these formatting elements taken into account. For both recorded and live content, the FCC quality standards do not permit the flexibility that the NER model allows. Live captioning does not have firm accuracy standards and instead relies on best practices, but the FCC still focuses on accuracy, synchronicity, completeness, and placement, which align more closely with WER and FER than with NER.

WER and FER vs. NER: Unequal Measures of Accuracy and Quality

Compared to WER and FER, NER is not an equivalent measure of accuracy or quality. While captioning with a high NER score may be useful for viewers who value overall meaning instead of absolute accuracy, higher WER and FER measurements are essential for d/Deaf and hard-of-hearing viewers and legal compliance.

Some vendors do not state which model they’re using in determining accuracy for recorded captions, which is a misleading practice and can put your content at risk for litigation. When evaluating a potential vendor, you should always inquire about the models they use to determine the accuracy of their captions.

Additionally, NER scoring is more beneficial for live content and less applicable to recorded content, so be wary when a vendor uses NER to describe accuracy for recorded captioning. There are inherent challenges in captioning live content, and recorded captions should be measured differently because the captioner has more time and can perfect the verbatim transcription. NER scoring, if used, should always be near perfect for recorded content because a recorded captioner should never need to summarize the spoken content—captions should achieve verbatim accuracy and, by doing so, retain meaning.

Ultimately, when offering closed captioning as an accommodation, the best practice is often to provide an equitable experience by presenting content as spoken, which necessitates using WER and FER to measure accuracy.

