The White House Transcripts and The Zelensky-Trump Transcription Process

Feb 12, 2020 | Jonathan Maisel

According to The Washington Post, the conversion of the telephone conversation between President Zelensky and President Donald Trump into a transcript did not follow transcription industry best practices. It is a shame that the White House does not have the best technology and relies on error-prone processes to document its meetings. The 33-minute call resulted in a 5-page document, indicating a speech rate of at least 60 words a minute. The document is clearly heavily edited and formatted, including annotations, underlining and punctuation that were unlikely to have been dictated. Interpreters commonly improve the grammar and readability of what was said to convey the speaker's intent, and verbatim transcripts are rarely requested.
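The speech-rate figure can be checked with simple arithmetic. The sketch below is only an illustration: the 400 words per page is an assumed value for a typical single-spaced page, not a number taken from the call summary, and the function name is hypothetical.

```python
# Assumption: roughly 400 words on a typical single-spaced page.
# This figure is NOT from the article; change it to test other page densities.
ASSUMED_WORDS_PER_PAGE = 400


def implied_words_per_minute(pages: float, minutes: float,
                             words_per_page: int = ASSUMED_WORDS_PER_PAGE) -> float:
    """Return the average speech rate implied by a document length and a call duration."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return pages * words_per_page / minutes


# Values from the article: a 5-page document for a 33-minute call.
rate = implied_words_per_minute(pages=5, minutes=33)
print(f"Implied rate: {rate:.1f} words per minute")
```

Under that assumption the result lands just above 60 words a minute, far below normal conversational speed, which is consistent with a heavily condensed summary rather than a verbatim record.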

As described, other people also listened to the call. One of those listeners “redictated” the conversation into a speech recognition system. Typically, traditional speech recognition turns speech into text automatically. It usually then saves the audio file and pairs it with the generated text in a linked file. These can be separate files that may have independent life cycles. The audio is usually saved until the transcript is complete and finalized and then discarded, but it can be archived and saved indefinitely. Voice recognition, by contrast, identifies a speaker by comparing the speaker's voice to previously recorded audio files of known people.

Redictation, called “shadow masking,” is an older speech recognition method, rarely used today, that court reporters sometimes used with speech recognition software. The software they used achieved higher speech recognition accuracy because they had previously trained it with their own voices to create an acoustic profile. These older speech recognition engines, limited by computing speed, also used a language model containing words and combinations of words to improve accuracy over a limited vocabulary. Modern speech recognition software, coupled with faster processing, is so good that it usually does not require an acoustic profile, user training or language models. Having an experienced speech recognition user redictate could improve accuracy.

If the “shadow” dictator could repeat the text accurately and quickly, they could also dictate punctuation and identify each speaker. One obvious problem is that it is difficult to listen and accurately repeat everything, especially if the speech is fast. To make matters worse, some dictation engines tend to skip words, phrases or even sentences when the rate of speech exceeds 160 words per minute. Many newscasters exceed this rate, and only exceptional captioning software or humans can keep up.
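To make the 160 words per minute threshold concrete, here is a minimal sketch of how timed transcript segments could be screened for speech that is too fast for such an engine. The `Segment` structure and function names are hypothetical and do not describe any particular product; only the 160 figure comes from the text above.

```python
from dataclasses import dataclass

# Threshold quoted in the article; engines differ, so treat it as illustrative.
MAX_RELIABLE_WPM = 160


@dataclass
class Segment:
    """Hypothetical timed chunk of dictated speech."""
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str       # words spoken in this chunk


def words_per_minute(seg: Segment) -> float:
    """Speech rate of one segment, counting whitespace-separated words."""
    duration_min = (seg.end_s - seg.start_s) / 60
    if duration_min <= 0:
        raise ValueError("segment must have a positive duration")
    return len(seg.text.split()) / duration_min


def flag_fast_segments(segments: list[Segment],
                       limit: float = MAX_RELIABLE_WPM) -> list[Segment]:
    """Return the segments whose speech rate exceeds the limit."""
    return [seg for seg in segments if words_per_minute(seg) > limit]


# Two 10-second chunks: 20 words (120 wpm) and 30 words (180 wpm).
calm = Segment(0, 10, " ".join(["word"] * 20))
fast = Segment(10, 20, " ".join(["word"] * 30))
flagged = flag_fast_segments([calm, fast])
print(len(flagged), "segment(s) exceed", MAX_RELIABLE_WPM, "words per minute")
```

Segments flagged this way are the ones most likely to contain skipped words and would deserve a closer human review.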

Because the above process is potentially so error-prone, multiple other “trusted” people took notes and attempted to improve the content of a final summarization of the call. It is no wonder that so many questions have arisen about the accuracy, completeness and veracity of the available final call summary.

There are also some claims that the speech recognition process might be vulnerable to hackers. Some speech recognition software, even when run locally on a PC, sends the audio and recognized text to servers to improve quality or for other reasons. Other speech recognition companies receive the audio file and perform speech recognition on their own servers. One public medical transcription company was affected by the NotPetya malware in 2017 and was shut down for several weeks. Other companies use overseas human editors in the loop to assist in the post-recognition process and have also had security breaches.

What would be the ideal alternative? On a daily basis, ZyDoc is entrusted with thousands of dictations that need to be securely, accurately and quickly turned into transcripts. We perform this work under HIPAA and HITECH regulations, which require physical, operational and procedural security and confidentiality, using background-checked and security-trained U.S. employees. Security, confidentiality and privacy are expected in this work and are required for sensitive work performed for governmental agencies, prisons, medical-legal work and psychiatric institutions, with audit logs and job tracking.

The White House could use ZyDoc to take the audio signal from calls and send it securely over encrypted HTTPS channels to ZyDoc servers in government-approved SOC 2 data centers. Speech recognition with separate speaker segmentation could be performed with high accuracy on live audio in near real time or on saved audio files. Our technology accommodates fast dictators such as radiologists or news announcers. It can also enhance the accuracy, punctuation and formatting of the recognized text. Human subject matter expert editors take these important transcripts and produce 99.6% accurate documents. The entire process is measured for accuracy at every step with a quality assurance process. Text and audio can be retained indefinitely or follow a prescribed life cycle. Once finalized, audio files are not necessarily saved indefinitely.

James M. Maisel, M.D.

Founder and CEO

