25 Years In Speech Technology

Matthew Karas
8 min read · Oct 26, 2020

…and I still don’t talk to my computer.

These are personal reflections and observations, some of which might seem opinionated or simply be mistaken. Feel free to comment; I probably won’t be offended.

While I was a post-grad at Cambridge in 1994, I was taught by Steve Young and Tony Robinson, who had created some of the best speech recognition systems in the world. However, the most important thing I learned in my first few days was something I would never have guessed from seeing early versions of DragonDictate: they had already cracked it. A modestly powerful computer could convert continuous natural speech into text, as it was being spoken, with about 95% accuracy; in other words, the technology was already better than all but the best-trained professionals.

Why has it taken until the last few years for speech recognition to be adopted in day-to-day use? The technology has many hidden industrial applications, but as a real-time user interface for day-to-day use, i.e. talking to your computer, adoption has been unbelievably slow. When I was studying in the 90s, I read about a sort of reverse Turing test, which demonstrated one reason why. Volunteers believed they were talking to a computer, but responses were actually provided by a human being typing “behind the curtain”. The observations and subsequent interviews showed that, back then, people simply didn’t like it.

So, what’s the problem?

I’m sure that in part, it is just unfamiliarity, and so there is a generational effect. My kids talk to computers more than I do. However, there really are serious problems with speech as a primary user interface:

  • Privacy: do you want others to hear your search terms and messages?
  • Immature tech: it isn’t quite there yet. Go into a crowded coffee shop and say “Hey Siri …”
  • Time-based: speech is linear, so you can’t scan for the good bits

However, when I left college to work in the real world, I learned a couple of things about speech recognition which (luckily for me) were not obvious to all the people demonstrating and discussing it, most of whom were still focused on dictation.

Lesson 1: speech recognition isn’t for dictation

I fell into a career developing scalable digital media products. Among other things, I led the development of BBC News Online. Then, in 2000, I decided to apply my knowledge of speech technology to solving problems in the media industry, while respecting my hunch that people still didn’t like talking to computers.

So, with funding from a big software company, I developed some products which revolved around applying speech recognition to recorded speech. This was surprisingly easy, because another company in the same group was SoftSound, founded by my old teacher from Cambridge, Tony Robinson.

I had been particularly interested in Tony’s lectures, and jumped at the chance of working with him on product development. He had succeeded in competing with the world’s best systems while using much less memory and processing power, because his recognisers were built on neural networks. In that sense, we were decades ahead of the crowd, most of whom moved towards neural nets in the mid-2010s.

My team took SoftSound’s speech recognition algorithms and devised ways to combine them with video, text and image recognition, to create search engines for TV, film and radio archives. We created all kinds of cool things, like editing software that allowed a video to be edited just by cutting and pasting the script. This was a little too far ahead of its time to be a hot seller, but we won some awards and got a lot of good press.

Lesson 2: people are easily put off

Seeing people use our speech-based search engines was a revelation. It taught me that people love to spot mistakes, and use them as a reason to dismiss even the most obviously useful of innovations. This was a similar phenomenon to the YouTube clips of Scottish people talking to early versions of Siri.

The TV archives we worked on had all kinds of background noise and music, and recognition accuracy fell from the laboratory rate of 95% to about 65–70%. Interestingly, this still allowed the search engine to find the right clips: a query term only has to be recognised correctly once in a clip for that clip to be found, so word-level errors matter far less for search than for reading a transcript.

The problem was that if we showed users the transcripts in the results list, their eyes would be drawn to the errors (one or two in almost every line), even though the transcripts contained their search terms. However, the tech was working, and it didn’t take long to come up with a solution: instead of showing the full text, we showed a still image from each clip and a list of the matching words.

Instead of scoffing, users suddenly saw it as magic.

For me, it was a great use of the technology, compared to the dictation packages seen at every trade show. It was genuinely useful, and it didn’t rely on changing anyone’s behaviour too much. It extended what was becoming a ubiquitous human skill (searching for stuff by typing keywords) and applied it to even more stuff: videos as well as web pages. Our standard demo involved searching for a keyword within hundreds of hours of video and hitting “next” over and over again, watching the video jump to another and another and another person saying the entered word in different contexts.
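To make the mechanics concrete, here is a minimal Python sketch of the kind of word-level index that could sit behind a demo like that. The clip names, timestamps and data layout are invented for illustration, not a description of our actual product; the point is that search only needs the query word to be recognised somewhere in a clip, and the interface only needs to surface the matching words and a still frame, never the error-ridden transcript.

```python
from collections import defaultdict

# Hypothetical ASR output: one (word, time_in_seconds) pair per recognised word.
# Even at 65-70% word accuracy, the query term is usually recognised correctly
# somewhere in a relevant clip, so the clip is still found.
clips = {
    "newsnight_1998_03_12": [("chancellor", 12.4), ("budget", 13.1), ("growth", 15.8)],
    "panorama_2001_07_02": [("budget", 4.0), ("airline", 4.6), ("budget", 91.2)],
}

# Build an inverted index: word -> list of (clip_id, timestamp).
index = defaultdict(list)
for clip_id, words in clips.items():
    for word, t in words:
        index[word].append((clip_id, t))

def search(term):
    """Return clips containing the term, with only the matching words and times.
    Showing just these hits (plus a still frame) hides the transcript errors."""
    hits = defaultdict(list)
    for clip_id, t in index.get(term.lower(), []):
        hits[clip_id].append(t)
    return dict(hits)

print(search("budget"))
# {'newsnight_1998_03_12': [13.1], 'panorama_2001_07_02': [4.0, 91.2]}
```

Hitting “next” in the demo is then just stepping through those timestamps, clip by clip.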

Now, to give Nuance and the DragonDictate team credit: by the late 1990s, they had created Dragon NaturallySpeaking, which no longer required the user to pause between words, and quite soon afterwards they, like us, were selling their technology as a toolkit to integrate into any application.

Also, although it was of no interest to me, there were of course all sorts of people who did use speech recognition for dictation: professionals for whom dictation was already the norm, and a wide variety of people for whom keyboards are difficult to use.

This time next year, we’ll be millionaires

Whether at SoftSound, Entropic or Nuance, from the mid-1990s on we used to joke, year after year after year, that “next year will be the big year for speech recognition”. Somehow that year has finally crept up on us.

Siri, Alexa et al

The lessons I learned building real-world applications are pertinent to the behaviour I have seen in the last few years. Many people still don’t like talking to Siri when they have enough fingers free to type instead. However, just like our success in extending search into new media types, Siri and its cohort have succeeded in extending search into new situations: driving, cooking, bathing the kids and so on:

“OK Google … petrol station”

“Siri, how long should I roast a 2.4 kilo chicken?”

“Alexa, play The Gruffalo on Audible”

That said, it is now a full ten years since the launch of Siri, and it is still not that easy to reroute a map, or to correct Alexa quickly when Audible starts reading “50 Shades” to your kids.

Audio feedback doesn’t give the user the same reassuring sense of certainty as a graphical user interface. One glance will confirm that I have typed my card number correctly, but you don’t have to be unusually impatient for your heart to sink when you hear the inhumanly calm words, “I heard 4659 1234 1234 1234. Is that correct? Say yes or press one to confirm”.

And as for errors: alongside the Scottish-accent YouTube clips, by 2016 there were news stories, far less jokey, claiming that this was inherently racist technology. If Microsoft Office only worked for 90% of people, there would be uproar. Did this mean that speech recognition was just a novelty, rather than a real product empowering business?

However, neural nets really have come to the rescue, especially for this kind of problem. Having enough of the right training data turns out to be much more important than knowledge of the phonetic differences between accents — the network will work out what those differences are.

Even five years ago, we needed to train systems for each regional accent, but nowadays Siri copes with Scottish accents simply by having its networks trained on Scottish people reading known texts, i.e. by teaching the networks the various ways in which a word can be pronounced.

So, will speech replace keyboards and screens? Wrong question.

Computers have made multi-taskers of us all, and I sometimes think that, as an interface, even for interpersonal communication, speech can set us back: I can be in several text chats at once, but I can’t be on two voice calls. Text and screen interactions have some real advantages, with which speech shouldn’t even try to compete.

However, for speech technology to fulfil the potential of what it, uniquely, can do well, it still has much further to go. This is good news for the industry, as more and more startups are funded to solve real-world problems not dealt with by the big players.

Technology has to get as good at listening, and at speaking, as human beings are, and then, in some contexts, get better than we are. Here are a few examples from projects that I and others have been working on lately.

Away from our headsets, speech isn’t really as linear as I have made out. In close proximity to someone speaking, I might whisper a comment to another listener and still go unheard by anyone else. At a dinner party, I might be involved in more than one conversation at a time, because it is easy, in the 3D space of the real world, to keep track of who has said what, and to control the volume and direction of my speech to target a specific listener.

Technology to separate speech from different speakers is coming on in leaps and bounds. This is achieved both by analysing the speech itself more deeply and by combining the audio with other sources: using multiple microphones to measure relative volume and direction, or using input from cameras to add lip movements and facial expressions to the mix.
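As a rough illustration of the multiple-microphone idea, here is a minimal delay-and-sum beamformer sketch in Python/NumPy. It is my own toy example rather than a description of any particular system: real products estimate direction automatically, use fractional delays and adaptive weighting, and layer far more sophisticated separation models on top.

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Very simple delay-and-sum beamformer.

    mic_signals    : array of shape (n_mics, n_samples), one row per microphone
    delays_samples : integer delay (in samples) applied to each mic so that
                     sound arriving from the target direction lines up in time

    Aligning and averaging reinforces the target speaker and partially
    cancels sound arriving from other directions.
    """
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(mic_signals, delays_samples):
        out += np.roll(sig, d)   # crude alignment; real systems use fractional delays
    return out / n_mics

# Toy example: the 'speaker' reaches mic 1 one sample later than mic 0.
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)              # stand-in for a voice
mics = np.stack([speech, np.roll(speech, 1)])     # mic 1 hears it 1 sample late
enhanced = delay_and_sum(mics, [1, 0])            # delay mic 0 so both line up
```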

In 2016, Google came up with a new approach to speech synthesis: WaveNet, a neural network that can be trained to generate almost any kind of sound, which they trained on real human speech. Once trained, it can be fed with quite robotic synthesized speech and make it sound human.
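For readers who want a feel for what is inside WaveNet, here is a minimal PyTorch sketch of its core building block, the dilated causal convolution, which lets each output sample depend on an exponentially wide window of past samples. This is a toy illustration only; the real model adds gated activations, residual and skip connections, and conditioning on text or spectrogram features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that only looks at past samples (causal), with dilation
    so the receptive field grows exponentially as layers are stacked."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad the past only, never the future
        return self.conv(x)

class TinyWaveNetStack(nn.Module):
    """A few dilated causal layers; a toy stand-in for the full architecture."""
    def __init__(self, channels=32, n_layers=6, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [CausalDilatedConv1d(channels, kernel_size, dilation=2 ** i)
             for i in range(n_layers)]          # dilations 1, 2, 4, 8, 16, 32
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

# With kernel size 2 and 6 layers, each output sample 'sees' the previous
# 64 input samples: the exponential receptive field that lets WaveNet
# model raw waveforms sample by sample.
model = TinyWaveNetStack()
audio = torch.randn(1, 32, 16000)   # (batch, channels, samples)
out = model(audio)                  # same length as the input
```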

Nowadays, the latest developments are routinely shared. The whole industry takes ideas from Google, NVIDIA, Microsoft and a global community of university researchers and, with their blessing, extends and applies them in new contexts, adding expertise from its own niches.

I have spent a lot of time working on systems which analyse accents, mispronunciations and speech impediments. Some people are difficult to understand because they have an unfamiliar accent, or are only just learning a language. We can make it easier for them to master pronunciation by giving them real-time feedback, but maybe we needn’t bother: morphing accents and correcting errors in real time are both becoming a reality.

Speech recognition saves the human race!

Speech does not only vary by accent, but also by emotional and physical state. When a condition makes someone unintelligible, it is feasible not only to improve intelligibility but also to identify what is wrong: perhaps categorising emergency calls where the speaker is affected by stroke, sedation, drunkenness or concussion, or merely identifying that the caller is a child, or speaks a particular language.

Finally, early identification of certain serious long-term neurological conditions is possible by monitoring subtle changes to speech. This can be done without hospital visits, and even without targeting those who are at risk. Conveniently for all concerned, we all speak into our phones and computers all the time, so it would only be necessary to opt in and give permission for your voice to be analysed, without compromising confidentiality by having it recorded or listened to.
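To show the general shape of such a screening pipeline, here is a heavily simplified Python sketch: summarise each recording as a small acoustic feature vector and train a classifier on labelled examples. The file names and labels are entirely hypothetical, and real clinical work relies on much richer features, longitudinal data and careful validation.

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def voice_features(path, sr=16000):
    """Summarise a recording as a small fixed-length acoustic feature vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
    # Mean and variability of each coefficient over time.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labelled recordings: 1 = flagged for follow-up, 0 = not flagged.
paths  = ["rec_001.wav", "rec_002.wav", "rec_003.wav", "rec_004.wav"]
labels = [0, 1, 0, 1]

X = np.stack([voice_features(p) for p in paths])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Screening a new recording is then a single prediction, e.g.:
# risk = clf.predict_proba(voice_features("rec_new.wav").reshape(1, -1))[0, 1]
```

Crucially, only the feature vector needs to leave the device, which is what makes the “analysed but never recorded or listened to” promise plausible.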

With the right training data, perhaps the same technology could be trained to recognise whether your cough is, in fact, a new persistent dry cough.


Matthew Karas

Over the last 25 years, I have combined developing global-scale media production technology with advanced R&D in speech analysis and rich content search.