'You sound quite angry today, what is it?'

| Enith Vlooswijk

‘Over a pint’ is a series about science. UT researchers talk to science journalist Enith Vlooswijk about their field and the misconceptions that exist about it. Enith turns their input into writing and drawings. In this twelfth episode: Arjan van Hessen, researcher in the field of speech recognition.

Ideally, Arjan van Hessen would like to develop a self-learning algorithm that turns the computer into a fine conversation partner. Not one that gives short answers to simple questions like 'what time is it?' and 'what time does the supermarket close?', but one with whom you can converse for a full half hour.

'To me that seems like a really great next step: the computer not as a source of information, but as a source of sociability,' he says. 'It can hear whether I am happy, sad, or angry. It could then ask: gee, you sound pretty angry today, what happened?'

Van Hessen, a researcher in the field of speech recognition, will have to wait a while. The technology has reached the point where a computer program can convert about 93 percent of a spoken utterance into written language. Under ideal conditions, that is. When the researcher's algorithm converts a recording of this interview into written text, the result still turns out to be quite difficult to read. 'Buying shoes' becomes 'cooking shame', and the name Enith sounds like 'one in it' according to the algorithm. Even if an algorithm does reach that 93 percent, there is still a big difference between spoken and written language, Van Hessen explains. Our hesitations, mid-sentence interruptions and sloppily pronounced words are a nightmare for any transcription program. 'We rarely speak in grammatically correct sentences,' he says.

As a speech technologist, Van Hessen is a bit more aware of this than the average language user. 'Sometimes someone asks what my name is, but I pronounce the "H" of "Van Hessen" very minimally. You hear it only when you know it's there. That goes wrong very often, and then I think: that's how speech recognition goes. I realize how we communicate and what goes wrong.'

Now it is perfectly normal to ask Google for directions, but until 2010 it still seemed like far-fetched science fiction, Van Hessen recalls. That year, Microsoft presented speech recognition based on deep neural networks (self-learning algorithms). 'The performance was so much better than anything before that within a year just about everyone in the whole world embraced this method.' Meanwhile, everyday users of speech recognition don't even dwell on the complexity of the technology, the researcher notes. 'It's an invisible technology and people expect it to work a hundred percent of the time.'

The next big step in his field is a program that not only deciphers words, but also filters out the speaker's intent from that verbiage. Only then could the computer really function as a conversation partner. Van Hessen understands that there may be ethical objections to this. 'In the past, the computer was able to recognize the numbers zero to ten. No one loses sleep over that. Even now it often sounds so good that people don't realize they are talking to a computer. Some people think the system should make that known, others don't see the point. Those are pretty fundamental questions in my view.'
