There’s a very interesting discussion on this topic on IXDA. I have summarized the main reasons below.
- Imagine 4 people in a small office all talking to their computers every 2 seconds to say “new window… scroll down… stop… up… select file…”. Imagine how noisy a cube farm of just 20 people talking to their computers would get.
- We humans rely heavily on being able to communicate. Our survival as a species depends on it, and our success is a direct result of our ability to understand each other.
We are hard-wired to be really upset when we cannot make ourselves understood. At a gut level, miscommunication is a threat, so when the system doesn’t understand us, we lose trust in its ability to help us.
- The main problem with voice is still social/psychological. How do you talk to a machine?
I’ve looked at a bunch of Sync videos on YouTube, and people are obviously uneasy talking to their cars.
- It creates more cognitive load to verbalize what you want something on screen to do, say it, and then confirm that it has worked.
- Humans work better by recognition than by recall. Visual UIs aid recognition, while voice UIs essentially require good recall: you have to remember the exact command that will generate the desired response.
- Voice is essentially serial, as opposed to visual UI, which is parallel. This is one of the biggest drawbacks of voice-based interaction with a computer, and one of the reasons I think the iPhone’s visual voicemail was such a hit. In this respect, the computer would really need to reach the level of human-to-human interaction: just “knowing” when to interrupt and when to be interrupted, in order to carry out a serial interaction with almost parallel efficiency.
- Dealing with accents, sound levels, ambient noise, etc.
- The computer would need to understand what we ‘mean’, as opposed to a visual UI, where we simply click what the computer has to offer.
- Recognizers usually tend to misrecognize the short words that feel intuitive to the user, such as “back”, “next”, and “stop”. What you are left with as a designer is “Go back”, “Play next”, “Stop now”: words that consumers would never think to say, and that frankly irritate them.
- Let’s assume, though, that they do make the effort to learn the keywords, and are alone (or ignore the folks at the office). They open their mouth wide and say “Plaaay Neeeext,” only to be faced with their worst fear: “I’m sorry, I couldn’t understand that.”
- The system faces accuracy problems because of background noise.
- I find interactive voice response difficult at best, and frequently infuriating. As many people have been indicating, error rates are high, and what you intuitively think you need to say is not necessarily the command the system requires in order to take that action. My own experiences with interactive voice response have generally ended with me trying to circumvent the system by pressing the * key repeatedly, which usually does boot you out of the system and land you on the phone with a real live human.
- Even at 99.99% voice-recognition reliability (plus the absurd 100% natural-language-parsing reliability we see in the movies), every command interaction that involves a non-trivial, unrecoverable change in state is going to require a confirmation phase: “I think you said ‘Go Left’. Is that correct?”
One-way auditory signals are a great thing, even under high-stress conditions. Two-way auditory communication requires a mix of trust and half-duplex handshake negotiation, and that last bit is the deal-breaker for unreliable computer voice recognition.
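The confirmation phase described above can be sketched as a tiny command handler: destructive commands are echoed back and must be confirmed before they run, while trivial ones execute immediately. This is only an illustration of the pattern, not any real system; the command set and function names are hypothetical.

```python
# Hypothetical sketch of the confirmation phase: any command causing a
# non-trivial, unrecoverable state change is echoed back for confirmation.
UNRECOVERABLE = {"delete file", "send email", "go left"}

def handle_command(command, confirm):
    """Run `command`, asking `confirm(prompt)` first when it is destructive.

    `confirm` stands in for a second round of voice recognition, which is
    itself unreliable, so the handshake step can fail too.
    """
    if command in UNRECOVERABLE:
        reply = confirm(f"I think you said '{command}'. Is that correct?")
        if reply != "yes":
            return "cancelled"
    return f"executed: {command}"

# Usage: scripted 'users' standing in for the second recognition pass.
print(handle_command("go left", lambda prompt: "yes"))      # executed: go left
print(handle_command("go left", lambda prompt: "no"))       # cancelled
print(handle_command("scroll down", lambda prompt: "yes"))  # executed: scroll down
```

Note how every unrecoverable command costs an extra round trip; that overhead is exactly why a serial voice channel feels so much slower than a parallel visual one.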
I think voice UI will not become the primary mode of interaction in the near future, for obvious reasons. It’ll be used mostly where a visual UI is difficult to use, e.g. while driving a car, taking care of a baby, or performing a surgical procedure.
What’s your take on this?
Note: These are only some of the responses. You can visit the original IXDA post to view all of them.