While researching biometrics for mobile devices, we’ve seen a boundary-value use case in our own household. A member of our extended family recently moved to Texas and is attempting to acquire a Texas driving license. Although English is not her first language, she speaks it excellently, with an extensive vocabulary and grammar that would shame most high school students. However, she does have a noticeable accent.
Meanwhile, the Texas Department of Public Safety (DPS) uses an online voice authentication system for obtaining a driving license. To enroll, the user must repeat a sequence of numbers to the voice recognition software. And this is where the problem arose. She repeated the numbers as requested – numerous times – and we could understand what she was saying, but the DPS system could not. Therefore, she is unable to use the online system – and instead, she must call the help desk. This means more time for her and more expense for DPS.
Why is that? Who cares if she meets the software’s expectation of what a specific number – or word – should sound like? Authentication requires that, for a specific prompt, the user responds with an expected action. Failure to respond with the expected action results in a rejected login. As long as her voice sounds the same in the future as it does now, why does it matter whether her pronunciation of “five” matches what the machine expects to hear?
It should not matter whether or not the user repeats words that the software can understand. As long as the response is the same as at enrollment, in the same voice, then she should be authenticated. If anything, this allows for far stronger authentication. Suppose the voice recognition system asks the user to say, “My voice is my password.” Why can’t she respond with, “My hovercraft is full of eels”? Arguably, that is even stronger security. It’s hard enough to emulate someone else’s voice – imagine that a hacker must also guess what phrase the user actually said. Somebody tell me how to brute-force attack that!
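To put a rough number on that extra work: suppose the attacker can clone the voice perfectly and only has to guess the enrolled phrase. Even a modest secret phrase adds tens of bits of search space on top of the voice match. The vocabulary size and phrase length below are illustrative assumptions, not measurements of any real system.

```python
import math

# Illustrative assumptions: the user picked a 4-word phrase from a
# 2,000-word working vocabulary, and each word is independent.
vocabulary = 2000
phrase_length = 4

# Number of candidate phrases the attacker would have to try,
# and the equivalent entropy in bits.
phrase_space = vocabulary ** phrase_length
bits = math.log2(phrase_space)

print(f"{phrase_space:,} candidate phrases (~{bits:.0f} bits)")
# → 16,000,000,000,000 candidate phrases (~44 bits)
```

That is roughly 44 bits of extra entropy – before the attacker even gets to the problem of mimicking the voice itself.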
This seems really simple. To enroll a new user: prompt her and store the response. To authenticate: prompt her and compare the new response to the stored response. If they match within specified tolerances, authenticate. If they don’t match, don’t authenticate. It’s likely that if an attacker (or his voice software) says, “My voice is my password” while the system is expecting to hear something about a hovercraft, then the attacker will be rejected.
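The enroll-and-compare loop above can be sketched in a few lines of Python. Everything here is illustrative: the feature extractor is a crude stand-in (a real system would derive acoustic features such as MFCCs from the audio), and the tolerance value is arbitrary – the point is only the shape of the protocol, which never needs to understand what was said.

```python
import math

# Hypothetical store of enrolled templates: user -> feature vector.
enrolled: dict[str, list[float]] = {}

TOLERANCE = 0.25  # illustrative threshold, not a tuned value


def extract_features(audio: bytes) -> list[float]:
    """Crude stand-in for a real acoustic feature extractor.

    Uses utterance length and total signal energy as toy features;
    a real system would compute speaker-dependent acoustic features.
    """
    return [len(audio) / 10.0, sum(audio) / 1000.0]


def enroll(user: str, response_audio: bytes) -> None:
    """Prompt the user and store whatever response comes back."""
    enrolled[user] = extract_features(response_audio)


def authenticate(user: str, response_audio: bytes) -> bool:
    """Compare the new response to the stored one, within tolerance."""
    if user not in enrolled:
        return False
    distance = math.dist(enrolled[user], extract_features(response_audio))
    return distance <= TOLERANCE
```

With this sketch, enrolling on “My hovercraft is full of eels” and later repeating it authenticates; a different phrase – say, the expected “My voice is my password” – lands outside the tolerance and is rejected.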
The options range even wider. Imagine that a user is prompted to repeat the phrase, “My voice is my password,” but the only acceptable response is for the user to place her fingerprint against the touch screen, with the software performing a fingerprint match rather than a voice match. Or vice versa. We have in our hands the capability to make authentication hacking immensely more difficult, but instead we restrain ourselves with an arbitrary set of rules. It’s okay to deviate from expected behavior when you’re trying to keep the bad guys out.
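A minimal sketch of that cross-modality idea, assuming invented modality labels and an exact-match stand-in for the fuzzy biometric comparison: the system rejects any answer that arrives in the wrong modality, so an attacker who hears a voice prompt and replays speech fails when the enrolled answer was a fingerprint.

```python
from dataclasses import dataclass

# Illustrative modality labels; a real system would define richer types.
VOICE, FINGERPRINT = "voice", "fingerprint"


@dataclass
class Enrollment:
    prompt: str      # what the system asks, e.g. "My voice is my password"
    modality: str    # how the user chose to answer at enrollment
    template: bytes  # stored response template

store: dict[str, Enrollment] = {}


def register(user: str, prompt: str, modality: str, template: bytes) -> None:
    """Record the prompt plus whatever modality the user answered in."""
    store[user] = Enrollment(prompt, modality, template)


def verify(user: str, modality: str, sample: bytes) -> bool:
    """Reject unless the modality AND the template both match."""
    rec = store.get(user)
    if rec is None or modality != rec.modality:
        return False
    return sample == rec.template  # stand-in for a fuzzy biometric match
```

Here the prompt and the response are decoupled: the prompt is just a cue, and the binding between prompt and expected modality becomes part of the secret.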
Again, the sequence is simple: issue a prompt, record the response. The response need not have any logical relationship to the prompt. It just needs to be repeated later on when authenticating. Ask, and test for the expected action. That’s all there is to it.