An algorithm scrambles eavesdropping in real time

There was a time when fountains, judiciously placed in palaces or gardens, allowed conversations to be held away from prying ears. Our connected world requires even more precautions. Have you ever received an advertisement on one of your connected devices about the very subject you had just been talking about? If so, did you wonder whether your nearby phone, PDA, smartwatch or computer might be "spying" on you without your knowledge?

The scenario may still be fiction, but we are now surrounded by a multitude of microphones whose recordings can be analyzed by machine learning. Since these artificial intelligence (AI) programs learn to "understand" voices, such spying is technologically possible. "A very large amount of personal data is already used by machine learning. You have to, in a way, give the power back to the user," explains 24-year-old Franco-American researcher Mia Chiquier. With two other AI specialists from Columbia University (United States), Chengzhi Mao and Carl Vondrick, the scientist says she has found a way out: an almost inaudible, real-time voice camouflage that defeats this kind of eavesdropping in "80% of cases", "even if nothing is known about the position of the possible microphone in the room," she explains. The results of this work, entitled "Real-Time Neural Voice Camouflage", were published on arXiv on February 16 and presented at the prestigious International Conference on Learning Representations (ICLR) on April 25.

Since 2018, several studies have already focused on voice camouflage. "Each time, it is an algorithm that tries to deceive another by adding an intelligent noise, which the profession calls an adversarial attack," explains Mia Chiquier. But, so far, "algorithms that attack so-called ASR algorithms [automatic speech recognition, which transcribes voice into text] needed to listen to a speaker's whole sentence in order to analyze it and then scramble it." Logically, such programs could not work in real time, since their response, the intelligent noise, arrived too late.
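To make the idea concrete, here is a minimal sketch, in PyTorch, of what such an adversarial attack on a speech recognizer can look like. The ASR model below is a toy stand-in, not the authors' code: the attack searches for a quiet noise "delta", bounded by a loudness limit epsilon, that maximizes the recognizer's error on the true transcription.

```python
# Minimal adversarial-attack sketch against a toy speech recognizer.
# All names (ToyASR, adversarial_noise, epsilon) are illustrative assumptions.
import torch
import torch.nn as nn

class ToyASR(nn.Module):
    """Hypothetical stand-in for a speech-to-text model:
    maps a raw waveform to per-frame character logits."""
    def __init__(self, n_chars=28, frame=160):
        super().__init__()
        self.frame = frame
        self.net = nn.Sequential(
            nn.Linear(frame, 64), nn.ReLU(), nn.Linear(64, n_chars))

    def forward(self, wave):                            # wave: (batch, samples)
        frames = wave.unfold(1, self.frame, self.frame)  # (batch, T, frame)
        return self.net(frames)                          # (batch, T, n_chars)

def adversarial_noise(model, wave, target, epsilon=0.01, steps=100, lr=1e-3):
    """Gradient-ascent attack: find nearly inaudible noise that
    degrades the model's transcription of `wave`."""
    delta = torch.zeros_like(wave, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = model(wave + delta)                     # (batch, T, n_chars)
        loss = loss_fn(logits.flatten(0, 1), target.flatten())
        opt.zero_grad()
        (-loss).backward()                               # ascend: make the ASR wrong
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)              # keep the noise quiet
    return delta.detach()

# Usage: one second of 16 kHz audio and a dummy frame-level transcription.
wave = torch.randn(1, 16000)
target = torch.randint(0, 28, (1, 100))                  # 16000 / 160 = 100 frames
noise = adversarial_noise(ToyASR(), wave, target)
```

The limitation the researchers point out is visible in the loop: the attack needs the finished recording `wave` before it can compute `noise`, which is why earlier approaches could not run live.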

Predictive attacks

For live camouflage, the researchers had to anticipate a conversation before it took place... A challenge they took up by developing a novel approach: the creation of "predictive attacks". Their machine-learning software, dubbed NVC, which uses deep neural networks, needs only two seconds of a human voice to "understand" it and predict the possible sounds that will follow. Almost instantaneously, NVC then computes an attack that will scramble these possible sounds and disrupt the automatic speech recognition models trained to transcribe our speech, and, perhaps, to spy on us.
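A minimal sketch of that "predictive" idea follows; it is my own simplification under assumed parameters (16 kHz audio, a two-second context window), not the authors' NVC implementation. A small network listens to the last two seconds of speech and outputs the camouflage noise to play over the next chunk, so the jamming signal is ready before the words it must cover are spoken.

```python
# Predictive-attack sketch: forecast the jamming noise for the NEXT audio chunk
# from the past two seconds. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

SR = 16000             # assumed sample rate
CONTEXT = 2 * SR       # two seconds of past audio
FUTURE = SR // 2       # half a second of noise predicted ahead

class PredictiveJammer(nn.Module):
    """Hypothetical forecaster: past waveform -> future perturbation."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.head = nn.Linear(32, FUTURE)

    def forward(self, past):                     # past: (batch, CONTEXT)
        h = self.encoder(past.unsqueeze(1))      # (batch, 32, frames)
        h = h.mean(dim=2)                        # pool over time -> (batch, 32)
        return 0.01 * torch.tanh(self.head(h))   # quiet noise for the next chunk

# Streaming loop: while the noise for chunk t is playing through a speaker,
# the model is already predicting the noise for chunk t+1 from what it just heard.
jammer = PredictiveJammer()
past_audio = torch.randn(1, CONTEXT)             # stand-in for a live microphone buffer
future_noise = jammer(past_audio)                # to be played over the next half-second
```

In training, such a forecaster would be optimized so that adding its predicted noise to the speech that actually follows maximizes the error of an ASR model, the same objective as the per-recording attack above, but computed ahead of time instead of after the fact.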
