Microsoft's New AI Can Simulate Your Voice in 3 Seconds

This is big news, right?

However, Microsoft has been unusually quiet about the new artificial intelligence. There were no press releases or other significant announcements this week.

The model, called VALL-E, has sent ripples throughout the tech community. Despite the company's uncharacteristic reluctance to market the launch, the implications of the neural codec language model are undeniable.

The researchers' paper, entitled "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers," is a tour de force of technical jargon and acronyms. Still, the abstract succinctly states the capabilities of VALL-E: "with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt, VALL-E can synthesize high-quality personalized speech."

In layman's terms, Microsoft has developed a tool that can mimic speech with unprecedented accuracy, requiring only a brief recording of a person's voice to do so. The implications of this technology are both impressive and alarming.

But What Does This Mean?

On the one hand, the potential for this technology to revolutionize industries such as customer service and podcasting is staggering. On the other hand, the prospect of scammers and cybercriminals utilizing this technology to impersonate individuals is a genuine concern.

At the moment, it's impossible to know just how good VALL-E is, since Microsoft has yet to release the tool to the public, although it has provided samples of the model's output. If the mimicry really requires only three seconds of audio, and the cloned voice can then go on to speak for any length of time, that is very impressive.

If it's as good as Microsoft says it is, and can convey human characteristics like charisma, one could see why Microsoft is reportedly in talks to invest $10 billion in OpenAI, the maker of ChatGPT.

VALL-E's Scammer Concerns

Microsoft trained the new VALL-E text-to-speech (TTS) system on 60,000 hours of English-language speech. The tech firm used Meta's LibriLight audio library, which contains recordings from more than 7,000 speakers.

Surprisingly, the TTS tech can copy a speaker's diction and manner of speech. Most of VALL-E's audio is so similar to the original recordings that you are unlikely to notice any difference.

This is where the problem starts.

The potential for abuse by scam artists goes through the roof. If a scam artist can get you to talk on the phone for three seconds, they can steal your voice. Imagine if they then called your grandma, or bypassed a voice-recognition security system.

You'll be relieved to hear that the researchers have spotted this potential for misuse. Microsoft states: "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker."

What's The Solution?

The researchers say they are building a detection system to flag audio synthesized by VALL-E.

This may leave people wondering: "Why did you do this at all, then?"

Quite often with medical and technological advances, the answer is: "Because we could."

Creative AIs like DALL-E, ChatGPT, various deepfake algorithms, and countless others make it feel like we are at an inflection point where these technological advances are breaking out of laboratories and into the real world.

As with all change, this shift brings exciting opportunities along with risks. We truly live in interesting times.