I used A.I. software to clone my own voice. I liked the results, but I wanted feedback.
I recorded the first 12 paragraphs of this essay as a test. Seven randomly selected paragraphs were in my real voice, while the other five were generated by A.I. I asked my relatives whether they could tell the difference.
My mother was baffled. “All the paragraphs sounded like you,” she told me afterward. She suspected that some of them were computer-generated, but she identified only five of the 12 paragraphs correctly.
My other relatives did better. My wife, sister, brother, and mother-in-law got all 12 paragraphs right. My dad scored 10 out of 12.
My ego suffered when I opened the experiment to the internet (try your luck here).
“The real voices had much more richness and emotional flavor,” one anonymous participant remarked. “The A.I. voices sounded like a mopey person with a cold. I hope that’s true! I’ve never seen you.”
That “mopey person with a cold” was me, not the A.I. My real voice, apparently, is the one lacking in tone and cadence.
A friend from grad school guessed wrong 11 times out of 12. Ten of 12 former employees were wrong.
Among people who don’t know me, only 54% guessed right. Listen to the results, with each speaker identified:
Despite its flaws, my cloned voice was impressive. Making it was inexpensive and simple.
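For a sense of scale on the scores above, here is a rough chance baseline, offered as a sketch rather than a proper analysis. It assumes a hypothetical listener who flips a fair coin for each of the 12 paragraphs and ignores everything else, including the 7-to-5 split between real and A.I. paragraphs. The function name and the coin-flip model are illustrative assumptions, not part of the experiment itself.

```python
from math import comb

PARAGRAPHS = 12  # number of paragraphs in the listening test


def prob_at_least(k, n=PARAGRAPHS, p=0.5):
    """Probability of getting at least k of n paragraphs right
    when each guess is an independent coin flip with success chance p
    (an assumed model of a listener who cannot hear any difference)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance of a perfect score by pure guessing
print(f"all 12 right:     {prob_at_least(12):.5f}")
# Chance of matching or beating a 10-out-of-12 score by pure guessing
print(f"10 or more right: {prob_at_least(10):.5f}")
```

Under this model a pure guesser averages 50% per paragraph, a perfect score comes up about once in 4,096 tries, and 10 or more correct answers happen about 2% of the time.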
Voice Cloning Has Improved in Three Years
In 2020, MIT researchers and Respeecher created a fake video of Richard Nixon announcing that the Apollo 11 Moon landing had ended in disaster. A behind-the-scenes video shows how Nixon’s voice was cloned. The MIT researchers collected hundreds of brief audio clips of Nixon and had a voice actor record himself saying the same words. The actor then recited Nixon’s alternate Moon landing address, and the software made him sound like Nixon.
Last year, Respeecher won a contract to clone James Earl Jones’s voice as Darth Vader for future Star Wars movies. This kind of service is expensive. “A project usually takes several weeks with fees from 4-digit to 6-digit in $USD,” Respeecher told me when I tried their service.
Instead of spending thousands, I chose Play.ht, a small startup. I uploaded a 30-minute recording of myself reading a text of my choice and waited a few hours.
I didn’t need a voice actor because Play.ht uses text-to-speech. Once the system had been trained on my voice, it could turn written text into realistic human speech in minutes. It was also free: I cloned my voice on Play.ht’s free plan. Commercial plans start at $39/month.
Realistic text-to-speech systems like Play.ht’s are difficult to design because people pronounce words differently depending on context. We follow intricate, mostly subconscious rules about which words to emphasize, depending on what comes before or after a word in a phrase.
Human speech also has an element of randomness. Sometimes we pause, stop to think, or get sidetracked. Any system that pronounces words and sentences the same way every time will sound robotic.
Because Respeecher follows the voice actor’s lead, it doesn’t have to worry about these issues. A text-to-speech system like Play.ht’s, by contrast, must understand human speech well enough to know how long to pause, which words to emphasize, and so on.
Play.ht uses a transformer, a type of neural network introduced by Google researchers in 2017 that has underpinned many generative A.I. systems since. The T in GPT, OpenAI’s family of large language models, stands for transformer.
Transformer models are powerful because they can “pay attention” to several inputs at once. Play.ht’s model considers the entire phrase while generating audio for each new word. It can adjust speech speed, intensity, and other features to match the cloned voice.
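To make the idea of “paying attention” to a whole phrase a little more concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation in transformer models. It is a toy illustration, not Play.ht’s model: the word vectors are made-up numbers, and the learned query, key, and value projections used in real transformers are replaced with the identity for brevity.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention over a list of word vectors.

    Simplifying assumption: queries, keys, and values are the input
    vectors themselves (real models apply learned projections first).
    Returns the blended output vectors and the attention weights.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the phrase, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The output is a weighted mix of all the word vectors.
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-word phrase.
phrase = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(phrase)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word draws on every word in the phrase. Because every output is a mix of the whole input, the representation of a word can reflect what comes before and after it.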
Text-to-Speech Voice Cloning Challenge
Play.ht is aimed at creative people who make podcasts, audiobooks, instructional videos, TV ads, and the like. The startup is an underdog compared with Descript, a powerful audio editing application.
The 2017 version of Descript automatically transcribed audio. When a user deleted words from the transcript, Descript would remove them from the audio file as well.
Descript acquired Lyrebird, a voice-cloning startup, in 2019. Since 2020, Descript’s Overdub feature has let you add words to a transcript and produce realistic audio of your voice saying them. Like Play.ht, Overdub needs a long voice sample for training.
To test Overdub, I prepared another 12-paragraph audio clip using Descript and asked relatives and friends to tell which paragraphs were my real voice and which were generated by Overdub. This wasn’t a scientific trial, but Play.ht’s cloned voice seemed more believable than Descript’s Overdub. Here is Overdub’s output next to my voice:
Understanding AI, a newsletter on A.I. and how it affects our world.
Because the two products have slightly different use cases, this may not matter much. Play.ht excels at creating long audio files, such as audiobooks. Overdub is meant for adding brief phrases to an existing recording. I think Overdub’s voices are realistic enough for this application because synthetic voices are harder to detect in short audio samples.
Descript also offers other A.I.-based audio enhancement techniques. Studio Sound uses A.I. to make audio recorded on a low-quality microphone in a noisy setting sound as if it had been recorded in a studio. It removes background noise and subtly modifies the speaker’s voice so that it sounds as if it came from a better microphone.
Descript can also add slight background noise to a new audio clip so that it matches the surrounding audio.
Independent creative workers benefit from tools like these because they reduce the post-production effort needed to deliver high-quality audio. But the same tools may also benefit criminals and other troublemakers.
Voice Cloning Dangers
The Washington Post reported last month that fraudsters using voice cloning tricked a Canadian grandmother. A man who sounded like her grandson Brandon called, supposedly from jail, to ask for money.
The woman and her husband “dashed to their bank in Regina, Saskatchewan, and withdrew 3,000 Canadian dollars ($2,207 in U.S. currency), the daily maximum,” the Post said. “They raced to a second branch for money.”
Luckily, the manager at the second branch warned them that the call was likely a scam. Brandon was fine and did not need the money. But schemes like these will almost certainly become more common in the coming years.
Fake recordings of celebrities like Joe Biden and Taylor Swift saying humorous and occasionally offensive things have proliferated in recent months. Duncan Crabtree-Ireland, the executive director of SAG-AFTRA, the union that represents actors, singers, and broadcast journalists, is worried about the trend. He fears that people will use voice cloning to produce fake celebrity endorsements, defrauding customers and depriving his members of income.
Fake audio could do even more harm. Voice cloning could be used to embarrass celebrities or ordinary people with fake, sexually explicit audio clips. In the last days of an election, political operatives might deceive voters with fake audio. Imagine someone circulating embarrassing fake recordings of a political candidate on social media.
Executives at Play.ht and Descript understand these risks. Play.ht CEO Hammad Syed told me that the company manually reviews training audio and automatically flags attempts to generate racist or sexually explicit audio.
Descript checks for unauthorized voice cloning. When a user creates a new Overdub voice, the software asks the voice’s owner to read a short statement into the microphone agreeing to have their voice cloned. Descript then verifies that the voice from the microphone matches the voice in the training audio file. This should prevent people from using Overdub for impersonation schemes or celebrity voice cloning.
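Nothing above says how Descript performs this match. One common approach to speaker verification in general is to convert each recording into a fixed-length “embedding” vector and compare the two by cosine similarity. The sketch below illustrates only that comparison step. The embedding values, the threshold of 0.8, and the function names are all hypothetical, and in practice the vectors would come from a trained speaker-recognition model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(consent_embedding, training_embedding, threshold=0.8):
    """Accept the pair only if the two voice embeddings are close enough.

    The threshold is an illustrative assumption; a real system would
    tune it on labeled recordings.
    """
    return cosine_similarity(consent_embedding, training_embedding) >= threshold


# Made-up embeddings standing in for the output of a speaker model.
consent_clip = [0.9, 0.1, 0.4]      # person reading the consent statement
matching_audio = [0.8, 0.2, 0.5]    # training audio from the same voice
different_audio = [-0.3, 0.9, 0.1]  # training audio from someone else

print(same_speaker(consent_clip, matching_audio))
print(same_speaker(consent_clip, different_audio))
```

With these toy numbers the first pair passes the check and the second is rejected.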
Unlike Play.ht, however, Descript doesn’t restrict what users generate with Overdub once a voice has been created.
ElevenLabs software was used in many recent celebrity voice-cloning videos. In January, 4chan users started using ElevenLabs software to create fake videos of celebrities delivering hate speech. ElevenLabs responded by removing voice cloning from its free tier and releasing a tool to detect fake videos.
No one I interviewed for this piece thought this technology should be regulated by the government.
SAG-AFTRA’s Crabtree-Ireland told me, “We’re not looking to ban technology or halt forward progress on technology. We are instead looking to work with companies developing these technologies to make sure it’s respectful.” He said he’s gotten a “surprisingly positive reaction” when he’s approached technology companies about safeguards.
Voice-cloning software will soon be efficient enough to run on a PC, which would make such regulation largely pointless. Once that happens, governments will struggle to restrict its distribution and use.
So publicizing the existence of high-quality voice cloning software may be the best way to prevent its abuse. Most abuses of voice cloning rely on people mistaking fake audio for the real thing. If people know about voice cloning, they may be more skeptical of what they hear.