My mom couldn't tell the difference after I cloned my voice with AI

I used A.I. voice-cloning software to clone my own voice. I liked how the results sounded, but I needed outside feedback.

As a test, I made a recording of the first 12 paragraphs of this essay. Seven randomly selected paragraphs were in my real voice, and the other five were generated by A.I. I asked my relatives to tell them apart.

My mother was baffled. “All the paragraphs sounded like you,” she told me afterward. She suspected some passages were computer-generated, but she identified only five paragraphs correctly.

My other relatives did better. My wife, sister, brother, and mother-in-law got all 12 paragraphs right. My dad scored 10 out of 12.

My ego suffered when I opened the experiment to the internet (try your luck here).

“The real voices had much more richness and emotional flavor,” one anonymous participant wrote. “The A.I. voices sounded like a mopey person with a cold. I hope that’s true! I’ve never seen you.”

That “mopey person with a cold” was the real me, not the A.I. Apparently my natural voice lacks tone and cadence.

A friend from grad school guessed wrong on 11 of the 12 paragraphs. A former colleague got 10 of the 12 wrong.

Among people who don't know me, only 54% guessed right. Here is the recording, with the speaker identified for each paragraph:

Despite its flaws, my cloned voice was impressive. Creating it was cheap and easy.

Voice Cloning Has Improved a Lot in Three Years

In 2020, MIT researchers worked with Respeecher to create a fake video of Richard Nixon announcing that the Apollo 11 Moon landing had ended in disaster. A behind-the-scenes video shows how Nixon’s voice was cloned. The MIT researchers collected hundreds of brief Nixon audio clips and had a voice actor record himself saying the same words. The actor then recited Nixon’s alternate Moon landing address, and the software made him sound like Nixon.

Last year, Respeecher won a contract to clone James Earl Jones’s voice as Darth Vader for future Star Wars productions. The service is expensive. “A project usually takes several weeks with fees from 4-digit to 6-digit in $USD,” Respeecher told me when I asked about their service.

Instead of spending thousands, I chose Play.ht, a small startup. I uploaded a 30-minute recording of myself reading a text of my choice and waited a few hours.

I didn’t need a voice actor because Play.ht uses text-to-speech. Once trained on my voice, the system could generate realistic human speech from written text in minutes. It also cost nothing: I cloned my voice on Play.ht’s free plan. Commercial plans start at $39/month.

Realistic text-to-speech systems like Play.ht are difficult to build because people pronounce words differently depending on context. We follow intricate, mostly subconscious rules about which words to emphasize, based on what comes before or after a word in a sentence.

Human speech also has an element of randomness. Sometimes we pause, think, or get distracted. Any system that pronounces words and sentences the same way every time will sound robotic.

Respeecher doesn’t have to worry about any of this, because it follows the voice actor’s lead. A text-to-speech system, by contrast, must understand human speech well enough to know how long to pause, which words to emphasize, and so on.

Play.ht uses a transformer, a neural network architecture that Google researchers introduced in 2017 and that has since become the foundation of many generative A.I. systems. The T in GPT, OpenAI’s family of large language models, stands for transformer.

Transformer models are powerful because they can “pay attention” to many parts of the input at once. Play.ht’s model considers the entire phrase while generating audio for a new word. It can adjust speech speed, intensity, and other features to match the cloned voice.
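To make “paying attention” a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the 2017 transformer design. This is not Play.ht’s code, which I haven’t seen. The function names and the tiny two-number vectors standing in for three words are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Score every input position against the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by how relevant each position is.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors for a three-word phrase (values are made up).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.2, 0.9], [0.7, 0.1], [0.5, 0.5]]
query = [1.0, 0.0]  # stands in for the word currently being generated

weights, output = attend(query, keys, values)
```

The weights cover every word in the phrase at once, so the output for the current word is shaped by all of its neighbors. A real model stacks many such steps and learns the vectors from data rather than having them typed in by hand.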

Text-to-Speech Voice Cloning Challenge

Play.ht is aimed at creators who make podcasts, audiobooks, instructional videos, TV ads, and the like. The startup is an underdog compared with Descript, a powerful audio editing application.

The 2017 version of Descript automatically transcribed audio. When you deleted words from the transcript, Descript removed them from the audio file too.

Descript acquired Lyrebird, a voice-cloning startup, in 2019. Since 2020, Descript’s Overdub feature has let you add words to a transcript and generate realistic audio of your own voice saying them. Like Play.ht, Overdub needs a long voice sample for training.

To test Overdub, I used Descript to prepare another 12-paragraph audio clip and asked relatives and friends to tell my real voice from the Overdub-generated paragraphs. This wasn’t a scientific trial, but Play.ht’s cloned voice seemed more convincing than Descript’s Overdub. Here is Overdub’s output next to my real voice:
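The idea of editing audio by editing its transcript can be sketched in a few lines. This is only an illustration of the concept, not Descript’s implementation. It assumes each transcribed word comes with the range of audio samples it covers and that the words tile the recording with no gaps. The Word class, the delete_words function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: int  # index of the word's first audio sample
    end: int    # index one past the word's last audio sample

def delete_words(words, samples, doomed):
    """Drop the words whose positions are in `doomed`, along with their audio."""
    kept_words, kept_samples = [], []
    for index, word in enumerate(words):
        if index in doomed:
            continue  # skip both the text and its stretch of audio
        new_start = len(kept_samples)
        kept_samples.extend(samples[word.start:word.end])
        # Re-time the surviving word against the shortened audio.
        kept_words.append(Word(word.text, new_start, len(kept_samples)))
    return kept_words, kept_samples

# Made-up data: 12 audio samples, three per word.
samples = list(range(12))
words = [Word("I", 0, 3), Word("um", 3, 6), Word("really", 6, 9), Word("agree", 9, 12)]

edited_words, edited_samples = delete_words(words, samples, {1})  # cut "um"
transcript = " ".join(w.text for w in edited_words)
```

Deleting one word from the text removes the matching slice of audio and shifts the timing of every word after it, which is why the transcript and the recording stay in sync.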

Understanding AI, a newsletter about A.I. and how it affects our world.