I Ran OmniVoice Again, Timed the Dub, and Hit Some Bizarre Errors
This time I dubbed an English video into Korean. I timed each step of a single dub, and even ran into robotic noise where a voice should have been.
Following up on the last post, I spent a bit more time with OmniVoice. This time I was curious about two things. One was how long it actually takes to dub a single video. The other was the opposite direction from last time: what happens if I turn an English video into Korean.
So I grabbed a short clip of a Trump speech (in English) and dubbed it into Korean. Here’s the original first:
Original: the English clip I wanted to dub into Korean
And here’s the result after OmniVoice dubbed it into Korean. It cloned the original voice and had it speak Korean:
OmniVoice: dubbed English → Korean, with the original voice cloned
How long does a single dub take
Taking one 22-second clip all the way from transcription through translation and voice synthesis to export took about 3 minutes in total. Everything ran locally on my MacBook, with no internet connection. Here is the breakdown by step:
- Prep (pulling the audio out of the video and splitting voice from background): about 7 seconds
- Transcription (turning the speech into text): about 29 seconds
- Translation (English to Korean): about 90 seconds
- Building the voice profile (registering the original voice): about 5 seconds
- Voice synthesis + cloning: about 49 seconds
- Export (merging it back into the video): about 2 seconds
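To see where the time actually goes, here is a small Python sketch that adds up the numbers above. The step names are just labels I made up for the dictionary keys; the durations and the 22-second clip length are the ones from this post.

```python
# Step durations in seconds, copied from the list above.
STEP_SECONDS = {
    "prep": 7,
    "transcription": 29,
    "translation": 90,
    "voice profile": 5,
    "synthesis + cloning": 49,
    "export": 2,
}
CLIP_SECONDS = 22  # length of the source clip

total = sum(STEP_SECONDS.values())
# Seconds of processing needed per second of video.
realtime_factor = total / CLIP_SECONDS


def share(step: str) -> float:
    """Return the fraction of the total time spent in one step."""
    return STEP_SECONDS[step] / total


if __name__ == "__main__":
    for name, secs in STEP_SECONDS.items():
        print(f"{name:<22}{secs:>4}s  {share(name):6.1%}")
    print(f"{'total':<22}{total:>4}s  ({realtime_factor:.1f}x the clip length)")
```

The steps add up to 182 seconds, which is where the "about 3 minutes" comes from. Translation alone is roughly half of that, so it is the first place to look if you want a faster dub.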
One fun detail: the first synthesis run takes longer, and running it again takes roughly half the time. The first run has to load the AI model into memory, and later runs skip that cost.
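The cold-versus-warm effect is easy to reproduce in general. The sketch below is not OmniVoice code: `load_model` and `synthesize` are hypothetical stand-ins, and the sleeps only simulate a slow model load and a faster per-request step. It just shows why a cached model makes the second run cheaper.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model() -> dict:
    """Stand-in for loading model weights; the sleep simulates the slow load."""
    time.sleep(0.2)
    return {"name": "placeholder-model"}


def synthesize(text: str) -> str:
    """Hypothetical synthesis call: fetch the cached model, then do the work."""
    model = load_model()
    time.sleep(0.05)  # simulated per-request work
    return f"{model['name']}: {text}"


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


_, cold = timed(synthesize, "first run")   # pays for the model load
_, warm = timed(synthesize, "second run")  # model is already in memory
print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")
```

When you benchmark a tool like this, it helps to report the first run and the later runs separately so the load time does not skew the comparison.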
Which model ran at each step
The dub is split into steps, and a different model handles each one:
- Splitting voice and background: Demucs
- Transcription: WhisperX
- Word timing: wav2vec2
- Speaker separation (telling apart who’s talking): WavLM
- Translation: gemma2:27b (better quality than the built-in translator)
- Voice synthesis + cloning: OmniVoice
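One way to picture this is as an ordered list of stages, each tied to one model. The sketch below is only an illustration of that structure: the `Stage` class, the `passthrough` placeholder, and the `clip.mp4` filename are all hypothetical, the stages do no real work, and I am assuming they run in the order listed above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str                     # what the step does
    model: str                    # model named in the list above
    run: Callable[[dict], dict]   # takes the shared context, returns a new one


def passthrough(key: str) -> Callable[[dict], dict]:
    """Build a placeholder stage that only records that it ran."""
    def run(ctx: dict) -> dict:
        return {**ctx, "done": ctx.get("done", []) + [key]}
    return run


PIPELINE = [
    Stage("separate voice/background", "Demucs", passthrough("separate")),
    Stage("transcribe", "WhisperX", passthrough("transcribe")),
    Stage("align words", "wav2vec2", passthrough("align")),
    Stage("separate speakers", "WavLM", passthrough("diarize")),
    Stage("translate", "gemma2:27b", passthrough("translate")),
    Stage("synthesize + clone", "OmniVoice", passthrough("synthesize")),
]


def run_pipeline(ctx: dict) -> dict:
    """Pass the context through every stage in order."""
    for stage in PIPELINE:
        ctx = stage.run(ctx)
    return ctx


result = run_pipeline({"video": "clip.mp4"})
print(result["done"])
```

Keeping each step as its own entry is also what makes it possible to swap one model out, as I did with the translator, without touching the rest.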
It wasn’t all smooth
Two things tripped me up along the way.
First, in some spots where a voice should have been, I got crushed, staticky noise instead of a human voice. I was asking it to produce Korean from an English voice sample, and the cloned voice, imitating a language the sample never contained, sometimes came out broken. I switched to a setting that refines the synthesis over more passes, re-ran it, and the Trump video came out fine.
Second, one translated sentence came out completely different from the original, so I had to fix it by hand.
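For the static problem, I re-ran things by hand, but the same idea could be automated: check whether the output looks like noise, and if so, retry with more refinement passes. The sketch below is a rough illustration, not OmniVoice's API. `fake_synthesize` is a made-up placeholder that returns noise below 32 steps and a clean tone otherwise, the step values are arbitrary, and the zero-crossing check is a crude heuristic (real speech also has noisy consonants, so the threshold would need tuning).

```python
import math
import random


def zero_crossing_rate(samples: list[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


def looks_like_static(samples: list[float], threshold: float = 0.3) -> bool:
    """White noise flips sign about half the time; voiced audio far less."""
    return zero_crossing_rate(samples) > threshold


def fake_synthesize(steps: int, rate: int = 16000) -> list[float]:
    """Placeholder synthesizer: noise at low step counts, a 200 Hz tone otherwise."""
    rng = random.Random(0)
    if steps < 32:
        return [rng.uniform(-1, 1) for _ in range(rate)]
    return [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]


def synthesize_with_retry(step_options=(16, 32, 64)):
    """Try increasing step counts until the output stops looking like static."""
    for steps in step_options:
        audio = fake_synthesize(steps)
        if not looks_like_static(audio):
            return steps, audio
    raise RuntimeError("every attempt still sounded like static")


steps_used, _ = synthesize_with_retry()
print(steps_used)
```

More passes cost more time, so starting low and only escalating on the segments that break keeps the overall dub from slowing down.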
So, the takeaway
Of all the open-source dubbing tools I’ve used, this one was the easiest to install, close to a single click. It also ran more smoothly than any of the others I’ve tried, which I appreciated. That said, the output quality isn’t at a level I’m happy with yet.
Has anyone else here used OmniVoice? I’d love to hear what kind of videos you tried and how the quality turned out. I ran it on a Mac, so I’m also curious to hear from people who’ve used it on other setups.
What I liked
- About as easy to install as one click (the easiest of any open-source tool I've tried)
- Ran more smoothly than any open-source tool I've tried so far
- Fast processing (about 3 minutes for a 22-second clip)
What I didn't
- The output quality isn't satisfying yet
- Cross-language cloning (one language's voice speaking another) sometimes breaks into robotic noise