Video content plays a big role in our marketing and website development for metabox.io. Our team produces user guides, product demos, and technical tutorials to help users better understand Meta Box and use it more easily.
- 1. Our First Attempt: Building Our Own Text to Speech Tool
- 2. ElevenLabs: When AI Starts to Sound Human
- 3. How Our Team Uses ElevenLabs in Practice
- 4. Some Tips We Learned from Using ElevenLabs Day to Day
- 5. Switching to Minimax: Is It a Good Alternative?
- 6. Free Version from Minimax – Enough for Basic Needs
- 7. Paid Versions from Minimax – Better for Long Videos & Team Work
- 8. Testing with Our Native Language
- 9. Last Words
Besides the usual challenges of producing technical videos, we ran into another major issue: language. All Meta Box videos are in English, since our users come from all over the world. The thing is, everyone in the Meta Box team is Vietnamese, and English isn’t our native language. Because of that, recording English voice-overs that sound smooth and natural, like native speakers, has always been a real challenge for us.
We also learned that even in your mother tongue, producing a clean, clear, and well-paced voice recording isn’t easy at all. There are so many factors that affect audio quality: the microphone, recording environment, voice tone, echo, breathing, pauses, emotion, and more. Sometimes everything sounds perfectly fine in normal conversation, but once the microphone is on, we end up recording over and over again and still feel unsatisfied with the result.
So, we had to try many different approaches:
- Recording the voice-overs ourselves (Janessa Tran and Anh Tran);
- Hiring native English speakers;
- Hiring English-major students with good pronunciation.
All of them involved real people. The pronunciation was generally good, but there was another issue: even the native speakers didn’t fully understand the product or the technical context, which often led to misinterpretations. On top of that, the production speed couldn’t keep up with our content needs, and the cost was far from cheap.
That’s when we started looking for a more scalable solution.
Our First Attempt: Building Our Own Text to Speech Tool
To avoid relying on human voice actors, the team tried building a text-to-speech (TTS) system using Google’s available tools.

Anh Tran personally built this tool on our internal website so that the entire team could use it.
Although it was based on Google’s TTS, Anh Tran customized all the default settings to match our needs, for example, presetting the voice, pitch, rate, and volume. This way, every time we needed audio, we only had to paste in the text and generate it, which saved a lot of time.
Pros:
- Very fast
- Low cost (almost free, since we didn’t use up Google’s credits)
Cons:
- Lacked emotion and natural rhythm
- Native listeners could feel tired or bored after listening for a while
You can see one of our old videos using TTS voice-over here.
This approach worked well for short technical videos. For longer tutorials or marketing videos, the listening experience wasn’t great, and we even received some complaints from users. Eventually, we had to move on.
ElevenLabs: When AI Starts to Sound Human
I had purchased an ElevenLabs license early on, but at the time, the quality wasn’t convincing. The voice sounded stiff, accents were inconsistent, and the cloned voice didn’t really sound like me (Janessa); at times it even leaned toward an Indian accent. Honestly, it just wasn’t me.
About a year later, I gave it another try and was genuinely surprised. Using the same old voice sample, ElevenLabs produced a much better clone:
- More natural pauses;
- Better rhythm and breathing;
- A speaking style that closely matched my own.
In some cases, the AI voice even sounded clearer and easier to listen to than my real voice, simply because the pronunciation was more consistent and polished. From that point on, our team decided to switch entirely to ElevenLabs for English technical videos.
How Our Team Uses ElevenLabs in Practice
When you log into ElevenLabs, you’ll see a dashboard with many sections and features, including Text to Speech, Voice Changer, Sound Effects, Image & Video, and more.

At first glance, it might feel overwhelming, but these features aren’t just there for show. Voice Isolator helps remove background noise, Sound Effects let you add audio effects, and Speech to Text allows you to extract transcripts from old videos. If you’re creating more advanced content, each of these features has its own practical use.
But for our use case, adding voice-overs to Meta Box technical videos, the team focuses almost entirely on one section: Voices.
After preparing a script carefully, we go to Voices > My Voices, where we manage all created voices.

Here, we simply click Create or Clone a Voice. A popup appears, allowing you to create a new voice from scratch, do a quick clone, or perform a more advanced clone.

For our workflow, we mainly use Instant Voice Clone, because it’s the easiest option and fits our needs best. With just a short recording, you can already get a voice clone that’s good enough for tutorial videos.
From our experience:
- No need for a professional studio, just a quiet environment;
- Speak naturally, the same way you usually do in videos;
- 1–2 minutes of clean audio is enough.
You can upload an existing audio file or record directly in the tool. After uploading, the system processes the file and creates a new cloned voice in My Voices. You can listen to each voice, compare them, and pick the one you like most. Then, simply paste your script into Text to Speech to generate the audio.

You can also fine-tune parameters like speed, stability, similarity, and other settings to get the result you want.
Some Tips We Learned from Using ElevenLabs Day to Day
One small but important thing we learned from real usage is to keep each generation under 300 characters. When the text is too long, the later part often becomes rushed, lacks emphasis, or doesn’t sound the way you expect.
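As an illustration, this splitting could be automated with a small helper like the one below. It is only a sketch: the function name `split_script` and the rule of breaking at sentence ends are assumptions for this example, not part of ElevenLabs or of any tool mentioned in this post.

```python
import re


def split_script(text: str, limit: int = 300) -> list[str]:
    """Split `text` into chunks shorter than `limit` characters.

    Chunks break at sentence ends where possible (hypothetical rule).
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    max_len = limit - 1  # "under 300 characters" means at most 299

    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence that is too long is cut at the last space that fits,
        # or hard-cut if it contains no space at all.
        while len(sentence) > max_len:
            cut = sentence.rfind(" ", 0, max_len + 1)
            if cut <= 0:
                cut = max_len
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into Text to Speech as its own generation, and the resulting audio files joined in the editing stage.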
I also make it a habit to listen carefully to every generated audio file before sending it to the editing stage. ElevenLabs still makes occasional mistakes. For example, “MB” might be read as “MMMMMMB”, or terms like “custom post type” may be broken up unnaturally. Sometimes, I have to regenerate the audio a few times to get a good version.
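To know where to listen most closely, one option is to flag the script segments that contain terms known to cause trouble. The sketch below is a hypothetical helper, not a feature of ElevenLabs: the name `flag_risky_segments` and the `RISKY_TERMS` list are assumptions, seeded with the two examples mentioned above.

```python
import re

# Terms reported above as occasionally misread; extend the list as needed.
RISKY_TERMS = ["MB", "custom post type"]


def flag_risky_segments(segments, terms=RISKY_TERMS):
    """Return (index, matched_terms) for every segment containing a risky term.

    Matching is case-insensitive and only on whole words or phrases.
    """
    flagged = []
    for index, segment in enumerate(segments):
        hits = [
            term
            for term in terms
            if re.search(rf"\b{re.escape(term)}\b", segment, re.IGNORECASE)
        ]
        if hits:
            flagged.append((index, hits))
    return flagged


segments = [
    "Install MB Views first.",
    "Open the settings page.",
    "Create a custom post type named Book.",
]
for index, hits in flag_risky_segments(segments):
    print(f"Segment {index}: double-check {', '.join(hits)}")
```

This does not replace listening to every file. It only points to the segments where a misread is most likely.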
Currently, our team uses the Starter plan. In the past, this plan allowed us to generate enough audio for several videos per month. However, after ElevenLabs reduced the credit limits, the Starter plan now only covers about one or two videos. With our workflow requiring multiple regenerations for editing, this plan no longer meets our needs. Upgrading to higher tiers is possible, but the cost increases quite quickly. That’s why we started looking into alternative tools.
Switching to Minimax: Is It a Good Alternative?
When testing Minimax using the same English script, here’s what I noticed:
- The voice quality is quite good. Personally, I find it slightly more natural than ElevenLabs (just slightly);
- The pronunciation is accurate and pleasant to listen to;
- Processing speed is comparable to ElevenLabs;
- The free version is quite generous and usable, enough for about two videos longer than 10 minutes;
- The paid version (lowest plan), which I’m currently using, provides enough credits for our team’s needs: more than 5 long videos per month.
Honestly, unless you listen very carefully, it’s quite hard even for me to tell which of our videos use ElevenLabs and which use Minimax. Let’s see if you can guess which tool was used in which video.
Free Version from Minimax – Enough for Basic Needs
Minimax’s dashboard is clean and intuitive. Compared to ElevenLabs, the interface is simpler, with fewer options, but it’s easier to use and less overwhelming, especially for beginners.

The free plan allows up to 10,000 characters and 3 saved voice slots. For my needs, creating just one personal voice clone is already enough.
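To get a rough feel for how far a character quota goes, you can do the arithmetic below. It is a sketch under stated assumptions: that every generated character counts against the quota, that regenerations count again, and that a script is about 4,000 characters long (a made-up figure for illustration). The function name `scripts_per_quota` is hypothetical.

```python
import math


def scripts_per_quota(quota_chars: int, script_chars: int,
                      avg_generations: float = 1.0) -> int:
    """Estimate how many scripts fit into a character quota.

    `avg_generations` is the average number of times each script is
    generated, counting regenerations (assumed to consume quota too).
    """
    if script_chars <= 0 or avg_generations <= 0:
        raise ValueError("script_chars and avg_generations must be positive")
    cost_per_script = math.ceil(script_chars * avg_generations)
    return quota_chars // cost_per_script


# 10,000-character free quota, hypothetical 4,000-character scripts.
print(scripts_per_quota(10_000, 4_000))        # no regenerations
print(scripts_per_quota(10_000, 4_000, 1.5))   # half of the text regenerated once
```

Under these assumptions, regenerating even part of a script noticeably reduces how many videos a quota covers, so it helps to include regenerations when picking a plan.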


The workflow is quite similar to ElevenLabs. With the free plan, it’s sufficient for a few short videos, and the quality is good enough.
Paid Versions from Minimax – Better for Long Videos & Team Work
I’m currently using Minimax’s annual Starter plan. Overall, the team and I are quite satisfied. The voice quality is consistent, and more importantly, the credit limit feels comfortable: we haven’t run into shortages during content production.
This plan includes:
- 100,000 credits per month;
- 10 voice slots;
- Unlimited voice modifiers and emotion selection;
- Faster generation speed;
- …
In terms of usage, the paid plan is almost identical to the free version: choose a voice > paste text > adjust speed, pitch, and volume > generate audio. There are no extra steps or screens, so if you’re already familiar with the free version, upgrading doesn’t require relearning anything.

What I personally appreciate most is the stability when generating long text. Although ElevenLabs states that it can generate up to 5,000 characters at once, in practice, issues often start appearing after around 1,000 characters: faster reading, uneven rhythm, or a drop in tone at the end. With Minimax, I haven’t encountered these issues, even when generating long segments of 3,000–4,000 characters in a single run.
Considering the price and what’s included in the Starter plan, I think this is a reasonable choice for small teams or creators who produce long videos and need stable audio. If you’ve used higher Minimax plans, I’d love to hear about your experience.
Testing with Our Native Language
I also run other social channels for my own business and create many videos about daily life there, so I tested both tools with Vietnamese, my native language. Unfortunately, the results weren’t very good:
- Incorrect pronunciation, especially with Vietnamese diacritics;
- Unnatural rhythm and intonation;
- An obviously machine-generated sound.
I’ve tried some other tools that are better localized for Vietnamese, and their output quality is much better. If you’re not confident in your own voice, these tools can be a good alternative.
Since I’m already very comfortable recording Vietnamese myself, it usually takes me only 3–5 minutes to record under 500 words. So I still prefer recording manually to keep the voice and emotions as natural as possible.
Last Words
AI voice-over hasn’t replaced humans in every situation yet. But for technical videos, product demos, and other low-emotion content, it is genuinely useful and worth trying.
At the moment, both ElevenLabs and Minimax work very well with English. For Vietnamese (and possibly other languages), there’s still room for improvement. If you’ve tried AI voice with other languages, how was your experience? Feel free to share it with us!