From Robot Voices to Soulful Storytellers

I’ve always believed that a story isn’t truly finished until it’s heard. There’s a reason we read aloud to children before bed, or why we huddle around campfires to recount old legends. The voice adds a layer of magic that ink on paper—or pixels on a screen—sometimes can’t quite capture.
But let’s be honest: until recently, turning a book into an audiobook was a luxury reserved for the bestsellers. You needed a studio, a sound engineer, and a voice actor with the patience of a saint. For a startup like Mythoria, where we want every child (and adult!) to be the hero of their own personalized book, that traditional path was a closed door.
Then, AI kicked the door down. 🚪💥
This weekend, I went down the rabbit hole. I spent 48 hours testing the absolute cutting edge of AI text-to-speech (TTS) engines to find the perfect voice for Mythoria. And guess what? We have some massive news to share along the way.
🎧 The Golden Ticket: The ElevenLabs Grant
First, the big news. Mythoria has been selected for an ElevenLabs Grant.
For those who don't know, ElevenLabs is basically the "Pixar" of AI voice generation right now. Receiving this grant is a huge validation for us. It means a few incredible things:
- Access to the Future: We get early access to their most advanced models (hello, V3!) before they are widely available.
- Sustainability: It significantly subsidizes the cost of generating high-quality audio, meaning we can offer premium narration to you without breaking the bank.
- Creative Freedom: We can experiment with sound effects and character acting that were previously impossible.
It’s not just tech support; it’s a partnership that lets us put a professional narrator in your pocket.
🇵🇹 The "Accent" Elephant in the Room
Before I break down the tech, I need to talk about something that hits close to home for us in Portugal: Accents.
If you are Portuguese, you know the pain. You open a "Portuguese" option in an app, and 9 times out of 10, it’s Brazilian Portuguese. Now, I do like the Brazilian accent (it is musical and sweet), but if you live in Porto or Lisbon and are reading a story about the Douro River, hearing a Rio accent breaks the immersion instantly.
Finding an AI that nails European Portuguese (pt-PT) without sounding robotic or accidentally switching to Brazilian vowels has been my "white whale." The same goes for European Spanish (es-ES) vs. Latin American.
Consistency is key. A narrator can't start sounding like he's from Coimbra and finish the sentence sounding like he's from São Paulo. This weekend’s tests were brutal on this front, but we found some winners.
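For the tech-savvy, here is a minimal sketch of the rule we care about: match the full locale tag exactly instead of settling for any voice that merely shares the language. The voice catalog and the `pick_voice` helper are hypothetical, invented for illustration; they are not any provider's API.

```python
from typing import Optional

# Hypothetical voice catalog; names and fields are made up for this example.
VOICES = [
    {"name": "voice_a", "locale": "pt-BR"},
    {"name": "voice_b", "locale": "pt-PT"},
    {"name": "voice_c", "locale": "es-ES"},
]


def pick_voice(locale: str, voices=VOICES) -> Optional[dict]:
    """Return the first voice whose locale matches exactly (case-insensitive).

    There is deliberately no fallback to a sibling region: "pt-PT" never
    resolves to a "pt-BR" voice. The caller gets None and decides what to do.
    """
    wanted = locale.strip().lower()
    for voice in voices:
        if voice["locale"].lower() == wanted:
            return voice
    return None
```

Returning `None` instead of the "closest" voice is the whole point: a missing pt-PT voice should be a visible gap, not a silent switch to a Rio accent.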
🤖 The Battle of the Voices: The Weekend Experiment
I tested everything. From the giants at Google and OpenAI to the specialists at ElevenLabs. Here is the breakdown of the engines we are integrating into Mythoria.
1. OpenAI: The Conversationalist
OpenAI offers two main flavors here: tts-1-hd and the newer gpt-4o-mini-tts.
- tts-1-hd: This is the high-definition standard. It’s smooth, very clean, and sounds like a professional broadcaster. It’s great for non-fiction or calm narration.
- gpt-4o-mini-tts: This is the game-changer for dialogue. Because it’s built on a newer, smarter model, it understands context better. If a character asks a question, it sounds inquisitive. If there is a joke, it delivers it with a lighter tone. It’s less "reading text" and more "talking to you."
Verdict: Incredible for conversational flow, but it sometimes struggles to maintain a strict, specific accent if you push it too hard with local slang. Sometimes it also sounds a little bit “metallic” in the background.
2. Google: The Specialist
Google is treating TTS with two very different philosophies: Chirp vs. Gemini.
- Google Chirp (v3): Think of this as the "Studio Voice." It’s incredibly polished. The pt-PT voices here are solid—stable, clear, and very European. It doesn't hallucinate; it reads exactly what is there with high fidelity.
- Google Gemini (2.5): This is the wild card. It’s a multi-modal model. You can prompt it like a director: "Read this like an old wizard who is slightly out of breath." It attempts to act. It’s riskier because it can sometimes be unpredictable, but when it hits, it’s magic.
Verdict: Chirp is our rock for stability; Gemini is our laboratory for experimental character voices.
3. ElevenLabs: The Performer (V2 vs. V3)
This is where the magic happens.
- ElevenLabs V2: The reliable workhorse. It clones voices perfectly and handles emotion well. It’s what most people think of when they hear "good AI voice."
- ElevenLabs V3: This is what the grant unlocked for us. V3 isn't just reading; it's performing. It understands dramatic pacing. You can tag parts of the text to change emotion mid-sentence. You can have a character whisper and then scream without splitting the audio files. It’s the closest thing to having a human actor in the booth.
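To make the tagging idea concrete, here is a small sketch of how a scene could be assembled into one tagged script, so the whole passage goes to the engine as a single piece of text. The square-bracket syntax and the tag names `whispers` and `shouts` are assumptions for illustration, and `tag_script` is a hypothetical helper, not part of any SDK.

```python
from typing import List, Optional, Tuple


def tag_script(segments: List[Tuple[Optional[str], str]]) -> str:
    """Join (tag, text) segments into one script with inline tags.

    A segment with tag None is plain narration. A tagged segment is
    prefixed with "[tag]" (assumed syntax) so the delivery can change
    mid-passage without splitting the text into separate requests.
    """
    pieces = []
    for tag, text in segments:
        pieces.append(f"[{tag}] {text}" if tag else text)
    return " ".join(pieces)


scene = [
    (None, "The lion opened one eye."),
    ("whispers", "Don't move."),
    ("shouts", "Rooaarr!"),
]
script = tag_script(scene)
```

The resulting `script` is a single string, which is what lets a whisper and a roar live in the same audio file.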
🦁 The "Roarrr" Test: Listen for Yourself
To test these, I wrote a small scene inspired by my brother João and me back in the day (we were… energetic kids). I wanted to see how the models handled dialogue, narration, and sound effects written as text. And also, to check whether they can properly pronounce our family name 😉
The Excerpt:
Eu e o meu irmão João Jácome estávamos colados às grades, de olhos presos no leão adormecido. O sol batia-lhe na juba dourada, que parecia um fogo calmo a ondular. De repente, o leão abriu um olho, esticou as patas e soltou um “Rooaarr!” tão profundo que o chão tremeu debaixo dos nossos pés. Eu dei um passo atrás, meio assustado, meio a rir, enquanto o meu coração disparava como um tambor.
Atrás de nós, uma gata do jardim do zoo aproximou-se, curiosa, esfregando-se nas nossas pernas e soltando um tímido “miau”. A diferença entre o “Rooaarr!” gigante e o “miau” pequenino fez-nos rebentar a rir. O João tentou imitar os dois sons ao mesmo tempo, falhou redondamente, e acabou de braços no ar, a fazer caretas, enquanto eu pensava: “Um dia vou escrever esta cena num livro… e num audiobook.”
Here is how the different engines tackle this scene.
🧪 OpenAI (tts-1-hd)
Clean, but without emotion. Doesn't have the European Portuguese accent.
OpenAI TTS Sample
🧪 OpenAI (gpt-4o-mini-tts)
Clean, conversational, but plays it safe with the sound effects.
OpenAI GPT4o Sample
🧪 Google Chirp (HD)
Super crisp audio, perfect pronunciation, but the "Roar" feels a bit like a word rather than a sound.
Google Chirp 3
🧪 Google Gemini 2.5 (HD)
The best all-round TTS engine. Good consistency and good (although not perfect) human emotion.
Google Gemini 2.5 Flash
A little bit more expensive, but it is worth it:
Google Gemini 2.5 Pro
Google Gemini 3 is not yet available as a Text-to-Speech engine 😞
🧪 ElevenLabs V2 Multilanguage (The Grant Winner)
Good consistency but lacks emotion and understanding of the sounds and the message being spoken.
ElevenLabs V2 Multilanguage
🧪 ElevenLabs V3 Alpha (The Grant Winner)
Notice the pacing. The whisper is a whisper. The "ROAR" has intensity. The "Miau" sounds playful.
ElevenLabs V3 Alpha
📊 The Showdown: Model Comparison
Here is the cheat sheet for the tech-savvy among you.
| Feature | OpenAI (gpt-4o) | Google Chirp | Google Gemini | ElevenLabs V2 | ElevenLabs V3 |
|---|---|---|---|---|---|
| Best Use Case | Chatty characters, fluid dialogue | Professional, neutral narration | Experimental character acting | Reliable emotive storytelling | High-drama performance |
| Emotional Range | High (Context aware) | Medium (Stable) | Very High (Promptable) | High | Extreme (Directable) |
| Accent Control | Good, but American bias creeps in | Excellent (Region specific) | Good (Promptable) | Good (Clone dependent) | Excellent (Tag dependent) |
| Latency | Fast | Medium | Slower | Fast | Real-time capable |
| Cost (Est.) | Low | Medium | High | Medium | Premium (Grant helps!) |
| "Roarr" Factor | Reads the word enthusiastically | Reads the word clearly | Acts the word | Acts the word | Becomes the lion |
🎶 Soundscapes: Music & Effects
With the new ElevenLabs capabilities, we aren't just generating voice. We are generating atmosphere.
Music as Stage Lighting
Good background music is like good lighting: you barely notice it, but it makes everything feel real. We follow the golden rule of pro audio: The narration is the star. The music stays low, clean, and instrumental.
The "Mood" Engine
Instead of picking random songs, we mapped out 10 custom audio moods. Mythoria analyzes your story's Audience and Style to pick the perfect fit automatically.
- Smart Matching: A "Horror" story for a 7-year-old gets a safe adventure track, not a nightmare soundtrack.
- Baby Proof: Stories for toddlers (0–2) always get the Soft Bedtime lullaby, no matter the genre.
- Genre Lock: Sci-fi gets space ambience; Romance gets warm acoustics.
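The three rules above can be sketched as a tiny lookup with two overrides. This is an illustration only: the mood identifiers, the `pick_mood` helper, the default mood, and the age cut-off for "too young for horror" (under 13 here) are all assumptions, not Mythoria's actual mapping.

```python
# Hypothetical mood identifiers; only "Soft Bedtime" is named in the post.
GENRE_MOODS = {
    "sci-fi": "space_ambience",
    "romance": "warm_acoustic",
    "horror": "dark_suspense",
    "adventure": "safe_adventure",
}

DEFAULT_MOOD = "safe_adventure"  # assumed fallback for unknown genres
HORROR_MIN_AGE = 13              # assumed threshold, not from the post


def pick_mood(genre: str, audience_max_age: int) -> str:
    """Pick a background-music mood from genre and audience age."""
    genre = genre.strip().lower()
    # Baby Proof: toddlers (0-2) always get the lullaby, whatever the genre.
    if audience_max_age <= 2:
        return "soft_bedtime"
    # Smart Matching: young readers get a safe track instead of real horror.
    if genre == "horror" and audience_max_age < HORROR_MIN_AGE:
        return "safe_adventure"
    # Genre Lock: otherwise the genre decides.
    return GENRE_MOODS.get(genre, DEFAULT_MOOD)
```

The order of the checks matters: the age overrides run before the genre lookup, so a sci-fi story for a one-year-old still gets the lullaby.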
How We Direct the AI
We use models like ElevenLabs Music to generate these royalty-free soundtracks. But we don't just say "make music." We feed it strictly engineered prompts to ensure it doesn't distract from the story:
"Instrumental only. Slow 60 BPM. Warm strings. No percussion. Loop quietly under the voice."
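A prompt like that is easier to keep strict when it is assembled from a few controlled fields rather than typed by hand. Here is a minimal sketch; the `build_music_prompt` helper, its parameters, and the tempo cut-offs are assumptions for illustration, and the code only builds the text without calling any music API.

```python
def build_music_prompt(timbre: str, bpm: int, allow_percussion: bool = False) -> str:
    """Assemble a constrained music prompt from a few fixed building blocks."""
    # Rough tempo label; the cut-offs are arbitrary choices for this sketch.
    if bpm < 80:
        tempo = "Slow"
    elif bpm < 120:
        tempo = "Moderate"
    else:
        tempo = "Fast"

    parts = [
        "Instrumental only.",                  # never compete with the narrator
        f"{tempo} {bpm} BPM.",
        f"{timbre[:1].upper()}{timbre[1:]}.",  # e.g. "Warm strings."
    ]
    if not allow_percussion:
        parts.append("No percussion.")
    parts.append("Loop quietly under the voice.")
    return " ".join(parts)


prompt = build_music_prompt("warm strings", 60)
```

With those inputs, `prompt` comes out as exactly the example quoted above.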
The Perfect Mix
Finally, we use "ducking." When the narrator speaks or whispers, the music volume dips automatically. It’s a cinematic score that knows its place—quietly, intentionally serving your story.
You can now try this! We have added the option to include background music when narrating a story. This improves the overall atmosphere of the audiobook.
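For the curious, the ducking idea can be sketched in a few lines of plain Python. This is a simplified illustration on raw sample lists, not our production mixer: the `duck` function, the threshold, the ducked gain, and the window and smoothing values are all assumed for the example.

```python
from typing import List


def duck(
    music: List[float],
    voice: List[float],
    threshold: float = 0.02,
    ducked_gain: float = 0.25,
    window: int = 4,
    smoothing: float = 0.5,
) -> List[float]:
    """Return the music samples, attenuated wherever the voice is active."""
    out = []
    gain = 1.0
    for i, sample in enumerate(music):
        # Peak voice level over a short look-back window ending at sample i.
        start = max(0, i - window + 1)
        level = max((abs(v) for v in voice[start:i + 1]), default=0.0)
        # Aim for the lower gain while the narrator is speaking.
        target = ducked_gain if level > threshold else 1.0
        # Move part of the way toward the target so the volume glides
        # down and back up instead of jumping.
        gain += smoothing * (target - gain)
        out.append(sample * gain)
    return out


music = [1.0] * 20
voice = [0.0] * 5 + [0.5] * 10 + [0.0] * 5
mixed_music = duck(music, voice)
```

In this toy example the music stays at full volume during the silent intro, settles near a quarter of its level while the voice is active, and climbs back once the voice stops. Real audio would use far longer windows (tens of milliseconds of samples), but the shape of the logic is the same.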
The Future is Loud (and Personal)
This grant changes the game for Mythoria. It means your stories won't just look beautiful—they will sound alive. We are rolling out these features in beta soon, starting with the European Portuguese and English voices.
Keep writing. We’ll handle the speaking. 🎙️✨