From Robot Voices to Soulful Storytellers

But let’s be honest: until recently, turning a book into an audiobook was a luxury reserved for the bestsellers. You needed a studio, a sound engineer, and a voice actor with the patience of a saint. For a startup like Mythoria, where we want every child (and adult!) to be the hero of their own personalized book, that traditional path was a closed door.

Then, AI kicked the door down. 🚪💥

This weekend, I went down the rabbit hole. I spent 48 hours testing the absolute cutting edge of AI text-to-speech (TTS) engines to find the perfect voice for Mythoria. And guess what? We have some massive news to share along the way.

🎧 The Golden Ticket: The ElevenLabs Grant

First, the big news. Mythoria has been selected for an ElevenLabs Grant.

For those who don't know, ElevenLabs is basically the "Pixar" of AI voice generation right now. Receiving this grant is a huge validation for us. It means a few incredible things:

Access to the Future: We get early access to their most advanced models (hello, V3!) before they are widely available.
Sustainability: It significantly subsidizes the cost of generating high-quality audio, meaning we can offer premium narration to you without breaking the bank.
Creative Freedom: We can experiment with sound effects and character acting that were previously impossible.

It’s not just tech support; it’s a partnership that lets us put a professional narrator in your pocket.

🇵🇹 The "Accent" Elephant in the Room

Before I break down the tech, I need to talk about something that hits close to home for us in Portugal: Accents.

If you are Portuguese, you know the pain. You open a "Portuguese" option in an app, and 9 times out of 10, it’s Brazilian Portuguese. Now, I do like the Brazilian accent (it is musical and sweet), but if you live in Porto or Lisbon and are reading a story about the Douro River, hearing a Rio accent breaks the immersion instantly.

Finding an AI that nails European Portuguese (pt-PT) without sounding robotic or accidentally switching to Brazilian vowels has been my "white whale." The same goes for European Spanish (es-ES) vs. Latin American.

Consistency is key. A narrator can't start sounding like he's from Coimbra and finish the sentence sounding like he's from São Paulo. This weekend’s tests were brutal on this front, but we found some winners.

🤖 The Battle of the Voices: The Weekend Experiment

I tested everything. From the giants at Google and OpenAI to the specialists at ElevenLabs. Here is the breakdown of the engines we are integrating into Mythoria.

1. OpenAI: The Conversationalist

OpenAI offers two main flavors here: tts-1-hd and the newer gpt-4o-mini-tts.

tts-1-hd: This is the high-definition standard. It’s smooth, very clean, and sounds like a professional broadcaster. It’s great for non-fiction or calm narration.
gpt-4o-mini-tts: This is the game-changer for dialogue. Because it’s built on a newer, smarter model, it understands context better. If a character asks a question, it sounds inquisitive. If there is a joke, it delivers it with a lighter tone. It’s less "reading text" and more "talking to you."

Verdict: Incredible for conversational flow, but it sometimes struggles to hold a strict regional accent if you push it too hard with local slang. It can also sound a little “metallic” in the background.
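For the tech-savvy, here is a minimal sketch of how a request to gpt-4o-mini-tts might be assembled so that the accent is steered explicitly instead of left to chance. The helper name build_speech_request, the choice of voice, and the wording of the accent hints are assumptions made for illustration, not Mythoria's production code.

```python
import json


def build_speech_request(text: str, locale: str = "pt-PT", voice: str = "alloy") -> dict:
    """Assemble parameters for a text-to-speech request (hypothetical helper)."""
    # Assumed steering text: the wording is illustrative, not a tested prompt.
    accent_hints = {
        "pt-PT": "Speak in European Portuguese with a Lisbon accent. "
                 "Never drift into Brazilian pronunciation.",
        "es-ES": "Speak in European Spanish with a Castilian accent.",
    }
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        # Fall back to a generic hint for locales without a hand-written one.
        "instructions": accent_hints.get(locale, f"Speak naturally in {locale}."),
    }


request = build_speech_request("O leão soltou um “Rooaarr!” tão profundo que o chão tremeu.")
print(json.dumps(request, ensure_ascii=False, indent=2))
```

With the official OpenAI Python SDK, these parameters could be passed as client.audio.speech.create(**request). The network call is left out so the sketch runs offline.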

2. Google: The Specialist

Google is treating TTS with two very different philosophies: Chirp vs. Gemini.

Google Chirp (v3): Think of this as the "Studio Voice." It’s incredibly polished. The pt-PT voices here are solid—stable, clear, and very European. It doesn't hallucinate; it reads exactly what is there with high fidelity.
Google Gemini (2.5): This is the wild card. It’s a multi-modal model. You can prompt it like a director: "Read this like an old wizard who is slightly out of breath." It attempts to act. It’s riskier because it can sometimes be unpredictable, but when it hits, it’s magic.

Verdict: Chirp is our rock for stability; Gemini is our laboratory for experimental character voices.

3. ElevenLabs: The Performer (V2 vs. V3)

This is where the magic happens.

ElevenLabs V2: The reliable workhorse. It clones voices perfectly and handles emotion well. It’s what most people think of when they hear "good AI voice."
ElevenLabs V3: This is what the grant unlocked for us. V3 isn't just reading; it's performing. It understands dramatic pacing. You can tag parts of the text to change emotion mid-sentence. You can have a character whisper and then scream without splitting the audio files. It’s the closest thing to having a human actor in the booth.
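To make the tagging idea concrete, here is a small sketch of how a script with inline delivery tags could be put together before it is sent to the engine. The bracketed tag names (whispers, shouts) and both helper functions are assumptions for illustration. The exact tags a model understands should be checked against the current ElevenLabs documentation.

```python
from typing import List, Optional, Tuple


def tag(text: str, direction: str) -> str:
    """Prefix a piece of narration with a bracketed delivery tag."""
    return f"[{direction}] {text}"


def build_script(parts: List[Tuple[str, Optional[str]]]) -> str:
    """Join (text, direction) pairs into one script; None means plain narration."""
    return " ".join(tag(text, direction) if direction else text
                    for text, direction in parts)


script = build_script([
    ("De repente, o leão abriu um olho", None),
    ("e soltou um", "whispers"),
    ("“Rooaarr!”", "shouts"),
])
print(script)
```

Because the whole scene stays in one string, the whisper and the roar end up in a single request and a single audio file.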

🦁 The "Roarrr" Test: Listen for Yourself

To test these, I wrote a small scene inspired by my brother João and me back in the day (we were… energetic kids). I wanted to see how the models handled dialogue, narration, and sound effects written as text. I also wanted to check whether they can properly pronounce our family name 😉

The Excerpt:

Eu e o meu irmão João Jácome estávamos colados às grades, de olhos presos no leão adormecido. O sol batia-lhe na juba dourada, que parecia um fogo calmo a ondular. De repente, o leão abriu um olho, esticou as patas e soltou um “Rooaarr!” tão profundo que o chão tremeu debaixo dos nossos pés. Eu dei um passo atrás, meio assustado, meio a rir, enquanto o meu coração disparava como um tambor.

Atrás de nós, uma gata do jardim do zoo aproximou-se, curiosa, esfregando-se nas nossas pernas e soltando um tímido “miau”. A diferença entre o “Rooaarr!” gigante e o “miau” pequenino fez-nos rebentar a rir. O João tentou imitar os dois sons ao mesmo tempo, falhou redondamente, e acabou de braços no ar, a fazer caretas, enquanto eu pensava: “Um dia vou escrever esta cena num livro… e num audiobook.”

Here is how the different engines tackle this scene.

🧪 OpenAI (tts-1-hd)

Clean, but without emotion. Doesn't have the European Portuguese accent.

OpenAI TTS Sample

🧪 OpenAI (gpt-4o-mini-tts)

Clean, conversational, but plays it safe with the sound effects.

OpenAI GPT4o Sample

🧪 Google Chirp (HD)

Super crisp audio, perfect pronunciation, but the "Roar" feels a bit like a word rather than a sound.

Google Chirp 3

🧪 Google Gemini 2.5 (HD)

The best all-round TTS engine. Good consistency and good (although not perfect) human emotion.

Google Gemini 2.5 Flash

A little bit more expensive, but it is worth it.

Google Gemini 2.5 Pro

Google Gemini 3 is not yet available as a Text-to-Speech engine 😞

🧪 ElevenLabs V2 Multilingual (The Grant Winner)

Good consistency, but it lacks emotion and does not seem to understand the sounds or the message being spoken.

ElevenLabs V2 Multilingual

🧪 ElevenLabs V3 Alpha (The Grant Winner)

Notice the pacing. The whisper is a whisper. The "ROAR" has intensity. The "Miau" sounds playful.

ElevenLabs V3 Alpha

📊 The Showdown: Model Comparison

Here is the cheat sheet for the tech-savvy among you.

Feature	OpenAI (gpt-4o-mini-tts)	Google Chirp	Google Gemini	ElevenLabs V2	ElevenLabs V3
Best Use Case	Chatty characters, fluid dialogue	Professional, neutral narration	Experimental character acting	Reliable emotive storytelling	High-drama performance
Emotional Range	High (Context aware)	Medium (Stable)	Very High (Promptable)	High	Extreme (Directable)
Accent Control	Good, but American bias creeps in	Excellent (Region specific)	Good (Promptable)	Good (Clone dependent)	Excellent (Tag dependent)
Latency	Fast	Medium	Slower	Fast	Real-time capable
Cost (Est.)	Low	Medium	High	Medium	Premium (Grant helps!)
"Roarr" Factor	Reads the word enthusiastically	Reads the word clearly	Acts the word	Acts the word	Becomes the lion

🎶 Soundscapes: Music & Effects

With the new ElevenLabs capabilities, we aren't just generating voice. We are generating atmosphere.

Music as Stage Lighting

Good background music is like good lighting: you barely notice it, but it makes everything feel real. We follow the golden rule of pro audio: The narration is the star. The music stays low, clean, and instrumental.

The "Mood" Engine

Instead of picking random songs, we mapped out 10 custom audio moods. Mythoria analyzes your story's Audience and Style to pick the perfect fit automatically.

Smart Matching: A "Horror" story for a 7-year-old gets a safe adventure track, not a nightmare soundtrack.
Baby Proof: Stories for toddlers (0–2) always get the Soft Bedtime lullaby, no matter the genre.
Genre Lock: Sci-fi gets space ambience; Romance gets warm acoustics.
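The three rules above can be sketched as a small lookup with safety overrides. This is only an illustration of the logic described here: the mood identifiers (apart from the Soft Bedtime idea), the child age cutoff of 12, and the function name pick_mood are assumptions, not the actual Mythoria implementation.

```python
# Hypothetical mood identifiers; the post only names "Soft Bedtime" explicitly.
SOFT_BEDTIME = "soft_bedtime"
SAFE_ADVENTURE = "safe_adventure"
DEFAULT_MOOD = "neutral_storytelling"

# Genre Lock: each genre maps to its own atmosphere.
GENRE_MOODS = {
    "sci-fi": "space_ambience",
    "romance": "warm_acoustic",
    "horror": "dark_suspense",
}

TODDLER_MAX_AGE = 2   # from the post: stories for ages 0-2
CHILD_MAX_AGE = 12    # assumed cutoff for the "safe track" rule


def pick_mood(genre: str, audience_max_age: int) -> str:
    """Pick a background-music mood from the story genre and the audience age."""
    # Baby Proof: toddlers always get the lullaby, whatever the genre.
    if audience_max_age <= TODDLER_MAX_AGE:
        return SOFT_BEDTIME
    normalized = genre.strip().lower()
    # Smart Matching: scary genres are softened for children.
    if normalized == "horror" and audience_max_age <= CHILD_MAX_AGE:
        return SAFE_ADVENTURE
    return GENRE_MOODS.get(normalized, DEFAULT_MOOD)


print(pick_mood("Horror", 7))
print(pick_mood("Sci-Fi", 2))
print(pick_mood("Romance", 30))
```

Checking the age overrides before the genre table is what keeps a toddler's sci-fi story on the lullaby track. Swapping the order would let the genre win.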
