A growing family of audio models that’s empowering users, developers, and enterprises.



Expressive speech generation

Craft expressive narratives with granular control over style, tone, and performance using Gemini 2.5 Flash and Pro Text-to-Speech.

Dynamic performance

Bring text to life with expressive readings. Request specific emotions, accents, or styles to match your creative vision.

Multi-speaker generation

Generate engaging two-person conversations from a single text input. Create podcasts, interviews, or interactive scenarios with distinct character voices.
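A sketch of how expressive, multi-speaker generation might be driven from the Gemini API's Python SDK (`google-genai`). The model id, voice names, and output format here are assumptions based on preview documentation and may differ from what is currently available; check the official docs before relying on them.

```python
def dialogue_script(style: str, turns: list[tuple[str, str]]) -> str:
    """Format style directions plus speaker turns into a single text prompt."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return style + "\n" + "\n".join(lines)


def synthesize_dialogue(script: str, voices: dict[str, str], out_path: str = "dialogue.wav"):
    """Generate a two-speaker reading of `script` (hypothetical sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types
    import wave

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",  # assumed TTS model id
        contents=script,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                    speaker_voice_configs=[
                        types.SpeakerVoiceConfig(
                            speaker=name,
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name=voice
                                )
                            ),
                        )
                        for name, voice in voices.items()
                    ]
                )
            ),
        ),
    )
    pcm = response.candidates[0].content.parts[0].inline_data.data
    # Assumed output format: raw 16-bit, 24 kHz mono PCM.
    with wave.open(out_path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(24000)
        f.writeframes(pcm)
```

Style directions ride along in the same prompt as the dialogue, e.g. `synthesize_dialogue(dialogue_script("Read this as a lively podcast intro:", [("Ana", "Welcome back!"), ("Ben", "Great to be here.")]), {"Ana": "Kore", "Ben": "Puck"})`.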


Live speech translation

Break down language barriers using Gemini’s speech-to-speech translation capabilities.


Language coverage

Gemini’s world knowledge and multilingual capabilities, combined with its native audio understanding, allow it to translate speech across more than 70 languages and 2,000 language pairs.

Style transfer

Preserves the original speakers’ intonation, pacing, and pitch, adding depth that conveys not just what is said, but how it is said.

Multilingual input

Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without changing any settings.

Automatic language detection

Identifies the languages being spoken and starts translating – so you don’t need to figure it out yourself.

Noise robustness

Filters out ambient noise so you can hold a conversation comfortably – even in loud, outdoor environments.
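One plausible way to wire up live translation is through the Gemini Live API in the `google-genai` Python SDK, streaming microphone audio up and receiving translated speech back. The model id, the input format (16 kHz, 16-bit mono PCM), and the idea of steering translation with a system instruction are all assumptions for illustration:

```python
import asyncio

CHUNK_MS = 200   # send audio in ~200 ms frames
RATE = 16_000    # assumed input format: 16 kHz, 16-bit mono PCM
BYTES_PER_FRAME = RATE * 2 * CHUNK_MS // 1000


def chunk_pcm(pcm: bytes, frame_bytes: int = BYTES_PER_FRAME) -> list[bytes]:
    """Split raw PCM into fixed-size frames for streaming."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


async def translate_stream(pcm: bytes) -> bytes:
    """Send audio to a live session, collect translated speech (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction=(
            "You are a live interpreter. Translate whatever you hear into "
            "English, preserving the speaker's tone and pacing."
        ),
    )
    out = bytearray()
    async with client.aio.live.connect(
        model="gemini-2.5-flash-preview-native-audio-dialog",  # assumed id
        config=config,
    ) as session:
        for frame in chunk_pcm(pcm):
            await session.send_realtime_input(
                audio=types.Blob(data=frame, mime_type="audio/pcm;rate=16000")
            )
        async for message in session.receive():
            if message.data:  # translated audio from the model
                out.extend(message.data)
    return bytes(out)
```

A real application would capture frames from the microphone and play responses as they arrive rather than buffering whole clips, but the session shape is the same.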


Audio understanding

Unlock insights directly from audio files with Gemini’s audio capabilities.

A diagram illustrating the process of converting audio into structured data. On the left, a blue sound wave icon represents the audio input. A horizontal arrow points to the right, leading to a blue grid icon composed of four squares, representing the data output.

Turn audio into data

Transform unstructured audio – like voice notes, support calls, or lectures – into clean, actionable structured output such as JSON, summaries, or action-item lists.
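A minimal sketch of this workflow with the `google-genai` Python SDK: upload an audio file, ask for JSON output, and validate the result. The model id, prompt, and key names (`summary`, `sentiment`, `action_items`) are assumptions chosen for illustration:

```python
import json

REQUIRED_KEYS = {"summary", "sentiment", "action_items"}


def parse_summary(raw: str) -> dict:
    """Parse and sanity-check the model's JSON output."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data


def summarize_audio(path: str) -> dict:
    """Upload an audio file and extract structured data from it (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    audio = client.files.upload(file=path)  # e.g. a support-call recording
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            audio,
            'Summarize this call as JSON with keys "summary" (string), '
            '"sentiment" (string), and "action_items" (list of strings).',
        ],
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return parse_summary(response.text)
```

Constraining the output with `response_mime_type="application/json"` (optionally plus a response schema) keeps the result machine-readable instead of free-form prose.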

A diagram illustrating speaker separation. At the top, a blue audio waveform represents the input source. A branching line splits downwards from the waveform to two distinct user icons. This visualizes the process of identifying and separating unique voices from a single audio stream.

Precise speaker separation

Accurately distinguish and label multiple speakers within a single transcript, ensuring clarity and correct attribution in interviews, panels, or meetings.
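One way to get a speaker-separated transcript is simply to ask for labeled turns and then parse them. This is a hypothetical prompt-based sketch using the `google-genai` Python SDK; the model id and the `Name: text` line format are assumptions, not a dedicated diarization API:

```python
def split_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Group a labeled transcript ('Name: text' per line) into per-speaker turns."""
    turns: dict[str, list[str]] = {}
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if sep and text.strip():
            turns.setdefault(speaker.strip(), []).append(text.strip())
    return turns


def diarized_transcript(path: str) -> str:
    """Ask the model for a speaker-labeled transcript (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai

    client = genai.Client()
    audio = client.files.upload(file=path)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            audio,
            "Transcribe this recording. Label each turn with the speaker's "
            "name if stated, otherwise 'Speaker 1', 'Speaker 2', and so on, "
            "one turn per line in the form 'Name: text'.",
        ],
    )
    return response.text
```

`split_by_speaker(diarized_transcript("panel.mp3"))` would then yield each participant's turns under their own key, ready for attribution in meeting notes.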

A diagram illustrating the model's ability to detect non-verbal cues and speech styles. On the left, a blue audio waveform represents the input. Arrows branch out from this waveform to the right, pointing to three specific examples of detected nuances: Top: A smiling face icon labeled "Laughter". Middle: A weary face icon labeled "Sighs". Bottom: An ear icon listening to sound waves, labeled "Whisper". This visualizes how the model captures emotional context beyond just the spoken text.

Understand the moment

Capture more than the words alone. Gather the sentiment, speaking style, and the nuances that make speech human – like laughter.


Safety

We’ve proactively assessed potential risks during every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment.

All audio outputs from our models are marked with SynthID, our advanced watermarking technology, allowing you to detect whether an audio track has been created or edited using Google AI.


Try Gemini Audio