A growing family of audio models that’s empowering users, developers, and enterprises.



Expressive speech generation

Craft expressive narratives with granular control over style, tone, and performance using Gemini 2.5 Flash and Pro Text-to-Speech.

Dynamic performance

Bring text to life with expressive readings. Request specific emotions, accents, or styles to match your creative vision.

Multi-speaker generation

Generate engaging two-person conversations from a single text input. Create podcasts, interviews, or interactive scenarios with distinct character voices.
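A sketch of how expressive, multi-speaker generation might be driven from the Gemini API's Python SDK (`google-genai`). The model id, voice names, and output format here are assumptions based on preview documentation and may differ from what is currently available; check the official docs before relying on them.

```python
def dialogue_script(style: str, turns: list[tuple[str, str]]) -> str:
    """Format style directions plus speaker turns into a single text prompt."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return style + "\n" + "\n".join(lines)


def synthesize_dialogue(script: str, voices: dict[str, str], out_path: str = "dialogue.wav"):
    """Generate a two-speaker reading of `script` (hypothetical sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types
    import wave

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",  # assumed TTS model id
        contents=script,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                    speaker_voice_configs=[
                        types.SpeakerVoiceConfig(
                            speaker=name,
                            voice_config=types.VoiceConfig(
                                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                    voice_name=voice
                                )
                            ),
                        )
                        for name, voice in voices.items()
                    ]
                )
            ),
        ),
    )
    pcm = response.candidates[0].content.parts[0].inline_data.data
    # Assumed output format: raw 16-bit, 24 kHz mono PCM.
    with wave.open(out_path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(24000)
        f.writeframes(pcm)
```

Style directions ride along in the same prompt as the dialogue, e.g. `synthesize_dialogue(dialogue_script("Read this as a lively podcast intro:", [("Ana", "Welcome back!"), ("Ben", "Great to be here.")]), {"Ana": "Kore", "Ben": "Puck"})`.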


Live speech translation

Break down language barriers using Gemini’s speech-to-speech translation capabilities.


Language coverage

Gemini’s world knowledge and multilingual capabilities, combined with its native audio understanding, allow it to translate speech across more than 70 languages and 2,000 language pairs.

Style transfer

Preserves the original speakers’ intonation, pacing, and pitch, adding depth that conveys not just what is said, but how it is said.

Multilingual input

Understands multiple languages simultaneously in a single session, helping you follow multilingual conversations without changing any settings.

Automatic language detection

Identifies the languages being spoken and starts translating – so you don’t need to figure it out yourself.

Noise robustness

Filters out ambient noise so you can hold a conversation comfortably – even in loud, outdoor environments.
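One plausible way to wire up live translation is through the Gemini Live API in the `google-genai` Python SDK, streaming microphone audio up and receiving translated speech back. The model id, the input format (16 kHz, 16-bit mono PCM), and the idea of steering translation with a system instruction are all assumptions for illustration:

```python
import asyncio

CHUNK_MS = 200   # send audio in ~200 ms frames
RATE = 16_000    # assumed input format: 16 kHz, 16-bit mono PCM
BYTES_PER_FRAME = RATE * 2 * CHUNK_MS // 1000


def chunk_pcm(pcm: bytes, frame_bytes: int = BYTES_PER_FRAME) -> list[bytes]:
    """Split raw PCM into fixed-size frames for streaming."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


async def translate_stream(pcm: bytes) -> bytes:
    """Send audio to a live session, collect translated speech (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        system_instruction=(
            "You are a live interpreter. Translate whatever you hear into "
            "English, preserving the speaker's tone and pacing."
        ),
    )
    out = bytearray()
    async with client.aio.live.connect(
        model="gemini-2.5-flash-preview-native-audio-dialog",  # assumed id
        config=config,
    ) as session:
        for frame in chunk_pcm(pcm):
            await session.send_realtime_input(
                audio=types.Blob(data=frame, mime_type="audio/pcm;rate=16000")
            )
        async for message in session.receive():
            if message.data:  # translated audio from the model
                out.extend(message.data)
    return bytes(out)
```

A real application would capture frames from the microphone and play responses as they arrive rather than buffering whole clips, but the session shape is the same.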


Audio understanding

Unlock insights directly from audio files with Gemini’s audio capabilities.

A diagram illustrating the process of converting audio into structured data. On the left, a blue sound wave icon represents the audio input. A horizontal arrow points to the right, leading to a blue grid icon composed of four squares, representing the data output.

Turn audio into data

Transform unstructured audio – like voice notes, support calls, or lectures – into clean, actionable structured output such as JSON, summaries, or action-item lists.
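A minimal sketch of this workflow with the `google-genai` Python SDK: upload an audio file, ask for JSON output, and validate the result. The model id, prompt, and key names (`summary`, `sentiment`, `action_items`) are assumptions chosen for illustration:

```python
import json

REQUIRED_KEYS = {"summary", "sentiment", "action_items"}


def parse_summary(raw: str) -> dict:
    """Parse and sanity-check the model's JSON output."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output missing keys: {missing}")
    return data


def summarize_audio(path: str) -> dict:
    """Upload an audio file and extract structured data from it (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    audio = client.files.upload(file=path)  # e.g. a support-call recording
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            audio,
            'Summarize this call as JSON with keys "summary" (string), '
            '"sentiment" (string), and "action_items" (list of strings).',
        ],
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return parse_summary(response.text)
```

Constraining the output with `response_mime_type="application/json"` (optionally plus a response schema) keeps the result machine-readable instead of free-form prose.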

A diagram illustrating speaker separation. At the top, a blue audio waveform represents the input source. A branching line splits downwards from the waveform to two distinct user icons. This visualizes the process of identifying and separating unique voices from a single audio stream.

Precise speaker separation

Accurately distinguish and label multiple speakers within a single transcript, ensuring clarity and correct attribution in interviews, panels, or meetings.
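One way to get a speaker-separated transcript is simply to ask for labeled turns and then parse them. This is a hypothetical prompt-based sketch using the `google-genai` Python SDK; the model id and the `Name: text` line format are assumptions, not a dedicated diarization API:

```python
def split_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Group a labeled transcript ('Name: text' per line) into per-speaker turns."""
    turns: dict[str, list[str]] = {}
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if sep and text.strip():
            turns.setdefault(speaker.strip(), []).append(text.strip())
    return turns


def diarized_transcript(path: str) -> str:
    """Ask the model for a speaker-labeled transcript (sketch)."""
    # SDK imported lazily so the pure helper above works without it installed.
    from google import genai

    client = genai.Client()
    audio = client.files.upload(file=path)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            audio,
            "Transcribe this recording. Label each turn with the speaker's "
            "name if stated, otherwise 'Speaker 1', 'Speaker 2', and so on, "
            "one turn per line in the form 'Name: text'.",
        ],
    )
    return response.text
```

`split_by_speaker(diarized_transcript("panel.mp3"))` would then yield each participant's turns under their own key, ready for attribution in meeting notes.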

A diagram illustrating the model's ability to detect non-verbal cues and speech styles. On the left, a blue audio waveform represents the input. Arrows branch out from this waveform to the right, pointing to three specific examples of detected nuances: Top: A smiling face icon labeled "Laughter". Middle: A weary face icon labeled "Sighs". Bottom: An ear icon listening to sound waves, labeled "Whisper". This visualizes how the model captures emotional context beyond just the spoken text.

Understand the moment

Capture more than the words alone. Gather the sentiment, speaking style, and the nuances that make speech human – like laughter.


Safety

We’ve proactively assessed potential risks during every stage of the development process for these native audio features, using what we’ve learned to inform our mitigation strategies. We validate these measures through rigorous internal and external safety evaluations, including comprehensive red teaming for responsible deployment.

All audio outputs from our models are marked with SynthID, our advanced watermarking technology, allowing you to detect whether an audio track has been created or edited using Google AI.


Try Gemini Audio