Branch Module · Deep Dive

Real-Time Voice Agents

Build AI that listens, thinks, and speaks — in real time. The Gemini Multimodal Live API makes voice-first experiences accessible to everyone.

60 min · Free · Gemini Live API

Why voice is different

You've already used AI in a text window — type a question, wait, read the answer. Voice agents are a completely different experience: you speak, the AI responds in under a second, and a natural conversation unfolds. No typing, no waiting for paragraphs to load.

This isn't like Siri or Alexa, which route through rigid command parsing. Modern voice agents use a large language model directly — they can understand nuance, handle unexpected questions, and respond in context. A customer asks your café's voice bot "do you have anything vegan that isn't salad?" and it actually answers.

The key ingredient is streaming. A conventional request-and-response call waits until the full reply is ready, then sends it all at once. Real-time voice agents stream audio continuously: they start speaking before they've finished "thinking," just like a person would.

🎤 You speak (microphone input)
〰️ Audio stream (WebSocket / PCM16)
🧠 Gemini Live (real-time LLM)
🔊 AI speaks (streaming audio out)
Key insight

The whole pipeline — from your voice to the AI's voice — can complete in under 600ms on a good connection. That's fast enough to feel like a real conversation, not a query-and-response system.
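To make the difference between buffering and streaming concrete, here is a minimal sketch. It does not call any real API: `fakeModelStream` is a hypothetical stand-in for a model that produces audio chunk by chunk, and the log simply records the order of events.

```javascript
// Hypothetical stand-in for a model that produces audio chunk by chunk.
async function* fakeModelStream(log) {
  for (const chunk of ["chunk-1", "chunk-2", "chunk-3"]) {
    log.push(`produced ${chunk}`);
    yield chunk;
  }
}

// Buffered: collect every chunk first, then play them all at the end.
async function playBuffered(log) {
  const all = [];
  for await (const chunk of fakeModelStream(log)) all.push(chunk);
  for (const chunk of all) log.push(`played ${chunk}`);
  return log;
}

// Streaming: play each chunk as soon as it arrives.
async function playStreaming(log) {
  for await (const chunk of fakeModelStream(log)) log.push(`played ${chunk}`);
  return log;
}
```

With this stand-in, `playStreaming` logs `played chunk-1` before `chunk-2` has even been produced, while `playBuffered` logs all three `produced` entries before anything is played. That ordering is what makes a streamed reply feel immediate.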

Gemini Multimodal Live API

Google's Gemini Multimodal Live API (introduced with Gemini 2.0) is purpose-built for real-time, low-latency interaction. Unlike the standard Gemini API where you send a request and wait for a complete response, the Live API opens a persistent WebSocket connection that streams data both ways simultaneously.
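One practical consequence of a persistent two-way connection is that the client has to track session state. The Live API documentation describes the server replying with a `setupComplete` message once the session is configured, so a client would typically hold back audio until then. The sketch below assumes that handshake; `createLiveSession` and its `socket` parameter (anything with a `send` method) are hypothetical names, and the `realtimeInput` shape mirrors the skeleton later in this module.

```javascript
// Hypothetical helper: gate outgoing audio until the server confirms setup.
function createLiveSession(socket) {
  let ready = false;
  const pending = [];

  return {
    // Call this with each parsed server message.
    handleServerMessage(msg) {
      if (msg?.setupComplete !== undefined && !ready) {
        ready = true;
        // Flush anything that was queued while waiting for the handshake.
        while (pending.length > 0) socket.send(pending.shift());
      }
    },
    // Send audio immediately when ready, otherwise queue it.
    sendAudio(base64Chunk) {
      const payload = JSON.stringify({
        realtimeInput: {
          mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Chunk }],
        },
      });
      if (ready) socket.send(payload);
      else pending.push(payload);
    },
    isReady: () => ready,
  };
}
```

Queuing is only one possible design. A real agent might prefer to drop audio captured before the handshake so the model never hears stale speech.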

What makes it "multimodal" is that you can send not just audio, but video frames too — meaning a voice agent could also see what you're pointing a camera at and respond to that context. A customer shows their phone to a camera and says "does this go with your menu?" — the agent can see the image and answer.
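As a rough sketch of how video could ride along with audio, a camera frame can be wrapped in the same `realtimeInput` message shape the audio skeleton later in this module uses, with an image MIME type instead of an audio one. The helper names, the `captureFrame` and `send` callbacks, and the default of one frame per second are all assumptions for illustration.

```javascript
// Hypothetical helper: wrap one base64-encoded JPEG frame as a realtime input message.
function buildVideoFrameMessage(base64Jpeg) {
  return JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "image/jpeg", data: base64Jpeg }],
    },
  });
}

// Hypothetical helper: capture frames at a fixed rate and pass them to send().
// captureFrame() should return a base64 JPEG string, or null if no frame is available.
function startFrameLoop(captureFrame, send, framesPerSecond = 1) {
  const timer = setInterval(() => {
    const frame = captureFrame();
    if (frame) send(buildVideoFrameMessage(frame));
  }, 1000 / framesPerSecond);
  // Return a function that stops the loop.
  return () => clearInterval(timer);
}
```

A low frame rate keeps bandwidth modest; the model only needs an occasional snapshot to follow what the camera is pointed at.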

Sub-second latency
Streaming architecture means the AI starts responding before it's finished processing, so there are no awkward waits.
🎙️ Native audio
Sends and receives raw PCM16 audio, with no speech-to-text or text-to-speech middleware needed.
📷 Vision + voice
Stream video frames alongside audio. The model sees and hears at the same time, responding to both.
🔁 Interruption-aware
The agent can detect when you start talking mid-response and naturally stop, just like a human would.
📋 System instructions
Set a custom persona and knowledge base: your café bot knows your full menu, your gallery agent knows your artists.
🌐 Free to start
Google AI Studio free tier includes Live API access. No credit card required to prototype.
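"Raw PCM16" means each audio sample is a 16-bit signed integer. Browsers capture microphone audio as floats between -1 and 1, so a client needs a small conversion step before sending. The sketch below shows one common way to do it; the function names are illustrative, not part of any Gemini library.

```javascript
// Convert Web Audio samples (floats in [-1, 1]) to 16-bit little-endian PCM bytes.
function floatToPcm16(float32Samples) {
  const bytes = new Uint8Array(float32Samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, float32Samples[i]));
    // Scale negatives by 32768 and positives by 32767 to fill the int16 range.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(value), true); // true = little-endian
  }
  return bytes;
}

// Base64-encode raw bytes so they can travel inside a JSON message.
function bytesToBase64(bytes) {
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// 16,000 samples/s x 2 bytes/sample = 32,000 bytes/s, so 100 ms of audio is 3,200 bytes.
function pcm16ByteLength(sampleRate, milliseconds) {
  return Math.round(sampleRate * 2 * (milliseconds / 1000));
}
```

The arithmetic in `pcm16ByteLength` explains why small, frequent chunks are practical: a tenth of a second of 16 kHz mono audio is only a few kilobytes.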

How to build one

You don't need to be a programmer to understand the shape of a voice agent. Here's the four-part recipe: open a connection, send a setup message, stream microphone audio, then receive and play the reply.

// Minimal JavaScript voice agent skeleton
// Full example: https://github.com/google-gemini/multimodal-live-api-web-console
// API_KEY and playAudioChunk() are placeholders you supply yourself.

// Step 1: open the WebSocket connection
const ws = new WebSocket(
  `wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=${API_KEY}`
);

// Step 2: send setup (model, output type, persona)
ws.addEventListener('open', () => {
  ws.send(JSON.stringify({
    setup: {
      model: "models/gemini-2.0-flash-live-001",
      generation_config: { response_modalities: ["AUDIO"] },
      system_instruction: {
        parts: [{ text: "You are Kia, the friendly voice assistant for Raglan Roast Café. You know our full menu and opening hours." }]
      }
    }
  }));
});

// Step 3: stream mic audio as base64-encoded 16 kHz PCM chunks
function sendAudioChunk(base64Chunk) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Chunk }]
    }
  }));
}

// Step 4: receive and play
ws.addEventListener('message', async (event) => {
  // The server may deliver JSON as binary frames, which browsers expose as a Blob.
  const raw = typeof event.data === 'string' ? event.data : await event.data.text();
  const msg = JSON.parse(raw);
  const audioPart = msg?.serverContent?.modelTurn?.parts
    ?.find(p => p.inlineData?.mimeType?.includes('audio'));
  if (audioPart) playAudioChunk(audioPart.inlineData.data);
});
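The skeleton plays audio as it arrives but does not yet handle interruptions. One possible approach, sketched below, is to keep incoming audio in a queue and empty it when the server signals that the user has started talking. The `interrupted` and `turnComplete` fields follow the Live API's documented `serverContent` message; the queue and the `handleServerContent` function are hypothetical helpers, not library code.

```javascript
// Hypothetical playback queue: holds audio chunks waiting to be played.
function createPlaybackQueue() {
  const chunks = [];
  return {
    enqueue: (chunk) => chunks.push(chunk),
    next: () => chunks.shift(),
    clear: () => { chunks.length = 0; },
    size: () => chunks.length,
  };
}

// Route one parsed server message: drop queued audio on interruption,
// otherwise queue any new audio parts. Returns a label describing what happened.
function handleServerContent(msg, queue) {
  const content = msg?.serverContent;
  if (!content) return "ignored";
  if (content.interrupted) {
    queue.clear();
    return "interrupted";
  }
  const parts = content.modelTurn?.parts ?? [];
  for (const part of parts) {
    if (part.inlineData?.mimeType?.includes("audio")) queue.enqueue(part.inlineData.data);
  }
  return content.turnComplete ? "turn-complete" : "content";
}
```

Clearing the queue is what makes the agent fall silent promptly. Without it, audio that was already received would keep playing over the person who interrupted.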
Try it without writing code

Google AI Studio has a built-in "Live" mode — click the microphone icon in any conversation. This is the same Gemini Live API running in the browser. You can test your system prompt idea there before writing a single line of code.

Real use cases for Raglan businesses

Voice agents aren't just for big tech companies. Here's what they could look like for businesses in Raglan right now — each one buildable in an afternoon with AI Studio's free tier.

☕ Café / restaurant
Phone order assistant
A voice agent that answers your business phone, takes orders, answers menu questions, and tells callers your hours — all without interrupting your staff. Forwards complicated calls to you.
🏄 Surf school
Booking & conditions bot
Callers ask about lesson availability, beginner requirements, and pricing. The agent answers in real time and takes their name and contact. Bookings land in a simple spreadsheet.
🎨 Gallery / studio
Audio guide
Visitors scan a QR code to launch a voice guide that knows every piece in the exhibition. They can ask "who made this?" or "what's the story behind this piece?" — and the guide answers naturally.
🛖 Accommodation
Late-night concierge
Guests arriving after hours can ask the voice agent about check-in, WiFi, parking, and local recommendations — without waking anyone up.
📚 Education
Tutoring companion
Students speak their question out loud and get a voice explanation back. More accessible than text for students who struggle with reading, and more natural for spoken-language subjects like te reo Māori.
♿ Accessibility
Eyes-free assistant
Users who can't easily type — or prefer not to — get full AI capabilities through voice alone. The multimodal version can also describe what the phone camera sees, helping low-vision users navigate the world.

When to use voice — and when not to

Voice agents are powerful but not always the right choice. Here's how they compare to the text-based AI interfaces you've already learned:

Scenario | Text chat | Voice agent
Hands-free situations (driving, cooking) | ✗ Needs screen | ✓ Natural fit
Customer-facing phone line | ✗ Wrong medium | ✓ Perfect
Writing a long document | ✓ Better output | ✗ Awkward
Quick question while doing something | ~ OK if near keyboard | ✓ Much faster
Code generation | ✓ Much better | ✗ Hard to follow verbally
Accessibility (vision/motor) | ~ Screen reader needed | ✓ Designed for it
Noisy environment (loud café) | ✓ Works fine | ✗ Transcription errors
Complex multi-step tasks | ✓ Easier to review | ~ Possible but harder
Design principle

The best voice agents are designed for voice from the beginning — short, clear responses, natural pauses, no reading lists of bullet points. If your agent sounds like it's reading a webpage aloud, rebuild it for conversation.
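One way to apply this principle while testing is to run the agent's text replies through a quick check before they are ever spoken. The sketch below is a hypothetical heuristic, not an established tool: the function name, the three rules, and the 40-word default are all assumptions you would tune for your own agent.

```javascript
// Hypothetical heuristic: flag reply text that would sound awkward when spoken aloud.
function findVoiceIssues(replyText, maxWords = 40) {
  const issues = [];
  // Bullet or numbered list markers at the start of any line.
  if (/^\s*(?:[-*•]|\d+[.)])\s+/m.test(replyText)) issues.push("contains a list");
  // URLs are hard to follow by ear.
  if (/https?:\/\//i.test(replyText)) issues.push("contains a URL");
  // Long replies feel like a lecture rather than a conversation.
  const wordCount = replyText.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount > maxWords) issues.push(`too long (${wordCount} words)`);
  return issues;
}
```

An empty result does not prove a reply sounds natural, but a non-empty one is a useful prompt to tighten the system instructions.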

Design your voice agent

Before writing any code, the most important step is designing the experience. What would the conversation feel like? What does your agent know? What's its personality?

Answer the questions below to sketch your voice agent concept. You can paste these answers into Google AI Studio's system prompt and test it immediately — no code required.

Voice Agent Design Canvas
Think of a real use case — your business, your school, your community. Fill this out as a system prompt you'd give to Gemini.
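If you later move from AI Studio to code, the canvas answers can be assembled into a system prompt programmatically. The sketch below assumes four hypothetical canvas fields (`agentName`, `business`, `personality`, `knowledge`); rename them to match your own design questions. The setup message reuses the model name and field names from the skeleton earlier in this module.

```javascript
// Hypothetical canvas fields; adjust to match your own design questions.
function buildSystemPrompt({ agentName, business, personality, knowledge }) {
  const lines = [
    `You are ${agentName}, the voice assistant for ${business}.`,
    `Personality: ${personality}.`,
    "Keep every reply short and conversational, because it will be spoken aloud.",
    "Facts you know:",
    ...knowledge.map((fact) => `- ${fact}`),
    "If you do not know an answer, say so and offer to pass the question on to staff.",
  ];
  return lines.join("\n");
}

// Wrap the prompt in the setup message shape used by the skeleton in this module.
function buildSetupMessage(systemPrompt, model = "models/gemini-2.0-flash-live-001") {
  return {
    setup: {
      model,
      generation_config: { response_modalities: ["AUDIO"] },
      system_instruction: { parts: [{ text: systemPrompt }] },
    },
  };
}
```

Keeping the prompt builder separate from the setup message makes it easy to paste the same prompt text into AI Studio for a quick test before wiring up any audio.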


You've covered real-time voice agents
From WebSocket streaming to Raglan use cases — voice is one of the most human-feeling interfaces AI can take.