Branch: Real-Time Voice Agents

01 / The shift

Why voice is different

You've already used AI in a text window — type a question, wait, read the answer. Voice agents are a completely different experience: you speak, the AI responds in under a second, and a natural conversation unfolds. No typing, no waiting for paragraphs to load.

This isn't like Siri or Alexa, which route through rigid command parsing. Modern voice agents use a large language model directly — they can understand nuance, handle unexpected questions, and respond in context. A customer asks your café's voice bot "do you have anything vegan that isn't salad?" and it actually answers.

The key ingredient is streaming. A conventional request-and-response call waits until the full answer is ready, then sends it all at once. Real-time voice agents stream audio continuously: they start speaking before they've finished "thinking," just like a person would.

🎤 You speak (microphone input) → 〰️ Audio stream (WebSocket / PCM16) → 🧠 Gemini Live (real-time LLM) → 🔊 AI speaks (streaming audio out)

Key insight

The whole pipeline — from your voice to the AI's voice — can complete in under 600ms on a good connection. That's fast enough to feel like a real conversation, not a query-and-response system.

02 / The technology

Gemini Multimodal Live API

Google's Gemini Multimodal Live API (introduced with Gemini 2.0) is purpose-built for real-time, low-latency interaction. Unlike the standard Gemini API where you send a request and wait for a complete response, the Live API opens a persistent WebSocket connection that streams data both ways simultaneously.

What makes it "multimodal" is that you can send not just audio, but video frames too — meaning a voice agent could also see what you're pointing a camera at and respond to that context. A customer shows their phone to a camera and says "does this go with your menu?" — the agent can see the image and answer.
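As a rough sketch of how a camera frame might travel over the same connection, the helper below wraps one base64-encoded JPEG in a realtimeInput message. It assumes frames use the same mediaChunks envelope as audio, only with an image MIME type; buildFrameMessage is a hypothetical name, not part of any SDK.

```javascript
// Sketch only: wrap one base64-encoded JPEG frame in a realtimeInput message.
// Assumption: video frames share the mediaChunks envelope used for audio.
// buildFrameMessage is a hypothetical helper name.
function buildFrameMessage(base64Jpeg) {
  return JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "image/jpeg", data: base64Jpeg }]
    }
  });
}

// Browser usage, given an open WebSocket `ws` and a <canvas> element
// holding the current camera frame:
//   const base64 = canvas.toDataURL("image/jpeg").split(",")[1];
//   ws.send(buildFrameMessage(base64));
```

Sending a frame every second or so, rather than full-rate video, is usually enough for "what am I looking at?" questions and keeps bandwidth low.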

⚡ Sub-second latency

Streaming architecture means the AI starts responding before it's finished processing — no awkward waits.

🎙️ Native audio

Sends and receives raw PCM16 audio — no speech-to-text or text-to-speech middleware needed.

📷 Vision + voice

Stream video frames alongside audio. The model sees and hears at the same time, responding to both.

🔁 Interruption-aware

The agent can detect when you start talking mid-response and naturally stop, just like a human would.

📋 System instructions

Set a custom persona and knowledge base — your café bot knows your full menu, your gallery agent knows your artists.

🌐 Free to start

Google AI Studio free tier includes Live API access. No credit card required to prototype.
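The interruption behaviour described above also needs a little cooperation from the client: audio that has already been received but not yet played should be thrown away when the caller starts talking. Here is a minimal sketch, assuming the server flags an interrupted turn with serverContent.interrupted; createPlaybackQueue and handleServerMessage are hypothetical helper names.

```javascript
// Sketch only: a queue of pending audio chunks that is emptied on interruption.
// Assumption: an interrupted turn arrives as serverContent.interrupted === true.
function createPlaybackQueue() {
  const chunks = [];
  return {
    enqueue(chunk) { chunks.push(chunk); },
    clear() { chunks.length = 0; },
    size() { return chunks.length; }
  };
}

function handleServerMessage(msg, queue) {
  const content = msg?.serverContent;
  if (!content) return;

  if (content.interrupted) {
    // The caller started talking: drop audio that has not been played yet.
    queue.clear();
    return;
  }

  // Otherwise collect every audio part for playback, in order.
  for (const part of content.modelTurn?.parts ?? []) {
    if (part.inlineData?.mimeType?.includes("audio")) {
      queue.enqueue(part.inlineData.data);
    }
  }
}
```

In a real app you would also stop whatever chunk is currently playing; the queue only covers audio that is still waiting.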

03 / Under the hood

How to build one

You don't need to be a programmer to understand the shape of a voice agent. Here's the four-part recipe:

1. Open a WebSocket connection

Your app connects to wss://generativelanguage.googleapis.com/ws/... with your API key. The connection stays open for the whole conversation.

2. Send a setup message

Tell the API which model to use (gemini-2.0-flash-live-001), what persona the agent should have, what it knows, and what voice to use. This is your system prompt in JSON form.

3. Stream audio chunks

Capture microphone audio, encode it as base64 PCM16 chunks (16 kHz, mono), and send each chunk as a realtimeInput message. Do this continuously while recording.

4. Receive and play audio

The server streams back serverContent messages containing base64 audio chunks. Decode each chunk and queue it in a Web Audio API buffer so playback starts right away and the chunks play back to back.
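Step 3 hides a small conversion: browsers hand you microphone audio as 32-bit floats between -1 and 1, while the API expects base64-encoded 16-bit PCM. A minimal sketch of that conversion follows; floatTo16BitPCM and bytesToBase64 are hypothetical helper names, and the code assumes the samples are already at 16 kHz mono.

```javascript
// Sketch only: turn Float32 microphone samples into base64 PCM16.
// Assumption: the input is already 16 kHz mono (resample first if it is not).

// Convert samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatTo16BitPCM(float32Samples) {
  const buffer = new ArrayBuffer(float32Samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return new Uint8Array(buffer);
}

// Encode raw bytes as a base64 string.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Usage: const base64Chunk = bytesToBase64(floatTo16BitPCM(micSamples));
```

One way to get 16 kHz audio in the first place is to create the capture context with new AudioContext({ sampleRate: 16000 }), though browser support for custom rates varies.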

// Minimal JavaScript voice agent skeleton
// Full example: https://github.com/google-gemini/multimodal-live-api-web-console
// API_KEY and playAudioChunk() are assumed to be defined elsewhere in your app.

const ws = new WebSocket(
  `wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=${API_KEY}`
);

// Step 2 — send setup
ws.addEventListener('open', () => {
  ws.send(JSON.stringify({
    setup: {
      model: "models/gemini-2.0-flash-live-001",
      generation_config: { response_modalities: ["AUDIO"] },
      system_instruction: {
        parts: [{ text: "You are Kia, the friendly voice assistant for Raglan Roast Café. You know our full menu and opening hours." }]
      }
    }
  }));
});

// Step 3 — stream mic audio
function sendAudioChunk(base64Chunk) {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Chunk }]
    }
  }));
}

// Step 4 — receive and play
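// Messages may arrive as binary frames (Blob) rather than strings, so read them as text first.
ws.addEventListener('message', async (event) => {
  const raw = typeof event.data === 'string' ? event.data : await event.data.text();
  const msg = JSON.parse(raw);
  const parts = msg?.serverContent?.modelTurn?.parts ?? [];
  // One message can carry several parts; play every audio part in order.
  for (const part of parts) {
    if (part.inlineData?.mimeType?.includes('audio')) playAudioChunk(part.inlineData.data);
  }
});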
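The skeleton calls playAudioChunk without defining it. One possible shape for it is sketched below: decode the base64 PCM16 back into floats, then schedule each chunk to start where the previous one ends so there are no gaps or overlaps. base64ToFloat32 and createPlayer are hypothetical names, and the 24 kHz default is an assumption about the output sample rate that you should check against the current API documentation.

```javascript
// Sketch only: decode base64 PCM16 and schedule gapless playback.

// Decode base64 little-endian PCM16 into Float32 samples in [-1, 1).
function base64ToFloat32(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return samples;
}

// Returns a playAudioChunk(base64) function bound to an AudioContext.
// Assumption: model audio is 24 kHz mono; pass another rate if needed.
function createPlayer(audioContext, sampleRate = 24000) {
  let nextStart = 0; // when the next chunk should begin, in context time
  return function playAudioChunk(base64) {
    const samples = base64ToFloat32(base64);
    if (samples.length === 0) return; // createBuffer rejects empty buffers
    const buffer = audioContext.createBuffer(1, samples.length, sampleRate);
    buffer.copyToChannel(samples, 0);
    const source = audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(audioContext.destination);
    nextStart = Math.max(nextStart, audioContext.currentTime);
    source.start(nextStart);
    nextStart += buffer.duration;
  };
}

// Browser usage: const playAudioChunk = createPlayer(new AudioContext());
```

Scheduling against the context clock, rather than calling start() immediately for every chunk, is what keeps the voice from stuttering when chunks arrive in bursts.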

Try it without writing code

Google AI Studio has a built-in "Live" mode — click the microphone icon in any conversation. This is the same Gemini Live API running in the browser. You can test your system prompt idea there before writing a single line of code.

04 / Local applications

Real use cases for Raglan businesses

Voice agents aren't just for big tech companies. Here's what they could look like for businesses in Raglan right now. Each one can be prototyped in an afternoon with AI Studio's free tier.

☕ Café / restaurant

Phone order assistant

A voice agent that answers your business phone, takes orders, answers menu questions, and tells callers your hours, all without interrupting your staff. It forwards complicated calls to you.

🏄 Surf school

Booking & conditions bot

Callers ask about lesson availability, beginner requirements, and pricing. The agent answers in real time and takes their name and contact. Bookings land in a simple spreadsheet.

🎨 Gallery / studio

Audio guide

Visitors scan a QR code to launch a voice guide that knows every piece in the exhibition. They can ask "who made this?" or "what's the story behind this piece?" — and the guide answers naturally.

🛖 Accommodation

Late-night concierge

Guests arriving after hours can ask the voice agent about check-in, WiFi, parking, and local recommendations — without waking anyone up.

📚 Education

Tutoring companion

Students speak their question out loud and get a voice explanation back. More accessible than text for students who struggle with reading, and more natural for spoken-language subjects like te reo Māori.

♿ Accessibility

Eyes-free assistant

Users who can't easily type — or prefer not to — get full AI capabilities through voice alone. The multimodal version can also describe what the phone camera sees, helping low-vision users navigate the world.

05 / Choosing your interface

When to use voice — and when not to

Voice agents are powerful but not always the right choice. Here's how they compare to the text-based AI interfaces you've already learned:

Scenario	Text chat	Voice agent
Hands-free situations (driving, cooking)	✗ Needs screen	✓ Natural fit
Customer-facing phone line	✗ Wrong medium	✓ Perfect
Writing a long document	✓ Better output	✗ Awkward
Quick question while doing something	~ OK if near keyboard	✓ Much faster
Code generation	✓ Much better	✗ Hard to follow verbally
Accessibility (vision/motor)	~ Screen reader needed	✓ Designed for it
Noisy environment (loud café)	✓ Works fine	✗ Speech gets misheard
Complex multi-step tasks	✓ Easier to review	~ Possible but harder

Design principle

The best voice agents are designed for voice from the beginning — short, clear responses, natural pauses, no reading lists of bullet points. If your agent sounds like it's reading a webpage aloud, rebuild it for conversation.

06 / Your turn

Design your voice agent

Before writing any code, the most important step is designing the experience. What would the conversation feel like? What does your agent know? What's its personality?

Answer the questions below to sketch your voice agent concept. You can paste these answers into Google AI Studio's system prompt and test it immediately — no code required.

Voice Agent Design Canvas

Think of a real use case — your business, your school, your community. Fill this out as a system prompt you'd give to Gemini.

What is this agent for? (one sentence)
What does the agent know? (list key facts)
Personality and tone
What should it NOT do or say?
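If you later move from AI Studio to code, the four canvas answers map fairly directly onto a setup message. The sketch below shows one way to assemble them. buildSetupMessage and the canvas field names are hypothetical, the speech_config nesting and the "Puck" voice name are assumptions to verify against the current API reference, and the example facts are made-up placeholders.

```javascript
// Sketch only: turn design-canvas answers into a Live API setup message.
// Assumptions: the speech_config structure and the "Puck" voice name.
function buildSetupMessage(canvas, voiceName = "Puck") {
  const prompt = [
    `Purpose: ${canvas.purpose}`,
    `What you know:\n${canvas.facts.map((fact) => `- ${fact}`).join("\n")}`,
    `Personality and tone: ${canvas.tone}`,
    `Never do or say: ${canvas.avoid}`,
    "Keep spoken answers short and conversational."
  ].join("\n\n");

  return {
    setup: {
      model: "models/gemini-2.0-flash-live-001",
      generation_config: {
        response_modalities: ["AUDIO"],
        speech_config: {
          voice_config: { prebuilt_voice_config: { voice_name: voiceName } }
        }
      },
      system_instruction: { parts: [{ text: prompt }] }
    }
  };
}

// Example with placeholder answers (replace with your own):
const exampleCanvas = {
  purpose: "Answer callers' questions for Raglan Roast Café.",
  facts: ["Opening hours: (your hours here)", "Menu: (your menu here)"],
  tone: "Friendly, relaxed, brief.",
  avoid: "Guessing prices or making promises about bookings."
};
const exampleSetup = buildSetupMessage(exampleCanvas);
// Send with: ws.send(JSON.stringify(exampleSetup));
```

Keeping the canvas as plain data means you can test the same answers in AI Studio first, then reuse them unchanged in code.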


🎙️ You've covered real-time voice agents

From WebSocket streaming to Raglan use cases — voice is one of the most human-feeling interfaces AI can take.