Voice Assistant Pipeline
Chain Speech-to-Text, LLM, and Text-to-Speech into a conversational voice interface. Build your own Alexa/Siri.
The Voice Assistant Loop
A voice assistant is a pipeline of three building blocks you already know, connected in a loop:
1. Speech-to-Text (Whisper)
Convert the user's spoken audio into text. Whisper handles accents, background noise, and multiple languages.
2. LLM Processing
Send the transcribed text to an LLM (GPT-4, Claude, etc.) to generate a response. This is where the "intelligence" happens.
3. Text-to-Speech (TTS)
Convert the LLM's text response back into spoken audio. Modern TTS sounds remarkably natural.
The Loop
The challenge isn't building each block - you already know how to build all three. The challenge is latency: users expect a response in under 3 seconds.
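Stripped to its skeleton, the loop is only a few calls. This sketch uses the helper functions (record_audio, transcribe_audio, get_llm_response, text_to_speech, play_audio) defined in the complete example below:

history = []  # shared conversation context
while True:
    audio, sr = record_audio()               # capture a few seconds of speech
    text = transcribe_audio(audio, sr)       # 1. speech-to-text
    reply = get_llm_response(text, history)  # 2. LLM
    play_audio(text_to_speech(reply))        # 3. text-to-speech + playback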
Complete Voice Assistant Code
Here's a complete, working voice assistant using OpenAI's APIs for all three components:
Install Dependencies

pip install openai sounddevice soundfile numpy pygame

Complete Voice Assistant Loop (Python, OpenAI APIs)

from openai import OpenAI
import sounddevice as sd
import soundfile as sf
import numpy as np
import time
client = OpenAI()
def record_audio(duration=5, sample_rate=16000):
"""Record audio from microphone."""
print("Listening...")
audio = sd.rec(
int(duration * sample_rate),
samplerate=sample_rate,
channels=1,
dtype='float32'
)
sd.wait() # Wait until recording is finished
return audio, sample_rate
def transcribe_audio(audio, sample_rate):
"""Convert audio to text using Whisper."""
# Save to temporary file (Whisper API requires file)
sf.write("temp_input.wav", audio, sample_rate)
with open("temp_input.wav", "rb") as f:
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=f
)
return transcript.text
def get_llm_response(user_text, conversation_history):
"""Get response from LLM."""
conversation_history.append({
"role": "user",
"content": user_text
})
response = client.chat.completions.create(
model="gpt-4",
messages=conversation_history,
max_tokens=150 # Keep responses concise for voice
)
assistant_message = response.choices[0].message.content
conversation_history.append({
"role": "assistant",
"content": assistant_message
})
return assistant_message
def text_to_speech(text):
"""Convert text to speech using OpenAI TTS."""
response = client.audio.speech.create(
model="tts-1",
voice="alloy", # Options: alloy, echo, fable, onyx, nova, shimmer
input=text
)
# Save and play the audio
response.stream_to_file("temp_response.mp3")
return "temp_response.mp3"
def play_audio(file_path):
"""Play audio file."""
import pygame
pygame.mixer.init()
pygame.mixer.music.load(file_path)
pygame.mixer.music.play()
while pygame.mixer.music.get_busy():
time.sleep(0.1)
def voice_assistant():
"""Main voice assistant loop."""
print("Voice Assistant Ready")
print("=" * 40)
conversation_history = [{
"role": "system",
"content": "You are a helpful voice assistant. Keep responses concise and conversational, under 2-3 sentences."
}]
while True:
try:
# 1. Record user speech
audio, sr = record_audio(duration=5)
# 2. Transcribe with Whisper
start = time.time()
user_text = transcribe_audio(audio, sr)
transcribe_time = time.time() - start
print(f"You said: {user_text}")
if not user_text.strip():
print("(No speech detected)")
continue
# Check for exit commands
if any(word in user_text.lower() for word in ["goodbye", "exit", "quit"]):
print("Goodbye!")
break
# 3. Get LLM response
start = time.time()
response_text = get_llm_response(user_text, conversation_history)
llm_time = time.time() - start
print(f"Assistant: {response_text}")
# 4. Text-to-speech
start = time.time()
audio_file = text_to_speech(response_text)
tts_time = time.time() - start
# 5. Play response
play_audio(audio_file)
# Print timing breakdown
total_time = transcribe_time + llm_time + tts_time
print(f"[Whisper: {transcribe_time:.2f}s | LLM: {llm_time:.2f}s | TTS: {tts_time:.2f}s | Total: {total_time:.2f}s]")
print("-" * 40)
except KeyboardInterrupt:
print("\nExiting...")
break
if __name__ == "__main__":
    voice_assistant()

Understanding Latency
Latency is the enemy of natural conversation. Here's where time goes in a typical request:
Typical Latency Breakdown (API-based)

| Stage | Typical Latency |
|---|---|
| Whisper transcription | 500-800ms |
| LLM response (GPT-4) | 1000-2000ms |
| TTS generation | 300-500ms |
| Total | 2-3.5 seconds |
Why This Matters
Human conversation has natural pauses of 200-500ms; anything over 2 seconds feels unnatural. Alexa/Siri target <1.5s end-to-end, which requires streaming and local processing. With the optimized numbers from the Latency Reference table at the end of this section (100-200ms local transcription, 300-600ms for a fast LLM, ~100ms streaming TTS), a total of 0.5-1 second is achievable.
Optimizing Latency
There are several strategies to reduce perceived latency:
1. Streaming TTS
Start playing audio as soon as the first chunk is ready, rather than waiting for the complete response.
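The loop below feeds chunks to an audio_player object, which the snippet leaves abstract. One minimal stand-in uses PyAudio, assuming you request raw PCM instead of Opus (response_format="pcm" returns 24 kHz, 16-bit mono samples; Opus would need decoding before playback):

import pyaudio

class PCMPlayer:
    """Minimal stand-in for audio_player: plays raw 16-bit mono PCM at 24 kHz."""
    def __init__(self, rate=24000):
        self._pa = pyaudio.PyAudio()
        self._stream = self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=rate, output=True
        )

    def feed(self, chunk):
        # Blocking write; each chunk plays as soon as it arrives
        self._stream.write(chunk)

    def close(self):
        self._stream.stop_stream()
        self._stream.close()
        self._pa.terminate()

audio_player = PCMPlayer()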
# Stream TTS output
response = client.audio.speech.create(
model="tts-1",
voice="alloy",
input=text,
response_format="opus" # Better for streaming
)
# Process chunks as they arrive
for chunk in response.iter_bytes(chunk_size=4096):
    audio_player.feed(chunk)  # Play immediately (audio_player: your streaming playback object)

2. Faster Local Models
Use faster-whisper locally to cut transcription time by 4x. Consider local LLMs for simple queries.
from faster_whisper import WhisperModel
# Local Whisper: ~100-200ms for 5s audio
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=1)
text = " ".join([s.text for s in segments])3. Streaming LLM + TTS Pipeline
Stream LLM tokens directly into TTS. Start speaking the first sentence while still generating the rest.
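The snippet below calls a speak_async helper that isn't part of any SDK. One minimal way to implement it, reusing text_to_speech and play_audio from the complete example above, is a background worker thread fed by a queue:

import threading
import queue

_sentence_queue = queue.Queue()

def _tts_worker():
    # Synthesize and play queued sentences one at a time, in arrival order
    while True:
        sentence = _sentence_queue.get()
        play_audio(text_to_speech(sentence))

threading.Thread(target=_tts_worker, daemon=True).start()

def speak_async(text):
    """Queue a sentence for playback without blocking token generation."""
    _sentence_queue.put(text)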
# Stream LLM response
stream = client.chat.completions.create(
model="gpt-4",
messages=messages,
stream=True
)
buffer = ""
for chunk in stream:
if chunk.choices[0].delta.content:
buffer += chunk.choices[0].delta.content
# When we hit a sentence boundary, send to TTS
if buffer.endswith(('.', '!', '?')):
speak_async(buffer)
buffer = ""4. Use Faster Models
GPT-3.5-turbo is 3-5x faster than GPT-4. For simple voice queries, it's often sufficient.
| Model | Typical Latency |
|---|---|
| GPT-4 | 1500-2500ms |
| GPT-4-turbo | 800-1500ms |
| GPT-3.5-turbo | 300-600ms |
| Claude 3 Haiku | 200-400ms |
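One way to act on this table is to route queries by difficulty: send short, simple turns to a fast model and reserve the larger model for the rest. A sketch, where the 12-word threshold is an illustrative assumption rather than a benchmark:

def pick_model(user_text):
    """Illustrative heuristic: send short queries to the faster, cheaper model."""
    return "gpt-3.5-turbo" if len(user_text.split()) <= 12 else "gpt-4"

response = client.chat.completions.create(
    model=pick_model(user_text),
    messages=messages,
    max_tokens=150
)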
Optional: Wake Word Detection
Wake words ("Hey Siri", "Alexa") let the assistant listen continuously without processing everything. This runs locally with minimal CPU.
Install Porcupine (Wake Word Engine)

pip install pvporcupine pyaudio

Wake Word Detection with Porcupine (Python, Picovoice)

import pvporcupine
import pyaudio
import struct
# Initialize Porcupine with built-in wake word
porcupine = pvporcupine.create(
access_key='YOUR_PICOVOICE_ACCESS_KEY',
keywords=['computer'] # Built-in wake words
)
# Audio stream setup
pa = pyaudio.PyAudio()
audio_stream = pa.open(
rate=porcupine.sample_rate,
channels=1,
format=pyaudio.paInt16,
input=True,
frames_per_buffer=porcupine.frame_length
)
print("Listening for wake word 'computer'...")
while True:
# Read audio frame
pcm = audio_stream.read(porcupine.frame_length)
pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
# Check for wake word
keyword_index = porcupine.process(pcm)
if keyword_index >= 0:
print("Wake word detected! Listening for command...")
# Now record and process the actual command
audio, sr = record_audio(duration=5)
        process_voice_command(audio, sr)  # hand the recording to the STT -> LLM -> TTS pipeline above

Porcupine runs entirely on-device: it idles at under 1% CPU and detects the wake word with low latency, so the assistant can listen continuously.
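One detail the loop above omits is cleanup: Porcupine and PyAudio both hold native resources that should be released on shutdown. A minimal sketch wrapping the same loop in try/finally:

try:
    while True:
        pcm = audio_stream.read(porcupine.frame_length)
        pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
        if porcupine.process(pcm) >= 0:
            print("Wake word detected!")
finally:
    porcupine.delete()    # free the Porcupine engine
    audio_stream.close()  # stop the PyAudio stream
    pa.terminate()        # release PortAudio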
Error Handling and Edge Cases
Voice interfaces need robust error handling for a good user experience:
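The snippets in this section call a speak() helper that the full program above never defines; a minimal version simply chains the existing text_to_speech and play_audio functions:

def speak(text):
    """Synthesize and play a short message using the helpers defined earlier."""
    play_audio(text_to_speech(text))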
Silence / No Speech Detected
def has_speech(audio, threshold=0.01):
"""Check if audio contains speech."""
rms = np.sqrt(np.mean(audio**2))
return rms > threshold
audio, sr = record_audio(duration=5)
if not has_speech(audio):
speak("I didn't catch that. Could you repeat?")
    continue  # back to the top of the main loop

Timeout Handling
import asyncio

# Assumed helper: an async wrapper around the blocking get_llm_response call,
# reusing the shared conversation_history from the main loop
async def get_llm_response_async(text):
    return await asyncio.to_thread(get_llm_response, text, conversation_history)

async def get_response_with_timeout(text, timeout=10):
"""Get LLM response with timeout."""
try:
response = await asyncio.wait_for(
get_llm_response_async(text),
timeout=timeout
)
return response
except asyncio.TimeoutError:
return "I'm taking too long to think. Let me try again."
# Alternative: Use shorter responses for voice
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=messages,
max_tokens=100, # Limit length
timeout=5.0
)

Unclear Speech Handling
# Check transcription confidence (if available)
# Or use heuristics for uncertain transcriptions
def needs_clarification(text):
"""Check if we should ask for clarification."""
# Very short responses often indicate poor transcription
if len(text.split()) < 2:
return True
# Common misheard patterns
if text.lower() in ["um", "uh", "hmm", "[inaudible]"]:
return True
return False
if needs_clarification(user_text):
speak("Sorry, I didn't understand. Could you say that again?")Conversation State Management
For multi-turn conversations, maintain context between exchanges:
Stateful Conversation (Context Management)

class VoiceConversation:
def __init__(self, system_prompt=None):
self.messages = []
if system_prompt:
self.messages.append({
"role": "system",
"content": system_prompt
})
self.max_history = 10 # Keep last N exchanges
def add_user_message(self, text):
self.messages.append({"role": "user", "content": text})
self._trim_history()
def add_assistant_message(self, text):
self.messages.append({"role": "assistant", "content": text})
self._trim_history()
def _trim_history(self):
"""Keep conversation within token limits."""
# Keep system prompt + last N messages
if len(self.messages) > self.max_history + 1:
system = self.messages[0] if self.messages[0]["role"] == "system" else None
recent = self.messages[-(self.max_history):]
self.messages = [system] + recent if system else recent
def get_response(self, user_text):
self.add_user_message(user_text)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=self.messages,
max_tokens=150
)
assistant_text = response.choices[0].message.content
self.add_assistant_message(assistant_text)
return assistant_text
# Usage
conversation = VoiceConversation(
system_prompt="You are a helpful voice assistant. Be concise."
)
# First exchange
response = conversation.get_response("What's the weather like?")
# "I don't have access to weather data. Would you like me to help with something else?"
# Follow-up (has context)
response = conversation.get_response("Tell me a joke instead")
# "Why don't scientists trust atoms? Because they make up everything!"Key Takeaways
1. Three-stage pipeline - Speech-to-Text (Whisper) + LLM + Text-to-Speech (TTS) creates a voice loop.
2. Latency is critical - Target <3s total. Streaming and faster models are essential for natural conversation.
3. Wake words enable always-on - Local detection with <1% CPU lets you listen continuously.
4. Handle errors gracefully - Silence detection, timeouts, and clarification requests make for a robust UX.
Latency Reference
| Stage | API-based | Optimized |
|---|---|---|
| Whisper transcription | 500-800ms | 100-200ms (local) |
| LLM response | 1000-2000ms (GPT-4) | 300-600ms (GPT-3.5) |
| TTS generation | 300-500ms | ~100ms (streaming) |
| Total | 2-3.5 seconds | 500ms-1 second |
Practice Exercise
Build and improve your voice assistant:
1. Run the basic voice assistant code. Have a conversation and note the latency.
2. Switch from GPT-4 to GPT-3.5-turbo. How much faster is it?
3. Add the silence detection code. Does it improve the experience?
4. If you have a GPU, try faster-whisper locally and compare latency.