Voice Assistant Pipeline
From 1960s touch-tone menus to GPT-4o Realtime — the architecture, latency physics, and code behind conversational voice interfaces.
60 Years of Talking to Machines
Voice assistants didn't start with Siri. They are the product of six decades of converging advances in signal processing, speech recognition, natural language understanding, and synthesis — each generation constrained by the hardware and models available, each breakthrough redefining what "talking to a computer" could mean.
Understanding this history matters because today's architectural choices — cascaded vs end-to-end, streaming vs batch, on-device vs cloud — are direct responses to limitations discovered at each stage.
IBM Shoebox
At the 1962 World's Fair, IBM demonstrated the Shoebox — a machine the size of a shoebox that could recognize 16 spoken words: the digits 0–9 plus six commands like "plus" and "total." It used analog circuits to match formant frequencies — the resonant peaks in the audio spectrum that distinguish vowels. There was no learning: each word was a hand-tuned filter bank. But it was the first public demonstration that a machine could take spoken input and produce computed output.
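The matching scheme can be sketched in a few lines: treat each word as a hand-tuned template of formant frequencies and pick the nearest one. The template values below are illustrative, not IBM's actual filter settings.

```python
# Toy formant matcher in the spirit of the Shoebox: each "word" is a
# hand-tuned (F1, F2) formant template in Hz, and recognition is
# nearest-template matching. No learning anywhere.
TEMPLATES = {
    "zero":  (460, 1310),
    "one":   (640, 1190),
    "two":   (300, 870),
    "total": (590, 1850),
}

def recognize(f1: float, f2: float) -> str:
    """Pick the word whose stored formants are nearest (squared distance)."""
    return min(TEMPLATES,
               key=lambda w: (TEMPLATES[w][0] - f1) ** 2
                           + (TEMPLATES[w][1] - f2) ** 2)

print(recognize(310, 880))  # → two
```

Swap the analog filter bank for a dictionary lookup and the 1962 architecture fits in fifteen lines; that is the entire point of how constrained it was.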
IVR: The First Voice "Assistants"
Interactive Voice Response systems became the backbone of telephone customer service. Early IVR was DTMF-only ("Press 1 for billing"), but by the 1990s, systems like Dragon Systems' DragonDictate and AT&T's WATSON (no relation to IBM Watson) introduced speaker-independent recognition of limited vocabularies — typically 50–500 words within a constrained grammar.
The architecture was entirely rule-based: a finite-state grammar defined what the user could say, an HMM acoustic model matched audio to phonemes, and a decision tree determined the response. No language model. No generation. The "intelligence" was hand-authored dialog flows. Airlines, banks, and telecoms deployed millions of these systems, and many still run today.
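A hand-authored flow of this kind is just a state machine with a per-state grammar. A minimal sketch, with state names and phrases invented for illustration:

```python
# Finite-state IVR dialog: each state defines what the caller may say and
# where each recognized phrase leads. No language model, no generation.
FLOW = {
    "main_menu": {
        "prompt": "Say billing, support, or agent.",
        "grammar": {"billing": "billing_menu", "support": "support_menu",
                    "agent": "transfer"},
    },
    "billing_menu": {
        "prompt": "Say balance or payment.",
        "grammar": {"balance": "read_balance", "payment": "take_payment"},
    },
}

def step(state: str, utterance: str) -> str:
    """Advance the dialog; out-of-grammar input re-prompts the same state."""
    node = FLOW.get(state)
    if node is None:
        return state  # terminal node (transfer, read_balance, ...)
    return node["grammar"].get(utterance.strip().lower(), state)

print(step("main_menu", "billing"))  # → billing_menu
```

Everything the system "understands" is enumerated up front, which is exactly why these systems were reliable for airlines and infuriating for everyone who wanted to say anything else.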
Hidden Markov Models Dominate ASR
Jim Baker at CMU (1975) and the IBM Speech group led by Fred Jelinek pioneered Hidden Markov Models for speech recognition. The insight: treat speech as a sequence of hidden states (phonemes) generating observable features (spectral frames), and use the Baum-Welch algorithm to learn transition and emission probabilities from data.
"Every time I fire a linguist, the performance of the speech recognizer goes up."
GMM-HMM acoustic models combined with n-gram language models were the standard ASR stack for 30 years. Word error rates on clean read speech (the Wall Street Journal corpus) dropped from over 40% in the 1980s to roughly 5% by 2010 — but noisy, conversational speech remained brutally difficult.
— Baker, J. (1975). The DRAGON System. IEEE ICASSP.
— Rabiner, L. (1989). A Tutorial on HMMs. Proc. IEEE, 77(2), 257–286.
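The forward recursion at the heart of HMM decoding fits in a dozen lines. This toy uses two hidden states and discrete observation symbols with made-up probabilities; real systems replaced the discrete emission table with GMM densities over spectral frames, but the recursion is the same.

```python
STATES = (0, 1)
INIT  = (0.6, 0.4)                # P(state at t=0)
TRANS = ((0.7, 0.3), (0.4, 0.6))  # TRANS[i][j] = P(next state j | state i)
EMIT  = ((0.9, 0.1), (0.2, 0.8))  # EMIT[i][o]  = P(observe symbol o | state i)

def sequence_likelihood(obs: list[int]) -> float:
    """P(observation sequence) via the forward recursion alpha[t][j]."""
    alpha = [INIT[j] * EMIT[j][obs[0]] for j in STATES]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
                 for j in STATES]
    return sum(alpha)  # marginalize over the final hidden state

print(round(sequence_likelihood([0, 1, 0]), 4))  # → 0.1089
```

Baum-Welch then adjusts `TRANS` and `EMIT` to maximize exactly this likelihood over a training corpus, which is what made the approach data-driven rather than hand-tuned.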
Siri Ships on iPhone 4S
Apple acquired Siri Inc. (a DARPA CALO spinoff) in 2010 and shipped it as a built-in feature in October 2011. For the first time, a voice assistant was available on a device in hundreds of millions of pockets. The architecture was a cascaded pipeline: audio streamed to Apple's servers, Nuance ASR transcribed it, an NLU module extracted intent and slots, and a dialog manager generated a response via templates or API calls (weather, restaurants, reminders).
Siri's real contribution wasn't technical — it was cultural. It normalized the act of talking to a phone in public. Within three years, Google launched Google Now (2012) and Microsoft launched Cortana (2014), each with the same cascaded architecture but different NLU backends.
Amazon Echo and Alexa
Amazon did something nobody expected: put a voice assistant in a speaker on a kitchen counter. The Echo introduced always-on, far-field voice interaction with a 7-microphone array and the "Alexa" wake word processed on-device by a small neural keyword spotter. Everything after wake word detection went to the cloud.
The Alexa Skills Kit (2015) was equally important — it turned Alexa into a platform. Third-party developers could register intents and slot types, and Alexa would route recognized utterances to their Lambda functions. By 2020, there were 100,000+ skills. The weakness: the rigid intent-slot NLU framework couldn't handle open-ended conversation. Users defaulted to timers, weather, and music.
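The intent-slot model can be sketched as template matching: developers register sample utterances with slot placeholders, and the runtime maps a transcript to the first matching intent. The intent names and phrasings below are invented for illustration, not ASK's actual schema format.

```python
import re

INTENTS = {
    "SetTimer": ["set a timer for {minutes} minutes",
                 "start a {minutes} minute timer"],
    "GetWeather": ["what's the weather in {city}"],
}

def parse(utterance: str):
    """Map a transcript to (intent, slots) via the registered templates."""
    for intent, templates in INTENTS.items():
        for template in templates:
            # Turn "{slot}" placeholders into named capture groups
            pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", template)
            match = re.fullmatch(pattern, utterance.lower().strip())
            if match:
                return intent, match.groupdict()
    return None, {}  # out of grammar: "Sorry, I don't know that one"

print(parse("set a timer for 10 minutes"))  # → ('SetTimer', {'minutes': '10'})
```

The rigidity is visible in the last line: any phrasing outside the registered patterns falls through to `(None, {})`, which is why users retreated to timers, weather, and music.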
Deep Learning Replaces HMMs
Between 2014 and 2020, every component of the voice pipeline was rebuilt with neural networks. Deep Speech (Hannun et al., 2014) at Baidu showed that a single end-to-end RNN trained with CTC could match HMM-GMM systems. Google's Listen, Attend and Spell (Chan et al., 2016) introduced attention-based seq2seq for ASR. For TTS, WaveNet (van den Oord et al., 2016) at DeepMind produced speech that listeners rated far closer to human recordings than any prior system — but it took 90 seconds to generate 1 second of audio.
# The cascaded pipeline, circa 2018
audio → [ASR: CTC/Attention Encoder-Decoder] → text
      → [NLU: BERT intent classifier + slot filler] → structured intent
      → [Dialog Manager: state machine] → response template
      → [TTS: Tacotron 2 + WaveGlow vocoder] → audio
# Total latency: 3-5 seconds
# Four separate models, four separate training pipelines, four separate failure modes
— Hannun, A. et al. (2014). Deep Speech. arXiv:1412.5567.
— van den Oord, A. et al. (2016). WaveNet. arXiv:1609.03499.
— Shen, J. et al. (2018). Tacotron 2. arXiv:1712.05884.
Whisper: ASR Goes Universal
OpenAI released Whisper in September 2022: a transformer encoder-decoder trained on 680,000 hours of web-scraped audio spanning 99 languages. The key insight: throw enough diverse, weakly-supervised data at a simple architecture and you get robustness for free. Whisper handled accents, background noise, code-switching, and domain-specific vocabulary that would have required extensive tuning in previous systems.
Combined with ChatGPT (November 2022), this transformed the voice pipeline. Instead of an NLU module parsing rigid intents, the LLM could handle open-ended conversation, follow-ups, and nuanced requests. The "intelligence" bottleneck was solved — but the latency problem got worse: GPT-3.5 alone took 500ms–1.5s per response.
— Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML.
GPT-4o: Audio In, Audio Out
In May 2024, OpenAI demonstrated GPT-4o ("omni") — a single model that natively accepts and generates audio tokens alongside text. No separate ASR or TTS pipeline. The model processes raw audio spectrograms as input and produces audio tokens that a small vocoder converts to waveforms. Response latency: ~320ms average, comparable to human conversational turn-taking.
The Realtime API (October 2024) exposed this capability via WebSocket, enabling developers to build voice assistants with natural interruption handling, emotional tone variation, and sub-second latency — things that cascaded pipelines could never achieve because information was lost at each text bottleneck.
The paradigm shift
For 60 years, voice assistants were pipelines: audio in, text intermediary, audio out. GPT-4o collapses this to a single model. It can hear tone of voice, detect hesitation, laugh, whisper, and sing — because it never discards the audio information into text. This is the same shift that happened in machine translation when seq2seq replaced the analysis-transfer-generation pipeline. Whether end-to-end models will fully replace cascaded systems in production is still an open question (see below).
The Open-Source Response
The open-source ecosystem moved fast. Kyutai Moshi (2024) demonstrated real-time, full-duplex speech interaction with a 160ms theoretical latency. Sesame CSM (March 2025) achieved voice quality that listeners rated as more natural than GPT-4o in blind tests, using a context-aware speech model trained on dialog-specific data. Pipecat, LiveKit Agents, and Vercel AI SDK provided production-grade frameworks for orchestrating either cascaded or end-to-end pipelines with built-in VAD, interruption handling, and transport layers.
— Défossez, A. et al. (2024). Moshi: a speech-text foundation model. arXiv:2410.00037.
— Sesame (2025). Crossing the Uncanny Valley of Voice.
The throughline: 1962 → 2025
The Cascaded Pipeline (Still the Default)
Despite the end-to-end models, most production voice assistants in 2025 still use a cascaded pipeline — three or four models chained sequentially. The reason is simple: each component can be independently swapped, debugged, and optimized.
Speech-to-Text (ASR)
Convert the user's spoken audio into text. The dominant models in 2025: Whisper large-v3 (OpenAI, open-weight, 99 languages), Universal-1 (AssemblyAI, API-only, best WER on English), Chirp 2 (Google, 100+ languages), and faster-whisper (CTranslate2 optimization, 4x speedup).
Typical latency: 200–800ms depending on model size, hardware, and audio length.
LLM Processing (The Brain)
The transcribed text goes to an LLM for response generation. This is the component that transformed voice assistants from rigid intent-matchers into conversational agents. The LLM handles follow-ups, context, multi-step reasoning, and tool use. For voice, you optimize for time-to-first-token (TTFT) rather than total generation time, because you can start TTS as soon as the first sentence is complete.
Typical TTFT: 150–500ms (GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash).
Text-to-Speech (TTS)
Convert the LLM's text response into audio. Modern neural TTS is nearly indistinguishable from human speech. The key metric is time-to-first-byte — how quickly the first audio chunk is ready to play. Streaming TTS models like ElevenLabs Turbo v2.5, OpenAI tts-1, and Cartesia Sonic can start producing audio within 100–300ms of receiving the first text chunk.
Typical time-to-first-byte: 100–300ms with streaming. Voice cloning adds ~50ms.
The Critical Path
In a naive implementation, these run sequentially: 800ms + 500ms + 300ms = 1.6s minimum before the user hears anything. The key optimization is pipelining: stream LLM tokens into TTS as they arrive. This cuts perceived latency to ASR time + LLM TTFT + TTS first-byte — often under 1 second.
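The arithmetic, using the stage numbers above (the TTFT and first-chunk values are assumed mid-range figures, not measurements):

```python
# Perceived latency is the time until the user hears the first audio
# chunk, not the time until all work is done.
asr_ms, llm_total_ms, tts_total_ms = 800, 500, 300
llm_ttft_ms = 200          # assumed: time to the first sentence boundary
tts_first_chunk_ms = 150   # assumed: streaming TTS first byte

sequential_ms = asr_ms + llm_total_ms + tts_total_ms        # wait for everything
pipelined_ms = asr_ms + llm_ttft_ms + tts_first_chunk_ms    # overlap the rest

print(sequential_ms, pipelined_ms)  # → 1600 1150
```

The remaining LLM and TTS work still happens; it just overlaps with playback of the first sentence, so the user never waits for it.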
Latency Anatomy: Where Every Millisecond Goes
Human conversational turn-taking has a median gap of ~200ms (Stivers et al., 2009). Anything over 1 second feels unnatural. Over 2 seconds, users start repeating themselves or abandon the interaction. Understanding where latency comes from is the first step to eliminating it.
Latency Breakdown: Cascaded Pipeline (Optimized Streaming)
- VAD: detect speech end (padding required to avoid cutoffs)
- Network: client to server (depends on geographic proximity)
- ASR: Whisper large-v3 on an A100: ~300ms for 5s audio
- LLM TTFT: time to first token (model + prompt dependent)
- First sentence: tokens until first sentence boundary (. ! ?)
- TTS first chunk: neural vocoder produces first playable audio
- Playback: client-side buffering before speakers fire
Why Streaming Changes Everything
Without streaming, you wait for the LLM to generate the entire response before sending it to TTS, and wait for TTS to synthesize the entire audio before playing it. With streaming, three things happen concurrently:
- LLM generates token by token, accumulating into a sentence buffer
- When a sentence boundary is detected, that sentence is immediately sent to TTS
- TTS generates audio chunks that play as they arrive — the user hears the first sentence while the LLM is still generating the second
This means the user's perceived latency is only: VAD + Network + ASR + LLM TTFT + time to first sentence boundary + TTS first chunk. In practice, 600ms–1.2s for well-optimized systems.
Code: Streaming Cascaded Pipeline
This is the production pattern: stream LLM tokens into a sentence buffer, flush complete sentences to TTS, and play audio chunks as they arrive. The entire response overlaps generation and playback.
Streaming Voice Pipeline
Python + OpenAI
import asyncio
import time
from openai import AsyncOpenAI
client = AsyncOpenAI()
SENTENCE_ENDINGS = {'.', '!', '?', '\n'}
async def transcribe(audio_path: str) -> str:
"""ASR: audio file → text via Whisper."""
with open(audio_path, "rb") as f:
transcript = await client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="text"
)
return transcript
async def stream_llm_to_tts(
user_text: str,
conversation: list[dict],
on_audio_chunk: callable
):
"""Stream LLM response sentence-by-sentence into TTS.
This is the core optimization: we don't wait for the full
LLM response before starting TTS. Each complete sentence
is sent to TTS immediately, and audio chunks play as they
arrive.
"""
conversation.append({"role": "user", "content": user_text})
# Stream LLM response
stream = await client.chat.completions.create(
model="gpt-4o-mini", # Fast TTFT (~150ms)
messages=conversation,
max_tokens=200, # Keep voice responses concise
stream=True
)
sentence_buffer = ""
full_response = ""
tts_tasks = []
async for chunk in stream:
token = chunk.choices[0].delta.content or ""
sentence_buffer += token
full_response += token
# Flush on sentence boundary
if token and token[-1] in SENTENCE_ENDINGS and len(sentence_buffer.strip()) > 10:
sentence = sentence_buffer.strip()
sentence_buffer = ""
# Fire TTS for this sentence concurrently
task = asyncio.create_task(
speak_sentence(sentence, on_audio_chunk)
)
tts_tasks.append(task)
# Flush remaining buffer
if sentence_buffer.strip():
task = asyncio.create_task(
speak_sentence(sentence_buffer.strip(), on_audio_chunk)
)
tts_tasks.append(task)
# Wait for all TTS tasks to complete
await asyncio.gather(*tts_tasks)
conversation.append({"role": "assistant", "content": full_response})
return full_response
async def speak_sentence(text: str, on_audio_chunk: callable):
    """TTS: text → streaming audio chunks."""
    async with client.audio.speech.with_streaming_response.create(
        model="tts-1",  # Low-latency model
        voice="nova",
        input=text,
        response_format="opus"  # Efficient for streaming
    ) as response:
        async for chunk in response.iter_bytes(chunk_size=4096):
            on_audio_chunk(chunk)  # Play immediately
async def voice_assistant_loop():
"""Main loop with timing instrumentation."""
conversation = [{
"role": "system",
"content": (
"You are a voice assistant. Respond in 1-3 short sentences. "
"Be conversational and concise — the user is listening, not reading."
)
}]
print("Voice assistant ready. Ctrl+C to exit.")
while True:
# In production: record with VAD, save to temp file
audio_path = await record_with_vad() # Your recording function
t0 = time.perf_counter()
user_text = await transcribe(audio_path)
t_asr = time.perf_counter() - t0
if not user_text.strip():
continue
print(f"User: {user_text} [ASR: {t_asr*1000:.0f}ms]")
t1 = time.perf_counter()
first_audio = False
def on_chunk(chunk):
nonlocal first_audio, t1
if not first_audio:
print(f" [First audio: {(time.perf_counter()-t1)*1000:.0f}ms]")
first_audio = True
play_audio_chunk(chunk) # Your playback function
response = await stream_llm_to_tts(
user_text, conversation, on_chunk
)
t_total = time.perf_counter() - t0
print(f"Assistant: {response} [Total: {t_total*1000:.0f}ms]")Code: WebSocket Realtime API
The end-to-end approach: send raw audio in, receive audio out, through a single WebSocket connection. No ASR/TTS orchestration — the model handles everything internally. This is GPT-4o Realtime, the fastest path to sub-second voice interaction.
GPT-4o Realtime via WebSocket
TypeScript / Node.js
import WebSocket from "ws";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
const ws = new WebSocket(url, {
headers: {
Authorization: `Bearer ${OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
});
ws.on("open", () => {
// Configure the session
ws.send(JSON.stringify({
type: "session.update",
session: {
modalities: ["text", "audio"],
instructions: "You are a friendly voice assistant. Be concise.",
voice: "nova",
input_audio_format: "pcm16",
output_audio_format: "pcm16",
input_audio_transcription: { model: "whisper-1" },
turn_detection: {
type: "server_vad", // Server-side voice activity detection
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500, // End of turn after 500ms silence
},
},
}));
});
// Stream microphone audio to the model
function sendAudioChunk(pcm16Buffer: Buffer) {
ws.send(JSON.stringify({
type: "input_audio_buffer.append",
audio: pcm16Buffer.toString("base64"),
}));
}
// Receive events from the model
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "response.audio.delta": {
      // Audio chunk ready — play immediately
      const audioBytes = Buffer.from(event.delta, "base64");
      playAudioChunk(audioBytes); // Your playback function
      break;
    }
case "response.audio_transcript.delta":
// Real-time transcript of what the model is saying
process.stdout.write(event.delta);
break;
case "input_audio_buffer.speech_started":
// User started speaking — model auto-interrupts its response
stopPlayback();
break;
case "input_audio_buffer.speech_stopped":
// User stopped speaking — model will respond
console.log("\n[User finished speaking]");
break;
case "error":
console.error("Error:", event.error);
break;
}
});

Key differences from the cascaded approach:
- No separate ASR call — the model transcribes internally
- Server-side VAD handles turn detection — no client-side silence detection logic
- Natural interruption: if the user speaks while the model is responding, it stops automatically
- Audio in, audio out — the model preserves tone, emotion, and prosody that text discards
- Trade-off: less control over individual components, higher per-minute cost (~$0.06/min audio input, ~$0.24/min audio output)
Cascaded vs End-to-End: The Real Trade-offs
This is the defining architectural decision for voice assistants in 2025. Neither approach is strictly superior — the choice depends on your constraints.
| Dimension | Cascaded (ASR + LLM + TTS) | End-to-End (GPT-4o / Moshi) |
|---|---|---|
| Latency (optimized) | 0.6–1.5s to first audio | 0.3–0.5s to first audio |
| Audio understanding | Text only — tone, hesitation, emotion lost at ASR | Full audio features preserved through the model |
| Voice quality | Depends on TTS choice. ElevenLabs, Cartesia are excellent | Good but less controllable. Improving rapidly |
| Debuggability | High — inspect text at each stage | Low — audio in, audio out is opaque |
| Component swapping | Swap any model independently | Monolithic — take it or leave it |
| Language support | Whisper: 99 languages. TTS: varies (10–30) | GPT-4o: strong on ~20 languages |
| Cost (per minute) | ~$0.01–0.04 (depends on models) | ~$0.12–0.30 (GPT-4o Realtime pricing) |
| Tool use / function calling | Full LLM tool-use support | Supported in Realtime API |
| Interruption handling | Client-side VAD, manual cancellation | Native — model detects and handles it |
Choose Cascaded When
- You need to log/audit the text at each stage (compliance, healthcare)
- You want to swap models without rewriting the pipeline
- Cost matters — cascaded is 3–10x cheaper per minute
- You need niche language support or domain-specific ASR
Choose End-to-End When
- Sub-500ms latency is critical (real-time conversation)
- You need the model to understand tone, emotion, or non-verbal audio cues
- Natural interruption handling matters (call center, tutoring)
- Development speed matters more than cost optimization
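To make the cost gap concrete, a quick back-of-envelope using mid-range per-minute figures from the table above (the article's estimates, not any provider's price sheet):

```python
def monthly_cost(minutes_per_day: float, rate_per_min: float,
                 days: int = 30) -> float:
    """Voice traffic cost for a month at a flat per-minute rate."""
    return minutes_per_day * rate_per_min * days

cascaded = monthly_cost(1000, 0.03)    # mid-range cascaded estimate
end_to_end = monthly_cost(1000, 0.20)  # mid-range Realtime estimate
print(f"${cascaded:,.0f} vs ${end_to_end:,.0f} per month")
```

At 1,000 minutes of traffic a day, the roughly 7x rate difference compounds into thousands of dollars a month, which is why teams route only latency-critical turns through the end-to-end model.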
The industry consensus in 2025: most teams start with cascaded (cheaper, debuggable, more control) and migrate to end-to-end for latency-critical paths. Many production systems use a hybrid approach: end-to-end for the main conversation loop, cascaded for function calls and structured data extraction. Frameworks like Pipecat and LiveKit abstract over both.
— Pipecat Documentation (2024).
— LiveKit Agents Documentation (2024).
The Unsung Component: Voice Activity Detection
Every voice assistant needs to answer one question before anything else: is the user speaking right now? This is Voice Activity Detection (VAD), and it's more consequential than most developers realize. A bad VAD either clips the end of the user's sentence (cutting off critical words) or waits too long after they stop (adding hundreds of milliseconds of dead latency).
Silero VAD (State of the Art, Open Source)
Python / ONNX
import torch
import numpy as np
# Load Silero VAD (runs in <1ms per frame on CPU)
model, utils = torch.hub.load(
"snakers4/silero-vad", "silero_vad",
force_reload=False
)
(get_speech_timestamps, _, _, _, _) = utils
def detect_speech_end(
audio_frames: list[np.ndarray],
sample_rate: int = 16000,
silence_threshold_ms: int = 500
) -> bool:
"""Detect if the user has stopped speaking.
Uses Silero VAD which outperforms WebRTC VAD and
energy-based methods on noisy audio. Runs on CPU
in real-time with negligible overhead.
Returns True if speech followed by silence > threshold.
"""
audio = np.concatenate(audio_frames)
audio_tensor = torch.from_numpy(audio).float()
# Get speech timestamps
timestamps = get_speech_timestamps(
audio_tensor,
model,
sampling_rate=sample_rate,
threshold=0.5, # Speech probability threshold
min_speech_duration_ms=250, # Ignore very short sounds
min_silence_duration_ms=silence_threshold_ms,
)
if not timestamps:
return True # No speech detected
# Check if last speech ended > threshold ago
last_speech_end = timestamps[-1]["end"] / sample_rate
audio_duration = len(audio) / sample_rate
silence_at_end = audio_duration - last_speech_end
    return silence_at_end > (silence_threshold_ms / 1000)
The 500ms Dilemma
Setting silence_duration_ms is a fundamental tension. Too short (200ms): you clip the user mid-pause ("I want to order... [cut off] ...a pizza"). Too long (800ms): you add dead time after every utterance, making the assistant feel sluggish. Most production systems use 400–600ms and also implement an endpointing model that uses linguistic features (did the user finish a sentence?) in addition to silence duration. This is an active research area — Google's end-pointer model and OpenAI's server-side VAD in the Realtime API both use learned models rather than fixed thresholds.
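The hybrid idea can be sketched as a rule combining silence duration with a cheap linguistic cue. Production endpointers learn this mapping; the thresholds and filler list below are illustrative, but they show the shape of the policy:

```python
COMPLETE_ENDINGS = (".", "!", "?")
TRAILING_FILLERS = {"um", "uh", "and", "so", "but"}

def turn_is_over(partial_transcript: str, silence_ms: int) -> bool:
    """Shift the silence threshold based on how finished the text looks."""
    text = partial_transcript.strip().lower()
    words = text.split()
    if words and words[-1].strip(".,") in TRAILING_FILLERS:
        return silence_ms >= 1200   # mid-thought: wait much longer
    if text.endswith(COMPLETE_ENDINGS):
        return silence_ms >= 300    # looks finished: commit early
    return silence_ms >= 600        # default threshold

print(turn_is_over("I want a large pizza.", 350))  # → True
```

The win is asymmetric: a finished-looking sentence commits 300ms sooner, while a trailing "um" buys the user another 600ms without the assistant barging in.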
Wake Word Detection: Always Listening, Barely Computing
"Hey Siri," "Alexa," "OK Google" — wake words let the assistant listen continuously without streaming audio to the cloud. The model runs on-device, typically on a dedicated DSP or neural accelerator, consuming less power than the screen backlight.
The architecture is a tiny keyword-spotting neural network (50K–500K parameters) that classifies fixed-length audio frames as "wake word detected" or "not detected." Apple's "Hey Siri" detector uses a two-pass system: a small always-on detector on the motion coprocessor triggers a larger verification model on the main CPU, keeping false acceptance rate below 1 in 100,000.
"The always-on processor runs a detector with a small memory footprint [...] When it detects the phrase ‘Hey Siri,’ it passes the audio to the main processor, which runs a larger, more accurate detector to verify."
— Apple Machine Learning Journal (2017). Hey Siri: An On-device DNN-powered Voice Trigger.
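The two-pass structure is independent of the specific models. A schematic version, with stand-in scoring functions in place of real detectors:

```python
def cascade(frame, first_stage, verifier,
            first_threshold=0.3, verify_threshold=0.9) -> bool:
    """Cheap detector runs on every frame; only promising frames pay
    for the larger verification model."""
    if first_stage(frame) < first_threshold:
        return False  # the vast majority of frames stop here
    return verifier(frame) >= verify_threshold

# Stand-in scorers for illustration: a real first stage is a tiny
# always-on DNN, and the verifier a larger model on the main CPU.
def cheap(frame):
    return frame["energy"]

def accurate(frame):
    return frame["similarity"]

print(cascade({"energy": 0.8, "similarity": 0.95}, cheap, accurate))  # → True
```

The permissive first threshold keeps false rejections low; the strict second threshold keeps false acceptances rare; and because the expensive model runs on a tiny fraction of frames, average power stays near the cost of the cheap detector alone.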
For custom wake words, Picovoice Porcupine and OpenWakeWord are the leading options. Porcupine runs cross-platform (including Raspberry Pi and microcontrollers) with sub-millisecond latency and customizable wake phrases. OpenWakeWord is fully open-source and supports training custom keywords with as few as 100 positive samples.
Key Papers and Further Reading
The voice assistant field sits at the intersection of speech processing, natural language understanding, and real-time systems. These papers represent the foundational and frontier work.
ASR Foundations
- Rabiner, L. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE, 77(2), 257–286. The definitive HMM reference. 30,000+ citations.
- Chan, W. et al. (2016). Listen, Attend and Spell. ICASSP. Attention-based seq2seq ASR — the encoder-decoder approach Whisper later scaled up.
- Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. The Whisper paper. 680K hours, 99 languages.
TTS Milestones
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. The paper that made neural TTS sound human.
- Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP. Tacotron 2. The standard two-stage pipeline.
- Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111. VALL-E. 3 seconds of reference audio = voice cloning.
End-to-End Spoken Dialog
- Rubenstein, P. et al. (2023). AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv:2306.12925. Google's first audio LLM. Unified speech understanding and generation.
- Défossez, A. et al. (2024). Moshi: a speech-text foundation model for real-time dialogue. arXiv:2410.00037. Kyutai. Open-weight, full-duplex, 160ms theoretical latency.
- Fang, H. et al. (2024). LLaMA-Omni: Simultaneous Speech Interaction with LLMs. arXiv:2409.02427. Open-source alternative demonstrating low-latency speech-to-speech.
Conversational Dynamics
- Stivers, T. et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS, 106(26), 10587–10592. The 200ms turn-taking gap that sets the latency target.
- Apple ML Journal (2017). Hey Siri: An On-device DNN-powered Voice Trigger. The definitive wake-word detection architecture reference.
Key Takeaways
1. Two architectures compete — Cascaded (ASR + LLM + TTS) gives control and debuggability. End-to-end (GPT-4o Realtime, Moshi) gives sub-second latency and audio understanding. Most production systems use a hybrid.
2. Streaming is non-negotiable — Without streaming, you cannot break the 2-second barrier. Pipe LLM tokens into TTS and play audio chunks as they arrive. The user hears the first sentence while the second is still generating.
3. VAD is the hidden latency lever — The time between the user stopping and the system detecting it is pure waste. Silero VAD + learned endpointing can save 200–400ms compared to energy-threshold methods.
4. 200ms is the target — Human turn-taking gaps average 200ms. Current best systems achieve ~320ms (GPT-4o). Closing the remaining gap requires on-device inference, predictive response generation, and better endpointing.
5. The text bottleneck is ending — For 60 years, voice assistants converted speech to text and back. End-to-end models process audio natively, preserving tone, emotion, and prosody. This is the biggest architectural shift since Siri shipped.
Latency Reference (March 2025)
| Component | Model / Service | Latency |
|---|---|---|
| ASR | Whisper large-v3 (API) | 400–800ms |
| | faster-whisper base (GPU) | 80–150ms |
| | Deepgram Nova-2 (streaming) | 100–300ms |
| LLM (TTFT) | GPT-4o | 200–400ms |
| | GPT-4o-mini | 100–200ms |
| | Claude 3.5 Haiku | 150–300ms |
| TTS (first byte) | OpenAI tts-1 | 200–400ms |
| | ElevenLabs Turbo v2.5 | 100–250ms |
| | Cartesia Sonic | 80–150ms |
| Full pipeline | End-to-end (GPT-4o Realtime) | ~320ms average |
| | Optimized cascaded pipeline | 600ms–1.2s to first audio |
Latencies are approximate and vary by network conditions, input length, and server load. Measured from US-East. Add 50–150ms for other regions.