Building Voice Agents with Azure Communication Services Voice Live API and Azure AI Agent Service


🎯 TL;DR: Real-time Voice Agent Implementation

This post walks through building a voice agent that connects traditional phone calls to Azure’s AI services. The system intercepts incoming calls via Azure Communication Services, streams audio in real-time to the Voice Live API, and processes conversations through pre-configured AI agents in Azure AI Studio. The implementation uses FastAPI for webhook handling, WebSocket connections for bidirectional audio streaming, and Azure Managed Identity for authentication (no API keys to manage). The architecture handles multiple concurrent calls on a single Python thread using asyncio.

Implementation details: audio resampling between 16kHz (ACS requirement) and 24kHz (Voice Live requirement), connection resilience for preview services, and production deployment considerations. Full source code and documentation are available here.


Recently, I found myself co-leading an innovation project that pushed me into uncharted territory. The challenge? Developing a voice-based agentic solution with an ambitious goal - routing at least 25% of current contact center calls to AI voice agents. This was bleeding-edge stuff, with both the Azure Voice Live API and Azure AI Agent Service voice agents still in preview at the time of writing.

When you’re working with preview services, documentation is often sparse, and you quickly learn that reverse engineering network calls and maintaining close relationships with product teams becomes part of your daily routine. This blog post shares the practical lessons learned and the working solution we built to integrate these cutting-edge services.

The Innovation Challenge

Building a voice agent system that could handle real customer interactions meant tackling several complex requirements:

  • Real-time voice processing with minimal latency
  • Natural conversation flow without awkward pauses
  • Integration with existing contact center infrastructure
  • Scalability to handle multiple concurrent calls
  • Reliability for production use cases

With both Azure Voice Live API and Azure AI Voice Agent Service in preview, we were essentially building on shifting sands. But that’s what innovation is about - pushing boundaries and finding solutions where documentation doesn’t yet exist.

Understanding the Architecture

Our solution bridges Azure Communication Services (ACS) with Azure AI services to create an intelligent voice agent. Here’s how the pieces fit together:

```mermaid
graph TB
    subgraph "Phone Network"
        PSTN[📞 PSTN Number<br/>+1-555-123-4567]
    end
    subgraph "Azure Communication Services"
        ACS[🔗 ACS Call Automation<br/>Event Grid Webhooks]
        MEDIA[🎵 Media Streaming<br/>WebSocket Audio]
    end
    subgraph "Python FastAPI App"
        API[🐍 FastAPI Server<br/>localhost:49412]
        WS[🔌 WebSocket Handler<br/>Audio Processing]
        HANDLER[⚡ Media Handler<br/>Audio Resampling]
    end
    subgraph "Azure OpenAI"
        VOICE[🤖 Voice Live API<br/>Agent Mode<br/>gpt-4o Realtime]
        AGENT[👤 Pre-configured Agent<br/>Azure AI Studio]
    end
    subgraph "Dev Infrastructure"
        TUNNEL[🚇 Dev Tunnel<br/>Public HTTPS Endpoint]
    end
    PSTN -->|Incoming Call| ACS
    ACS -->|Webhook Events| TUNNEL
    TUNNEL -->|HTTPS| API
    ACS -->|WebSocket Audio| WS
    WS -->|PCM 16kHz| HANDLER
    HANDLER -->|PCM 24kHz| VOICE
    VOICE -->|Agent Processing| AGENT
    AGENT -->|AI Response| VOICE
    VOICE -->|AI Response| HANDLER
    HANDLER -->|PCM 16kHz| WS
    WS -->|Audio Stream| ACS
    ACS -->|Audio| PSTN
    style PSTN fill:#ff9999
    style ACS fill:#87CEEB
    style API fill:#90EE90
    style VOICE fill:#DDA0DD
    style TUNNEL fill:#F0E68C
```

Core Components

  1. Azure Communication Services: Handles the telephony infrastructure, providing phone numbers and call routing
  2. Voice Live API: Enables real-time speech recognition and synthesis over WebSocket streaming
  3. Azure AI Agent Service: Provides the intelligence layer for understanding and responding to customer queries
  4. WebSocket Bridge: Our custom Python application that connects these services

The Flow

When a customer calls, here’s what happens behind the scenes:

```
Customer Call → ACS Phone Number → Webhook to Our Service →
WebSocket Connection → Voice Live API ↔ AI Agent Service →
Real-time Voice Response → Customer
```

Setting Up the Foundation

Let’s walk through the practical implementation. You can find the complete code in my GitHub repository.

Prerequisites

First, you’ll need to set up several Azure services. Here’s what we discovered through trial and error:

```
# Required Azure services
- Azure Communication Services (with phone number provisioning)
- Azure AI Services (Speech Service enabled)
- Azure AI Agent Service (with voice capabilities)
- Azure App Service or Container Instance (for hosting)
```

Environment Configuration

One of the first challenges was figuring out all the required configuration parameters. Here’s what you’ll need:

```bash
# Essential environment variables
# Voice Live auth uses Azure Managed Identity - no OpenAI API keys to manage.
# ACS Call Automation still authenticates with its own connection string.
ACS_CONNECTION_STRING="endpoint=https://your-acs.communication.azure.com/;accesskey=your-key"
AZURE_VOICE_LIVE_ENDPOINT="https://your-aoai.cognitiveservices.azure.com/"
AGENT_ID="your_agent_id_from_azure_ai_studio"
AGENT_PROJECT_NAME="your_project_name"
BASE_URL="https://your-tunnel-url.asse.devtunnels.ms"  # Dev Tunnel URL
```
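
A missing setting is far easier to diagnose at startup than mid-call. A minimal fail-fast guard (the helper name `require_env` is mine, not from the repo):

```python
import os

def require_env(name: str) -> str:
    """Read a required setting, failing fast at startup if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Example: resolve the settings above once, at import time
# ACS_CONNECTION_STRING = require_env("ACS_CONNECTION_STRING")
```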

Building the WebSocket Bridge

The heart of our solution is a Python application that acts as a bridge between ACS and the Voice Live API. This wasn’t documented anywhere - we had to figure it out by analyzing network traffic and experimenting.

Handling Incoming Calls

```python
import asyncio
import logging
import uuid

import websockets
from azure.communication.callautomation import CallAutomationClient
from azure.identity import DefaultAzureCredential
from fastapi import FastAPI, WebSocket

logger = logging.getLogger(__name__)

app = FastAPI()
call_automation_client = CallAutomationClient.from_connection_string(
    ACS_CONNECTION_STRING
)

@app.post("/api/incomingCall")
async def incoming_call(request: dict):
    """Handle incoming call webhook from ACS"""
    try:
        # Parse the incoming call context
        incoming_call_context = request.get("incomingCallContext")

        # Generate an id to correlate callback events for this call
        call_id = str(uuid.uuid4())

        # Answer the call (CALLBACK_URI is the public dev-tunnel BASE_URL)
        call_connection = call_automation_client.answer_call(
            incoming_call_context=incoming_call_context,
            callback_url=f"{CALLBACK_URI}/api/callbacks/{call_id}",
        )

        # Start WebSocket connection to Voice Live API
        asyncio.create_task(
            establish_voice_connection(call_connection.call_connection_id)
        )

        return {"status": "success"}

    except Exception as e:
        logger.error(f"Error handling incoming call: {e}")
        return {"error": str(e)}
```
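
One gotcha the handler above glosses over: before delivering real call events, Event Grid sends a subscription-validation handshake that the webhook must echo back, or the subscription never activates. A minimal sketch of that check, based on Event Grid's standard validation event shape (the function name is mine):

```python
def event_grid_validation_response(event: dict):
    """Return the handshake reply for an Event Grid validation event,
    or None if this is a regular event that should be processed normally."""
    if event.get("eventType") == "Microsoft.EventGrid.SubscriptionValidationEvent":
        return {"validationResponse": event["data"]["validationCode"]}
    return None
```

Call this first in the webhook; only fall through to the incoming-call logic when it returns None.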

Establishing the Voice Connection

This is where things got interesting. The Voice Live API streams real-time audio over a WebSocket connection, but the documentation was minimal. Here’s what we discovered:

```python
import json

async def establish_voice_connection(call_connection_id):
    """Establish WebSocket connection to Voice Live API using Azure Managed Identity"""

    # Get access token using managed identity
    from azure.identity import DefaultAzureCredential
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")

    # Construct the WebSocket URL for Voice Live API
    ws_url = (
        "wss://your-region.cognitiveservices.azure.com/openai/realtime"
        "?api-version=2024-10-01-preview"
    )

    headers = {
        "Authorization": f"Bearer {token.token}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(ws_url, extra_headers=headers) as websocket:
        # Initialize session with Agent ID
        await websocket.send(json.dumps({
            "type": "session.update",
            "session": {
                "agent": {
                    "agent_id": AGENT_ID,
                    "project_name": AGENT_PROJECT_NAME,
                }
            }
        }))

        # Handle bidirectional audio streaming
        await asyncio.gather(
            receive_audio_from_caller(websocket, call_connection_id),
            send_audio_to_caller(websocket, call_connection_id),
        )
```

Integrating with Azure AI Agent Service

The AI Agent Service provides the intelligence for our voice agent. Here’s how we connected it:

Processing Voice Input

```python
import base64

async def process_voice_with_agent(audio_data, session_id):
    """Send audio directly to Voice Live API in Agent Mode"""

    # Auth was already established on the WebSocket via Azure Managed Identity -
    # no API keys needed here.

    # Build the audio input event for the Voice Live API
    audio_event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_data).decode(),
    }

    # Voice Live API handles agent processing automatically
    # when configured with agent_id in session.update
    return audio_event
```
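
The return path mirrors this: the agent's audio comes back as base64-encoded deltas inside JSON events on the same WebSocket. A sketch of unpacking them (the event names follow the OpenAI Realtime protocol that the Voice Live preview mirrors; treat them as assumptions):

```python
import base64
import json

def decode_voice_live_event(message: str):
    """Parse one Voice Live WebSocket message; return (event_type, audio_bytes),
    where audio_bytes is None for non-audio events."""
    event = json.loads(message)
    event_type = event.get("type")
    if event_type == "response.audio.delta":
        return event_type, base64.b64decode(event["delta"])
    return event_type, None
```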

Handling Real-World Challenges

Working with preview services meant encountering numerous undocumented behaviors. Here are some key challenges we solved:

1. Audio Format Compatibility

The Voice Live API expects specific audio formats. We discovered through trial and error:

```python
# Audio configuration that actually works (Voice Live API format)
AUDIO_CONFIG = {
    "format": "pcm16",      # 16-bit PCM for Voice Live API
    "sample_rate": 24000,   # 24kHz required by Voice Live
    "channels": 1,          # Mono
}

# ACS requires 16kHz, so we need resampling
ACS_AUDIO_CONFIG = {
    "format": "pcm16",
    "sample_rate": 16000,   # ACS requirement
    "channels": 1,
}
```
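
Bridging the two configs means resampling every buffer in both directions. As an illustration of the idea, here is a dependency-free linear-interpolation resampler for mono pcm16; production code would typically use a proper polyphase filter (e.g. soxr or scipy) for better audio quality:

```python
import struct

def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Linearly interpolate mono 16-bit little-endian PCM between sample rates."""
    n = len(data) // 2
    if n == 0 or src_rate == dst_rate:
        return data
    samples = struct.unpack(f"<{n}h", data[: n * 2])
    ratio = src_rate / dst_rate
    out_len = int(n * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio          # fractional position in the source signal
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, n - 1)]
        out.append(int(a + (b - a) * frac))
    return struct.pack(f"<{out_len}h", *out)
```

For example, a 20 ms ACS frame (320 samples at 16kHz) becomes 480 samples at 24kHz on the way to Voice Live, and vice versa on the way back.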

2. Latency Optimization

To achieve natural conversation flow, we implemented several optimizations:

```python
import re

# Start voice synthesis before the full response is ready
async def stream_synthesize_speech(text_stream):
    """Synthesize speech sentence-by-sentence for lower latency"""

    buffer = ""
    async for chunk in text_stream:
        buffer += chunk

        # Flush complete sentences; keep any trailing fragment buffered
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:
            await synthesize_and_send(sentence)
        buffer = parts[-1]

    if buffer:
        await synthesize_and_send(buffer)
```

3. Connection Resilience

Preview services can be unstable. We added robust error handling:

```python
async def maintain_connection(websocket, call_id):
    """Maintain WebSocket connection with automatic reconnection"""

    retry_count = 0
    max_retries = 3

    while retry_count < max_retries:
        try:
            await websocket.ping()
            await asyncio.sleep(30)  # Ping every 30 seconds

        except websockets.ConnectionClosed:
            logger.warning(f"Connection lost for call {call_id}")
            retry_count += 1
            await asyncio.sleep(2 ** retry_count)  # Exponential backoff

            # Attempt reconnection
            websocket = await reconnect_websocket(call_id)
```
Deployment Considerations

When deploying this solution, we learned several important lessons:

Container Deployment

We packaged our Python application as a container for easier deployment:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies for audio processing
RUN apt-get update && apt-get install -y \
    libopus0 \
    libopus-dev \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "start.py"]
```

Scaling Considerations

For handling multiple concurrent calls:

  1. Use Azure Container Instances or App Service with autoscaling
  2. Implement connection pooling for WebSocket connections
  3. Monitor memory usage - audio processing can be memory-intensive
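
On the asyncio side, a per-call registry plus a semaphore keeps those concerns in check by bounding concurrent calls and cleaning up finished ones. A minimal sketch (the cap of 50 is an arbitrary placeholder, not a tested limit):

```python
import asyncio

MAX_CONCURRENT_CALLS = 50  # placeholder cap - tune against real memory profiles
call_slots = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
active_calls: dict[str, asyncio.Task] = {}

async def run_call(call_id: str, handler):
    """Run one call's audio pipeline, tracked and bounded by the semaphore."""
    async with call_slots:
        task = asyncio.ensure_future(handler)
        active_calls[call_id] = task
        try:
            await task
        finally:
            # Always drop the registry entry, even if the pipeline raised
            active_calls.pop(call_id, None)
```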

Monitoring and Debugging

Working with preview services means extensive logging is crucial:

```python
import logging
from azure.monitor.opentelemetry import configure_azure_monitor

# Configure Azure Monitor for production debugging
configure_azure_monitor(
    connection_string=APPLICATIONINSIGHTS_CONNECTION_STRING
)

# Log all WebSocket events
logging.getLogger("websockets").setLevel(logging.DEBUG)
```
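
One pattern that pays off with concurrent calls: tag every log line with its call id so interleaved conversations can be untangled in Application Insights. A small sketch using the standard-library `LoggerAdapter` (the adapter class name is mine):

```python
import logging

class CallLoggerAdapter(logging.LoggerAdapter):
    """Prefix every message with the call id for per-call tracing."""
    def process(self, msg, kwargs):
        return f"[call {self.extra['call_id']}] {msg}", kwargs

logger = logging.getLogger("voice_agent")

# One adapter per call, e.g. created in the incoming-call handler:
call_log = CallLoggerAdapter(logger, {"call_id": "3f2a-example"})
call_log.info("WebSocket to Voice Live established")
```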

Lessons Learned

After weeks of development and close collaboration with Azure product teams, here are our key takeaways:

  1. Preview Services Require Patience: Be prepared for undocumented features and changing APIs
  2. Network Analysis is Your Friend: Tools like Wireshark helped us understand the protocol
  3. Build in Resilience: Assume connections will drop and services will be intermittently unavailable
  4. Start Simple: Get basic voice working before adding complex AI interactions
  5. Monitor Everything: You’ll need extensive logging to debug issues in production

Get Started

Ready to build your own voice agent? Check out the complete implementation in my GitHub repository. The repository includes:

  • Complete Python application code
  • Deployment scripts and Docker configuration
  • Environment setup instructions
  • Troubleshooting guide

Remember, innovation often means venturing into undocumented territory. Don’t be afraid to experiment, reverse-engineer, and collaborate with product teams. The future of voice-based AI agents is being written right now, and you can be part of it.

References

Building Voice Agents with Azure Communication Services Voice Live API and Azure AI Agent Service

https://clouddev.blog/Azure/AI/Voice-Live-API/building-voice-agents-with-azure-communication-services-voice-live-api-and-azure-ai-agent-service/

Author: Ricky Gummadi

Posted on 2025-07-08 · Updated on 2025-08-06
