Building Voice Agents with Azure Communication Services Voice Live API and Azure AI Agent Service
🎯 TL;DR: Real-time Voice Agent Implementation
This post walks through building a voice agent that connects traditional phone calls to Azure’s AI services. The system intercepts incoming calls via Azure Communication Services, streams audio in real time to the Voice Live API, and processes conversations through pre-configured AI agents in Azure AI Studio. The implementation uses FastAPI for webhook handling, WebSocket connections for bidirectional audio streaming, and Azure Managed Identity for authentication (no API keys to manage). The architecture handles multiple concurrent calls on a single Python thread using asyncio.
Implementation details covered: audio resampling between 16 kHz (the ACS requirement) and 24 kHz (the Voice Live requirement), connection resilience for preview services, and production deployment considerations. Full source code and documentation are available in the GitHub repository.
Recently, I found myself co-leading an innovation project that pushed me into uncharted territory. The challenge? Developing a voice-based agentic solution with an ambitious goal: routing at least 25% of current contact center calls to AI voice agents. This was bleeding-edge stuff, with both the Azure Voice Live API and Azure AI Agent Service voice agents still in preview at the time of writing.
When you’re working with preview services, documentation is often sparse, and you quickly learn that reverse-engineering network calls and maintaining close relationships with product teams become part of your daily routine. This blog post shares the practical lessons learned and the working solution we built to integrate these cutting-edge services.
The Innovation Challenge
Building a voice agent system that could handle real customer interactions meant tackling several complex requirements:
- Real-time voice processing with minimal latency
- Natural conversation flow without awkward pauses
- Integration with existing contact center infrastructure
- Scalability to handle multiple concurrent calls
- Reliability for production use cases
With both the Azure Voice Live API and Azure AI Agent Service in preview, we were essentially building on shifting sands. But that’s what innovation is about: pushing boundaries and finding solutions where documentation doesn’t yet exist.
Understanding the Architecture
Our solution bridges Azure Communication Services (ACS) with Azure AI services to create an intelligent voice agent. Here’s how the pieces fit together:
graph TB
    subgraph "Phone Network"
        PSTN[📞 PSTN Number<br/>+1-555-123-4567]
    end
    subgraph "Azure Communication Services"
        ACS[🔗 ACS Call Automation<br/>Event Grid Webhooks]
        MEDIA[🎵 Media Streaming<br/>WebSocket Audio]
    end
    subgraph "Python FastAPI App"
        API[🐍 FastAPI Server<br/>localhost:49412]
        WS[🔌 WebSocket Handler<br/>Audio Processing]
        HANDLER[⚡ Media Handler<br/>Audio Resampling]
    end
    subgraph "Azure OpenAI"
        VOICE[🤖 Voice Live API<br/>Agent Mode<br/>gpt-4o Realtime]
        AGENT[👤 Pre-configured Agent<br/>Azure AI Studio]
    end
    subgraph "Dev Infrastructure"
        TUNNEL[🚇 Dev Tunnel<br/>Public HTTPS Endpoint]
    end
    PSTN -->|Incoming Call| ACS
    ACS -->|Webhook Events| TUNNEL
    TUNNEL -->|HTTPS| API
    ACS -->|WebSocket Audio| WS
    WS -->|PCM 16kHz| HANDLER
    HANDLER -->|PCM 24kHz| VOICE
    VOICE -->|Agent Processing| AGENT
    AGENT -->|AI Response| VOICE
    VOICE -->|AI Response| HANDLER
    HANDLER -->|PCM 16kHz| WS
    WS -->|Audio Stream| ACS
    ACS -->|Audio| PSTN
    style PSTN fill:#ff9999
    style ACS fill:#87CEEB
    style API fill:#90EE90
    style VOICE fill:#DDA0DD
    style TUNNEL fill:#F0E68C
Core Components
- Azure Communication Services: Handles the telephony infrastructure, providing phone numbers and call routing
- Voice Live API: Enables real-time speech recognition and synthesis over a WebSocket connection
- Azure AI Agent Service: Provides the intelligence layer for understanding and responding to customer queries
- WebSocket Bridge: Our custom Python application that connects these services
The Flow
When a customer calls, here’s what happens behind the scenes:
Customer Call → ACS Phone Number → Webhook to Our Service →
Setting Up the Foundation
Let’s walk through the practical implementation. You can find the complete code in my GitHub repository.
Prerequisites
First, you’ll need to set up several Azure services. Here’s what we discovered through trial and error:
# Required Azure services
Environment Configuration
One of the first challenges was figuring out all the required configuration parameters. Here’s what you’ll need:
# Essential environment variables (Using Azure Managed Identity - No API Keys!)
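As a rough illustration of validating configuration at startup, here is a minimal sketch. The variable names (`ACS_ENDPOINT`, `VOICE_LIVE_ENDPOINT`, `AGENT_ID`, `CALLBACK_BASE_URL`) are hypothetical placeholders and not the names used in the repository; note that no key or secret appears in the list, since authentication goes through Managed Identity.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# NOTE: hypothetical variable names for illustration only; substitute the
# names your own deployment actually uses.
REQUIRED_VARS = (
    "ACS_ENDPOINT",
    "VOICE_LIVE_ENDPOINT",
    "AGENT_ID",
    "CALLBACK_BASE_URL",
)


@dataclass(frozen=True)
class Settings:
    acs_endpoint: str
    voice_live_endpoint: str
    agent_id: str
    callback_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings and fail fast, listing every missing name."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        acs_endpoint=env["ACS_ENDPOINT"].strip(),
        voice_live_endpoint=env["VOICE_LIVE_ENDPOINT"].strip(),
        agent_id=env["AGENT_ID"].strip(),
        # Drop a trailing slash so callback paths can be appended safely.
        callback_base_url=env["CALLBACK_BASE_URL"].strip().rstrip("/"),
    )
```

Failing at startup with the full list of missing names tends to be easier to debug than discovering a missing value in the middle of a live call.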
Building the WebSocket Bridge
The heart of our solution is a Python application that acts as a bridge between ACS and the Voice Live API. This wasn’t documented anywhere; we had to figure it out by analyzing network traffic and experimenting.
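The bridge idea can be sketched independently of any SDK: two one-way pumps run concurrently, and the bridge shuts down as soon as either side closes. This is an illustrative pattern, not the repository's code; the `Receive`, `Send`, and `Transform` callables are hypothetical stand-ins for the real WebSocket receive/send methods and the audio conversion.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical adapter types: a receive callable returns None when the peer closes.
Receive = Callable[[], Awaitable[Optional[bytes]]]
Send = Callable[[bytes], Awaitable[None]]
Transform = Callable[[bytes], bytes]


async def pump(receive: Receive, send: Send, transform: Transform) -> None:
    """Forward messages one way until the source reports end of stream."""
    while True:
        chunk = await receive()
        if chunk is None:
            return
        await send(transform(chunk))


async def bridge(
    caller_recv: Receive, caller_send: Send,
    ai_recv: Receive, ai_send: Send,
    to_ai: Transform, to_caller: Transform,
) -> None:
    """Run both directions concurrently; stop both when either side ends."""
    tasks = [
        asyncio.create_task(pump(caller_recv, ai_send, to_ai)),
        asyncio.create_task(pump(ai_recv, caller_send, to_caller)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for cancelled tasks to finish so nothing is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any error from the direction that finished
```

Cancelling the surviving direction matters in practice: without it, a hung-up call would leave a task waiting forever on the AI side.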
Handling Incoming Calls
from fastapi import FastAPI, WebSocket
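The webhook side has two jobs: completing the Event Grid subscription handshake and picking out incoming-call events. The sketch below keeps that logic in a plain function so it can be tested without a web server; the function name `handle_event_grid_batch` is hypothetical, and the event type strings and field names reflect the Event Grid schema as I understand it, so check them against the current documentation.

```python
from typing import Any, Optional

VALIDATION_EVENT = "Microsoft.EventGrid.SubscriptionValidationEvent"
INCOMING_CALL_EVENT = "Microsoft.Communication.IncomingCall"


def handle_event_grid_batch(
    events: list[dict[str, Any]],
) -> tuple[Optional[dict[str, str]], list[str]]:
    """Return (validation_response, incoming_call_contexts) for one webhook POST."""
    contexts: list[str] = []
    for event in events:
        event_type = event.get("eventType")
        data = event.get("data") or {}
        if event_type == VALIDATION_EVENT:
            # Event Grid expects the validation code echoed back in the response body.
            return {"validationResponse": data["validationCode"]}, []
        if event_type == INCOMING_CALL_EVENT:
            # The opaque context string is what Call Automation needs to answer the call.
            contexts.append(data["incomingCallContext"])
    return None, contexts
```

In a FastAPI route, you would parse the request body as JSON, return the validation response if one comes back, and otherwise answer each call context using the Call Automation SDK.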
Establishing the Voice Connection
This is where things got interesting. The Voice Live API streams real-time audio over a WebSocket connection, but the documentation was minimal. Here’s what we discovered:
async def establish_voice_connection(call_connection_id):
Integrating with Azure AI Agent Service
The AI Agent Service provides the intelligence for our voice agent. Here’s how we connected it:
Processing Voice Input
async def process_voice_with_agent(audio_data, session_id):
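Much of the work in this step is re-wrapping audio between two JSON envelopes. The sketch below shows that translation under loud assumptions: the field names (`kind`/`audioData`/`data` on the ACS side, `input_audio_buffer.append` and `response.audio.delta` on the Voice Live side) are my understanding of the public preview schemas and may have changed. The `convert` parameter is a hypothetical hook for any sample-rate conversion.

```python
import base64
import json
from typing import Callable, Optional

# ASSUMPTION: message shapes below are based on the preview schemas as I
# understand them; verify every field name against the current docs.
Convert = Callable[[bytes], bytes]


def acs_to_voice_live(raw: str, convert: Convert = lambda b: b) -> Optional[str]:
    """Turn one ACS audio message into an input-audio event, or None to skip it."""
    msg = json.loads(raw)
    if msg.get("kind") != "AudioData":
        return None  # non-audio frames (for example, stream metadata) carry no PCM
    pcm = convert(base64.b64decode(msg["audioData"]["data"]))
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def voice_live_to_acs(raw: str, convert: Convert = lambda b: b) -> Optional[str]:
    """Turn one audio-delta event into an ACS outbound audio message, or None."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        return None  # transcripts, session updates, etc. are handled elsewhere
    pcm = convert(base64.b64decode(event["delta"]))
    return json.dumps({
        "Kind": "AudioData",
        "AudioData": {"Data": base64.b64encode(pcm).decode("ascii")},
        "StopAudio": None,
    })
```

Keeping the translation as pure functions makes it easy to replay captured traffic through them when a preview schema changes.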
Handling Real-World Challenges
Working with preview services meant encountering numerous undocumented behaviors. Here are some key challenges we solved:
1. Audio Format Compatibility
The Voice Live API expects specific audio formats. We discovered through trial and error:
# Audio configuration that actually works (Voice Live API format)
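To make the 16 kHz to 24 kHz conversion concrete, here is a dependency-free sketch using linear interpolation on mono 16-bit little-endian PCM. It is not the resampler used in the project, and the function name `resample_pcm16` is hypothetical. Linear interpolation applies no anti-aliasing filter, so a production system would probably want a proper resampling library, especially for the 24 kHz to 16 kHz direction.

```python
import sys
from array import array


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # the wire format is little-endian
    n_in = len(samples)
    if n_in == 0 or src_rate == dst_rate:
        return pcm
    n_out = n_in * dst_rate // src_rate
    out = array("h", bytes(2 * n_out))  # zero-filled output buffer
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # fractional index into the input
        left = int(pos)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        out[i] = round(samples[left] * (1 - frac) + samples[right] * frac)
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

A 20 ms frame at 16 kHz is 320 samples (640 bytes); after upsampling to 24 kHz it becomes 480 samples (960 bytes), which is a handy sanity check when debugging garbled audio.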
2. Latency Optimization
To achieve natural conversation flow, we implemented several optimizations:
# Start voice synthesis before full response is ready
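One way to start speaking before the full response exists is to cut the streamed text into sentences and hand each one on as soon as it completes. The sketch below is an illustration of that idea and not the project's implementation; the splitting rule is deliberately naive (it would break on decimals such as "3.5"), and `fake_tokens` is a hypothetical stand-in for a streamed model reply.

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "!", "?")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield each sentence as soon as it completes, without waiting for the full reply."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment that has no terminator


async def demo() -> list[str]:
    async def fake_tokens() -> AsyncIterator[str]:
        # Hypothetical token stream standing in for a streamed model reply.
        for token in ["Sure", ".", " Your order", " shipped", "!", " Anything else"]:
            yield token

    return [sentence async for sentence in sentences(fake_tokens())]
```

Running `asyncio.run(demo())` shows the first sentence becoming available after only two tokens, which is the point at which synthesis could begin.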
3. Connection Resilience
Preview services can be unstable. We added robust error handling:
async def maintain_connection(websocket, call_id):
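A common building block for this kind of resilience is reconnecting with exponential backoff and jitter. The following is a generic sketch and not the project's `maintain_connection`; the `connect` callable and the default delays are assumptions you would tune for your own latency budget.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def connect_with_backoff(
    connect: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Call `connect` until it succeeds, sleeping with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect()
        except (OSError, asyncio.TimeoutError):
            # ConnectionError is a subclass of OSError, so dropped sockets land here.
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries when many calls reconnect at once.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("max_attempts must be at least 1")
```

For a live phone call, a small `max_attempts` is probably wise: a caller will not wait through many seconds of silence, so falling back to a human agent may beat a fifth retry.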
Deployment Considerations
When deploying this solution, we learned several important lessons:
Container Deployment
We packaged our Python application as a container for easier deployment:
FROM python:3.11-slim
Scaling Considerations
For handling multiple concurrent calls:
- Use Azure Container Instances or App Service with autoscaling
- Implement connection pooling for WebSocket connections
- Monitor memory usage - audio processing can be memory-intensive
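To illustrate why one process can serve many calls, the toy example below runs ten simulated calls concurrently and records which thread each one ran on. The `handle_call` coroutine is a hypothetical stand-in: the `asyncio.sleep` represents waiting on a WebSocket frame.

```python
import asyncio
import threading


async def handle_call(call_id: str, frames: int) -> tuple[str, int]:
    """Simulate one call: each await hands control back to the event loop."""
    for _ in range(frames):
        await asyncio.sleep(0.01)  # stands in for waiting on a WebSocket frame
    return call_id, threading.get_ident()


async def main() -> list[tuple[str, int]]:
    calls = [handle_call(f"call-{i}", frames=5) for i in range(10)]
    # gather() interleaves all calls on the single event-loop thread.
    return await asyncio.gather(*calls)
```

Every call reports the same thread ID, because the calls interleave at their `await` points. The flip side is that CPU-heavy work such as resampling blocks every call while it runs, so it may be worth moving to `asyncio.to_thread` or a separate worker if profiling shows it dominating.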
Monitoring and Debugging
Working with preview services means extensive logging is crucial:
import logging
Lessons Learned
After weeks of development and close collaboration with Azure product teams, here are our key takeaways:
- Preview Services Require Patience: Be prepared for undocumented features and changing APIs
- Network Analysis is Your Friend: Tools like Wireshark helped us understand the protocol
- Build in Resilience: Assume connections will drop and services will be intermittently unavailable
- Start Simple: Get basic voice working before adding complex AI interactions
- Monitor Everything: You’ll need extensive logging to debug issues in production
Get Started
Ready to build your own voice agent? Check out the complete implementation in my GitHub repository. The repository includes:
- Complete Python application code
- Deployment scripts and Docker configuration
- Environment setup instructions
- Troubleshooting guide
Remember, innovation often means venturing into undocumented territory. Don’t be afraid to experiment, reverse-engineer, and collaborate with product teams. The future of voice-based AI agents is being written right now, and you can be part of it.