Custom Voices in Azure OpenAI Realtime with Azure Speech Services

Building realtime voice-enabled applications with Azure OpenAI’s GPT-4o Realtime model is incredibly powerful, but there’s one significant limitation that can be a deal-breaker for many use cases: you’re stuck with OpenAI’s predefined voices like “sage”, “alloy”, “echo”, “fable”, “onyx”, and “nova”.

What if you’re building a branded customer service bot that needs to match your company’s voice identity? Or developing a therapeutic application for children with autism where the voice quality and tone are crucial for engagement? What if your users need to interrupt the assistant naturally, just like in real human conversations?

In this comprehensive guide, I’ll show you exactly how I solved these challenges by building a hybrid solution that combines the conversational intelligence of GPT-4o Realtime with the voice flexibility of Azure Speech Services. We’ll dive deep into the implementation, covering everything from the initial problem to the complete working solution.

flowchart TD
    A[👤 User speaks] --> B[🎤 Microphone Input]
    B --> C{Barge-in Detection<br/>Audio Level > Threshold?}
    C -->|Yes| D[🛑 Stop Azure Speech]
    C -->|No| E[📡 Stream to GPT-4o Realtime]
    E --> F[🧠 GPT-4o Processing]
    F --> G[📝 Text Response<br/>ContentModalities.Text]
    G --> H[🗣️ Azure Speech Services<br/>Custom/Neural Voice]
    H --> I[🔊 Audio Output]
    D --> E
    I --> J[👂 User hears response]
    J --> A
    style A fill:#e1f5fe
    style D fill:#ffebee
    style G fill:#f3e5f5
    style H fill:#e8f5e8
    style I fill:#fff3e0

The real problem: Why GPT-4o Realtime’s voice limitations matter

When you’re working with Azure OpenAI’s GPT-4o Realtime API, the standard approach involves configuring a RealtimeConversationSession with one of the predefined voices. While these voices are high-quality, they create several significant limitations:

1. Limited voice selection

You’re restricted to just six built-in voices. There’s no access to Azure Speech Services’ extensive catalog of 400+ neural voices across 140+ languages and locales. You can’t use premium voices like Jenny Neural (en-US) or specialized voices optimized for different use cases.

2. No custom neural voices

Perhaps most importantly, you can’t integrate custom neural voices (CNV) that you’ve trained in Azure Speech Studio. This is crucial for:

  • Brand consistency: Companies that have invested in custom voice branding
  • Specialized applications: Healthcare, education, or accessibility apps requiring specific voice characteristics
  • Multilingual scenarios: Custom voices trained on specific accents or dialects

3. No natural interruption (barge-in)

The built-in system doesn’t provide a way for users to naturally interrupt the assistant mid-response. In real conversations, we constantly interrupt each other—it’s natural and expected. Without this capability, your bot feels robotic and frustrating to use.

4. Limited voice control

You can’t dynamically adjust speech rate, pitch, or emphasis using SSML (Speech Synthesis Markup Language) that Azure Speech Services supports.

The solution: Hybrid architecture with Azure Speech Services

The solution I’ve developed bypasses GPT-4o’s built-in text-to-speech entirely and routes the conversation text through Azure Speech Services. Here’s the high-level architecture:

  1. Configure GPT-4o for text-only output: Disable built-in audio synthesis
  2. Stream and capture text responses: Collect the assistant’s text as it streams
  3. Route text to Azure Speech Services: Use any voice from Azure’s catalog or your custom neural voices
  4. Implement intelligent barge-in: Monitor microphone input and stop speech when user starts talking
  5. Seamless audio management: Handle audio playback and interruption smoothly

This approach gives you the best of both worlds: GPT-4o’s intelligent conversation handling with Azure Speech Services’ superior voice options and control.

Deep dive: Implementation walkthrough

Let me walk you through the complete implementation, explaining each component and how they work together.

Project structure and dependencies

First, let’s look at the project structure. The solution consists of several key components:

RealtimeChat/
├── Program.cs                  # Main application logic
├── AppSettings.cs              # Configuration classes
├── Constants.cs                # Application constants
└── Helpers/
    ├── AudioInputHelper.cs     # Microphone input and barge-in detection
    ├── AudioOutputHelper.cs    # Audio playback management
    └── ConsoleHelper.cs        # Console UI utilities

The key NuGet packages you’ll need:

  • Azure.AI.OpenAI - For GPT-4o Realtime API
  • Microsoft.CognitiveServices.Speech - For Azure Speech Services
  • NAudio - For audio input/output handling
  • Microsoft.Extensions.Configuration.Json - For configuration management

Configuration setup

The configuration is designed to be flexible and environment-specific. Here’s the complete AppSettings.cs structure:

public class AppSettings
{
    public AzureOpenAISettings AzureOpenAI { get; set; } = new();
    public AzureSpeechSettings AzureSpeech { get; set; } = new();
    public ConversationSettings Conversation { get; set; } = new();

    // Normalized microphone level (0..1) above which user speech interrupts playback
    public double BargeInThreshold { get; set; }
}

public class AzureOpenAISettings
{
    public string Endpoint { get; set; } = "";
    public string ApiKey { get; set; } = "";
    public string ChatModelName { get; set; } = "";
    public string RealtimeModelName { get; set; } = "";
}

public class AzureSpeechSettings
{
    public string SubscriptionKey { get; set; } = "";
    public string Region { get; set; } = "";
    public string VoiceName { get; set; } = "";
}

public class ConversationSettings
{
    // Built-in OpenAI voice (only used if you fall back to built-in audio)
    public string OpenAIBuiltInVoice { get; set; } = "sage";

    // Server-side voice activity (turn) detection tuning
    public float ServerDetectionThreshold { get; set; } = 0.1f;
    public int ServerSilenceMs { get; set; } = 150;
}

And your appsettings.json:

{
  "AzureOpenAI": {
    "Endpoint": "https://your-openai-resource.openai.azure.com/",
    "ApiKey": "your-openai-api-key",
    "ChatModelName": "gpt-4o",
    "RealtimeModelName": "gpt-4o-realtime-preview"
  },
  "AzureSpeech": {
    "SubscriptionKey": "your-speech-service-key",
    "Region": "australiaeast",
    "VoiceName": "en-US-AnaNeural"
  },
  "Conversation": {
    "OpenAIBuiltInVoice": "sage",
    "ServerDetectionThreshold": 0.1,
    "ServerSilenceMs": 150
  },
  "BargeInThreshold": 0.02
}
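
For completeness, here's a minimal sketch of how this file could be bound to the AppSettings class at startup. It assumes the Microsoft.Extensions.Configuration.Binder package is also referenced for the Get<AppSettings>() call; adapt it to however your host builds configuration.

using Microsoft.Extensions.Configuration;

// Minimal sketch: load appsettings.json and bind it to the AppSettings class above.
// Assumes Microsoft.Extensions.Configuration.Binder is referenced for Get<T>().
IConfigurationRoot configuration = new ConfigurationBuilder()
    .SetBasePath(AppContext.BaseDirectory)
    .AddJsonFile("appsettings.json", optional: false, reloadOnChange: false)
    .Build();

AppSettings appSettings = configuration.Get<AppSettings>()
    ?? throw new InvalidOperationException("Could not bind appsettings.json to AppSettings.");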

The heart of the solution: Program.cs

The main program orchestrates all the components. Let’s break down the key sections:

Service initialization

static (SpeechConfig, AzureOpenAIClient) InitializeServices(AppSettings appSettings)
{
    // Configure Azure Speech Services
    SpeechConfig speechConfig = SpeechConfig.FromSubscription(
        appSettings.AzureSpeech.SubscriptionKey,
        appSettings.AzureSpeech.Region
    );
    speechConfig.SpeechSynthesisVoiceName = appSettings.AzureSpeech.VoiceName;

    // Configure Azure OpenAI client
    var aoaiClient = new AzureOpenAIClient(
        new Uri(appSettings.AzureOpenAI.Endpoint),
        new ApiKeyCredential(appSettings.AzureOpenAI.ApiKey)
    );

    return (speechConfig, aoaiClient);
}

Critical: Text-only configuration

This is the key breakthrough—configuring GPT-4o Realtime to output only text, not audio:

await session.ConfigureSessionAsync(new ConversationSessionOptions()
{
    Voice = new ConversationVoice(appSettings.Conversation.OpenAIBuiltInVoice),
    ContentModalities = ConversationContentModalities.Text, // 🔥 This is crucial!
    Instructions = Constants.MainPrompt,
    InputTranscriptionOptions = new() { Model = "whisper-1" },
    TurnDetectionOptions = ConversationTurnDetectionOptions
        .CreateServerVoiceActivityTurnDetectionOptions(
            detectionThreshold: appSettings.Conversation.ServerDetectionThreshold,
            silenceDuration: TimeSpan.FromMilliseconds(appSettings.Conversation.ServerSilenceMs)
        ),
});

By setting ContentModalities = ConversationContentModalities.Text, we tell GPT-4o to only send us text responses, not audio bytes. This is what allows us to route the text through Azure Speech Services instead.

Advanced barge-in implementation

The barge-in feature is implemented in AudioInputHelper.cs and is one of the most sophisticated parts of the solution. Here’s how it works:

Real-time amplitude monitoring

private bool IsSpeechAboveThreshold(byte[] buffer, int length, double threshold)
{
    double sum = 0.0;
    int sampleCount = length / 2; // 16-bit samples

    for (int i = 0; i < length; i += 2)
    {
        short sample = BitConverter.ToInt16(buffer, i);
        sum += sample * (double)sample;
    }

    // Calculate RMS (Root Mean Square) of the audio
    double rms = Math.Sqrt(sum / sampleCount);

    // Normalize to [0..1] range
    double normalized = rms / 32768.0;

    // Compare to threshold
    return normalized > threshold;
}

Smart barge-in event handling

_waveInEvent.DataAvailable += (_, e) =>
{
    // 1. Always copy to ring buffer for GPT-4o input
    lock (_bufferLock)
    {
        // ... buffer management code ...
    }

    // 2. Check for user speech (barge-in detection)
    if (IsSpeechAboveThreshold(e.Buffer, e.BytesRecorded, _bargeInThreshold))
    {
        var now = DateTime.UtcNow;

        // Prevent event spam with cooldown period
        if ((now - _lastSpeechDetected).TotalMilliseconds > 500)
        {
            _lastSpeechDetected = now;
            UserSpeechDetected?.Invoke(); // Trigger barge-in!
        }
    }
};

Barge-in event wiring

In the main session handler, we wire up the barge-in detection:

static void HandleSessionStartedUpdate(RealtimeConversationSession session, AppSettings appSettings, SpeechSynthesizer? currentSynthesizer)
{
    _ = Task.Run(async () =>
    {
        using AudioInputHelper audioInputHelper = AudioInputHelper.Start(appSettings.BargeInThreshold);

        audioInputHelper.UserSpeechDetected += () =>
        {
            ConsoleHelper.DisplayMessage("<<< USER INTERRUPTION DETECTED! Stopping speech...", true);

            if (currentSynthesizer != null)
            {
                currentSynthesizer.StopSpeakingAsync().Wait(); // Stop immediately!
            }
        };

        await session.SendInputAudioAsync(audioInputHelper);
    });
}

Streaming text processing and Azure Speech integration

The magic happens in how we handle the streaming response from GPT-4o and route it to Azure Speech Services:

Collecting streaming text

static void HandleStreamingPartDeltaUpdate(
    ConversationItemStreamingPartDeltaUpdate deltaUpdate,
    Dictionary<string, StringBuilder> partialTextByItemId,
    AudioOutputHelper audioOutputHelper)
{
    string chunk = deltaUpdate.Text ?? deltaUpdate.AudioTranscript;

    if (!string.IsNullOrWhiteSpace(chunk))
    {
        if (!partialTextByItemId.ContainsKey(deltaUpdate.ItemId))
        {
            partialTextByItemId[deltaUpdate.ItemId] = new StringBuilder();
        }
        partialTextByItemId[deltaUpdate.ItemId].Append(chunk);
    }

    // NOTE: We completely ignore deltaUpdate.AudioBytes since we're using Azure Speech
    // Uncomment the next line if you want to fall back to built-in voice:
    // audioOutputHelper.EnqueueForPlayback(deltaUpdate.AudioBytes);
}

Converting text to speech with Azure Speech Services

static async Task HandleStreamingFinishedUpdate(
    ConversationItemStreamingFinishedUpdate itemFinishedUpdate,
    Dictionary<string, StringBuilder> partialTextByItemId,
    SpeechConfig speechConfig,
    SpeechSynthesizer? currentSynthesizer)
{
    if (partialTextByItemId.TryGetValue(itemFinishedUpdate.ItemId, out var sb))
    {
        string finalAssistantText = sb.ToString();
        ConsoleHelper.DisplayMessage($"Assistant: {finalAssistantText}", true);

        // Route to Azure Speech Services
        await SpeakWithAzureSpeechAsync(finalAssistantText, speechConfig, currentSynthesizer);

        partialTextByItemId.Remove(itemFinishedUpdate.ItemId);
    }
}

static async Task<SpeechSynthesizer?> SpeakWithAzureSpeechAsync(
    string text,
    SpeechConfig speechConfig,
    SpeechSynthesizer? synthesizer)
{
    if (string.IsNullOrWhiteSpace(text)) return synthesizer;

    // Stop any current speech before starting new
    if (synthesizer != null)
    {
        await synthesizer.StopSpeakingAsync();
    }

    // Create the synthesizer on first use so speechConfig (voice, region) is applied
    synthesizer ??= new SpeechSynthesizer(speechConfig);

    // Synthesize with Azure Speech Services
    var result = await synthesizer.SpeakTextAsync(text);

    if (result.Reason == ResultReason.SynthesizingAudioCompleted)
    {
        Console.WriteLine("✅ Speech synthesis completed successfully");
    }
    else if (result.Reason == ResultReason.Canceled)
    {
        var cancellation = SpeechSynthesisCancellationDetails.FromResult(result);
        Console.WriteLine($"❌ Speech canceled: {cancellation.Reason}, {cancellation.ErrorDetails}");
    }

    return synthesizer;
}
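
To tie these handlers together, the main program runs a single update loop over the realtime session. The sketch below follows the Azure OpenAI realtime preview SDK pattern (GetRealtimeConversationClient, StartConversationSessionAsync, ReceiveUpdatesAsync); exact names can shift between preview versions, and AudioOutputHelper comes from the project's Helpers folder, so treat this as a guide rather than the definitive wiring:

// Sketch of the main update loop (preview SDK surface; names may vary between versions).
// Assumes speechConfig, aoaiClient, and appSettings come from InitializeServices above.
var realtimeClient = aoaiClient.GetRealtimeConversationClient(appSettings.AzureOpenAI.RealtimeModelName);
using RealtimeConversationSession session = await realtimeClient.StartConversationSessionAsync();

// The ConfigureSessionAsync(...) call from the "Text-only configuration" snippet goes here.

var partialTextByItemId = new Dictionary<string, StringBuilder>();
SpeechSynthesizer? currentSynthesizer = new SpeechSynthesizer(speechConfig);
using var audioOutputHelper = new AudioOutputHelper(); // from Helpers/; constructor assumed parameterless

await foreach (ConversationUpdate update in session.ReceiveUpdatesAsync())
{
    switch (update)
    {
        case ConversationSessionStartedUpdate:
            // Start streaming microphone audio and wire up barge-in
            HandleSessionStartedUpdate(session, appSettings, currentSynthesizer);
            break;

        case ConversationItemStreamingPartDeltaUpdate delta:
            HandleStreamingPartDeltaUpdate(delta, partialTextByItemId, audioOutputHelper);
            break;

        case ConversationItemStreamingFinishedUpdate finished:
            await HandleStreamingFinishedUpdate(finished, partialTextByItemId, speechConfig, currentSynthesizer);
            break;

        case ConversationInputSpeechStartedUpdate speechStarted:
            await HandleSpeechStartedUpdate(speechStarted, currentSynthesizer);
            break;

        case ConversationErrorUpdate error:
            HandleErrorUpdate(error);
            break;
    }
}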

Advanced scenarios and customization

Using custom neural voices

To use a custom neural voice you've trained in Azure Speech Studio, update your configuration (and see the note on the deployment endpoint ID after the snippet):

{
  "AzureSpeech": {
    "VoiceName": "YourCustomVoiceName",  // Your custom voice name
    "Region": "eastus",                  // Region where your CNV is deployed
    "SubscriptionKey": "your-key"
  }
}
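
One detail the JSON alone doesn't capture: custom neural voice deployments typically also need the deployment's endpoint ID set on SpeechConfig, not just the voice name. Here's a hedged sketch of the extra line in InitializeServices; the endpoint ID value is a placeholder you'd copy from Speech Studio, and you could add an EndpointId field to AzureSpeechSettings to keep it in configuration:

// Sketch: pointing SpeechConfig at a Custom Neural Voice deployment.
// The endpoint ID comes from your CNV deployment in Speech Studio; the value below is a placeholder.
SpeechConfig speechConfig = SpeechConfig.FromSubscription(
    appSettings.AzureSpeech.SubscriptionKey,
    appSettings.AzureSpeech.Region
);
speechConfig.SpeechSynthesisVoiceName = "YourCustomVoiceName";   // your CNV voice name
speechConfig.EndpointId = "your-cnv-deployment-endpoint-id";     // your CNV deployment endpoint ID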

SSML support for advanced voice control

You can enhance the speech synthesis with SSML for better control:

static async Task SpeakWithSSMLAsync(string text, SpeechConfig speechConfig, SpeechSynthesizer synthesizer)
{
    string ssml = $@"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='{speechConfig.SpeechSynthesisVoiceName}'>
    <prosody rate='medium' pitch='medium'>
      {System.Security.SecurityElement.Escape(text)}
    </prosody>
  </voice>
</speak>";

    await synthesizer.SpeakSsmlAsync(ssml);
}

Fine-tuning barge-in sensitivity

The barge-in threshold is crucial for a good user experience. Set it too low and background noise triggers false interruptions; set it too high and users can't interrupt naturally. A simple calibration sketch follows the suggested values below:

{
  "BargeInThreshold": 0.02  // Start here and adjust based on your environment
}

Values to try:

  • 0.01: Very sensitive (good for quiet environments)
  • 0.02: Balanced (recommended starting point)
  • 0.05: Less sensitive (noisy environments)
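
If you'd rather derive a starting value from your actual environment than guess, one option is to sample a couple of seconds of ambient audio at startup and pick a threshold a little above the measured noise floor. Here's a rough sketch using NAudio; the 2.5x multiplier and two-second window are assumptions to tune, not magic numbers:

using NAudio.Wave;

// Rough calibration sketch: measure the ambient noise floor for ~2 seconds
// and suggest a barge-in threshold slightly above it.
static double MeasureAmbientNoiseFloor(int seconds = 2)
{
    double maxNormalizedRms = 0.0;

    using var waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) };
    waveIn.DataAvailable += (_, e) =>
    {
        double sum = 0.0;
        int sampleCount = e.BytesRecorded / 2; // 16-bit samples
        for (int i = 0; i < e.BytesRecorded; i += 2)
        {
            short sample = BitConverter.ToInt16(e.Buffer, i);
            sum += sample * (double)sample;
        }
        double rms = Math.Sqrt(sum / Math.Max(1, sampleCount)) / 32768.0; // normalize to [0..1]
        maxNormalizedRms = Math.Max(maxNormalizedRms, rms);
    };

    waveIn.StartRecording();
    Thread.Sleep(TimeSpan.FromSeconds(seconds));
    waveIn.StopRecording();

    return maxNormalizedRms;
}

// Usage: pick a threshold roughly 2-3x the noise floor (the multiplier is a starting assumption).
double noiseFloor = MeasureAmbientNoiseFloor();
double suggestedThreshold = Math.Max(0.01, noiseFloor * 2.5);
Console.WriteLine($"Ambient noise floor: {noiseFloor:F4}, suggested BargeInThreshold: {suggestedThreshold:F4}");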

Error handling and resilience

The solution includes comprehensive error handling:

static void HandleErrorUpdate(ConversationErrorUpdate errorUpdate)
{
    ConsoleHelper.DisplayError($"❌ GPT-4o Error: {errorUpdate.Message}", true);

    // Log full error details for debugging
    ConsoleHelper.DisplayError($"Full error details: {errorUpdate.GetRawContent()}", false);

    // Could implement retry logic here
}

static async Task HandleSpeechStartedUpdate(
    ConversationInputSpeechStartedUpdate speechStartedUpdate,
    SpeechSynthesizer? currentSynthesizer)
{
    ConsoleHelper.DisplayMessage($"🎤 Speech detected @ {speechStartedUpdate.AudioStartTime}", true);

    // Always stop current speech when user starts talking
    if (currentSynthesizer != null)
    {
        try
        {
            await currentSynthesizer.StopSpeakingAsync();
        }
        catch (Exception ex)
        {
            ConsoleHelper.DisplayError($"Error stopping speech: {ex.Message}", false);
        }
    }
}

Performance considerations and optimization

Latency optimization

The hybrid approach adds minimal latency (a quick way to measure this in your own setup is sketched after the list):

  • GPT-4o streaming: Near real-time text streaming
  • Azure Speech synthesis: 100-300ms for typical responses
  • Barge-in detection: <50ms response time
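
Those figures depend on region, voice, and network, so it's worth checking synthesis latency yourself. A minimal sketch with Stopwatch, assuming the same SpeechSynthesizer instance used elsewhere; note that SpeakTextAsync completes after playback to the default speaker, so this captures synthesis plus playback rather than time-to-first-byte:

using System.Diagnostics;

// Minimal sketch: time one representative synthesis call.
// SpeakTextAsync returns after playback finishes, so this measures synthesis + playback.
var stopwatch = Stopwatch.StartNew();
SpeechSynthesisResult result = await synthesizer.SpeakTextAsync("This is a latency test sentence.");
stopwatch.Stop();

Console.WriteLine($"Result: {result.Reason}, elapsed: {stopwatch.ElapsedMilliseconds} ms");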

Memory management

The ring buffer implementation efficiently manages audio data:

// ~10 seconds buffer to handle network variations
private readonly byte[] _buffer = new byte[BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * CHANNELS * 10];
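
The snippet above only shows the buffer itself; to make the idea concrete, here's a hedged sketch of a write with wraparound as it might appear inside the DataAvailable handler. The _writePos field and the WriteToRingBuffer name are illustrative, not the actual AudioInputHelper implementation:

// Illustrative ring-buffer write with wraparound (field and method names are assumptions).
private int _writePos;

private void WriteToRingBuffer(byte[] data, int count)
{
    lock (_bufferLock)
    {
        for (int i = 0; i < count; i++)
        {
            _buffer[_writePos] = data[i];
            _writePos = (_writePos + 1) % _buffer.Length; // wrap back to the start when the end is reached
        }
    }
}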

Concurrent operations

The solution handles multiple concurrent operations smoothly:

  • Microphone input streaming to GPT-4o
  • Real-time text streaming from GPT-4o
  • Audio synthesis and playback via Azure Speech
  • Barge-in detection and response

Deployment and production considerations

Security best practices

  1. API key management: Use Azure Key Vault for production (a minimal sketch follows this list)
  2. Network security: Implement proper firewall rules
  3. Authentication: Add user authentication for production apps
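
As a starting point for the Key Vault recommendation above, here's a minimal sketch using the Azure.Identity and Azure.Security.KeyVault.Secrets packages; the vault URL and secret names are placeholders for your own resources:

using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

// Sketch: pull service keys from Key Vault instead of storing them in appsettings.json.
// Vault URL and secret names are placeholders.
var secretClient = new SecretClient(
    new Uri("https://your-key-vault.vault.azure.net/"),
    new DefaultAzureCredential()
);

appSettings.AzureOpenAI.ApiKey = (await secretClient.GetSecretAsync("AzureOpenAI-ApiKey")).Value.Value;
appSettings.AzureSpeech.SubscriptionKey = (await secretClient.GetSecretAsync("AzureSpeech-Key")).Value.Value;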

Scaling considerations

  1. Connection limits: Both services have concurrent connection limits
  2. Regional deployment: Deploy Speech Services in the same region as OpenAI
  3. Cost optimization: Monitor token usage and synthesis characters

Monitoring and logging

Implement comprehensive logging for production:

// Add structured logging
services.AddLogging(builder =>
{
    builder.AddConsole();
    builder.AddApplicationInsights(); // For production monitoring
});

Conclusion and next steps

This hybrid approach solves the key limitations of GPT-4o Realtime’s built-in voices by providing:

  • Unlimited voice selection: Access to 400+ Azure Speech neural voices
  • Custom neural voice support: Use your own trained voices
  • Natural barge-in capability: Users can interrupt naturally
  • SSML support: Advanced voice control and customization
  • Production-ready architecture: Robust error handling and performance

The complete sample code is available in my custom-voice-sample-code folder, which you can use as a starting point for your own applications.

What’s next?

Consider these enhancements for your implementation:

  1. Multiple voice support: Let users choose their preferred voice
  2. Emotion detection: Adjust voice characteristics based on conversation sentiment
  3. Multi-language support: Dynamically switch languages and voices
  4. Integration with Teams/Bot Framework: Extend to enterprise chat platforms

The combination of GPT-4o’s conversational intelligence with Azure Speech Services’ voice flexibility opens up entirely new possibilities for voice-enabled applications. Whether you’re building customer service bots, educational tools, or therapeutic applications, this approach gives you the control and quality you need for professional deployments.

Author: Ricky Gummadi · Posted on 2025-04-25 · Updated on 2025-07-07