How to find the audio format of the voice selected in SpeechSynthesizer

Question

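In a text-to-speech application written in C#, I use the SpeechSynthesizer class, which has an event named SpeakProgress that fires for every spoken word. For some voices, however, the e.AudioPosition parameter is not synchronized with the output audio stream, and the output wave file plays faster than this position indicates (see this related question).

Anyway, I am trying to find the exact bit rate and other format information for the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem is resolved. However, if I can't find this information in SupportedAudioFormats, I know of no other way to find it. For example, the "Microsoft David Desktop" voice reports no supported formats in its VoiceInfo, but it appears to support a PCM 16000 Hz, 16-bit format.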

var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

if (formats.Count > 0)
{
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    var format = // How can I find it, if the audio hasn't provided it?
    reader.SetOutputToWaveFile(CurAudioFile, format);
}

Answer 1

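Update: This answer has been edited following investigation. Initially I suggested from memory that SupportedAudioFormats likely comes only from (possibly misconfigured) registry data. Investigation has shown that for me, on Windows 7, this is definitely the case, and it is backed up anecdotally on Windows 8.

The problem with SupportedAudioFormats

System.Speech wraps the venerable COM speech API (SAPI). Some voices are 32-bit and others 64-bit, or they may be misconfigured (in a 64-bit machine's registry, HKLM/Software/Microsoft/Speech/Voices vs HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).

I've pointed ILSpy at System.Speech and its VoiceInfo class, and I'm pretty convinced that SupportedAudioFormats comes solely from registry data. You can therefore get zero results when enumerating SupportedAudioFormats if your TTS engine isn't properly registered for your application's platform target (x86, Any CPU, or 64-bit), or if the vendor simply doesn't provide this information in the registry.

Voices may still support different, additional, or fewer formats, because that is up to the speech engine (code) rather than the registry (data). So it can be a shot in the dark. Standard Windows voices are often more consistent in this respect than third-party voices, but they still don't necessarily provide a useful SupportedAudioFormats.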
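To see what each registry view actually contains, something like the following sketch may help. It assumes the voice tokens live under SOFTWARE\Microsoft\Speech\Voices\Tokens in HKLM and that the formats are stored as an AudioFormats value under each token's Attributes key, as described in this answer. The class and method names (VoiceRegistryProbe, Describe, Dump) are made up for illustration, and the code is Windows-only.

```csharp
using System;
using Microsoft.Win32;

public static class VoiceRegistryProbe
{
    // Assumed location of the SAPI voice tokens, relative to HKLM.
    const string TokensPath = @"SOFTWARE\Microsoft\Speech\Voices\Tokens";

    // Formats one report line; a null value means AudioFormats is absent.
    public static string Describe(RegistryView view, string token, object audioFormats)
    {
        return $"{view}: {token}: AudioFormats = {audioFormats ?? "(missing)"}";
    }

    // Lists every voice token visible in the given registry view.
    public static void Dump(RegistryView view)
    {
        using (var hklm = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, view))
        using (var tokens = hklm.OpenSubKey(TokensPath))
        {
            if (tokens == null)
            {
                Console.WriteLine($"{view}: no voice tokens found");
                return;
            }
            foreach (var name in tokens.GetSubKeyNames())
            {
                using (var attrs = tokens.OpenSubKey(name + @"\Attributes"))
                {
                    Console.WriteLine(Describe(view, name, attrs?.GetValue("AudioFormats")));
                }
            }
        }
    }

    public static void Main()
    {
        Dump(RegistryView.Registry64); // native view on 64-bit Windows
        Dump(RegistryView.Registry32); // Wow6432Node view on 64-bit Windows
    }
}
```

A voice that appears in only one view, or that shows "(missing)", would be consistent with an empty SupportedAudioFormats for a process running with the matching platform target.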

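Finding this information the hard way

I've found that it is still possible to get the current format of the current voice, but this relies on reflection to access the internals of the System.Speech SAPI wrappers.

Consequently, this is quite fragile code! I wouldn't recommend using it in production.

Note: The code below requires you to have called Speak() once for setup; without Speak(), more calls would be needed to force the setup. However, I can call Speak("") to say nothing, and that works just fine.

Implementation: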

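using System.Diagnostics;
using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

// Pack = 1 matches the native 18-byte WAVEFORMATEX layout (no trailing padding).
[StructLayout(LayoutKind.Sequential, Pack = 1)]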
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
                                    .GetValue(synthesizer, null);

    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
                                 .Invoke(voiceSynthesis, new object[] { false });

    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
                                     .GetValue(ttsVoice);

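    // Pin the managed byte array so it can be read as a native struct.
    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    try
    {
        return (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    }
    finally
    {
        pin.Free(); // always unpin, even if marshaling throws
    }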
}

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");

To test this, I renamed Microsoft Anna's AudioFormats registry entry under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, which caused SpeechSynthesizer.Voice.SupportedAudioFormats to have no elements when queried. Below is the output in this situation:

0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16 audio
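If pinning and marshaling feel heavy, the same byte array can be decoded by hand. The sketch below assumes the array holds a little-endian WAVEFORMATEX (or at least its 16-byte PCM prefix); the WaveFormat struct and WaveFormatDecoder class are illustrative names, not part of System.Speech. It also checks that the derived PCM fields agree with each other.

```csharp
using System;

public struct WaveFormat
{
    public ushort FormatTag, Channels, BlockAlign, BitsPerSample;
    public uint SamplesPerSec, AvgBytesPerSec;
}

public static class WaveFormatDecoder
{
    static ushort U16(byte[] b, int o) => (ushort)(b[o] | b[o + 1] << 8);

    static uint U32(byte[] b, int o) =>
        (uint)b[o] | (uint)b[o + 1] << 8 | (uint)b[o + 2] << 16 | (uint)b[o + 3] << 24;

    // Decodes the fixed fields at their WAVEFORMATEX byte offsets.
    public static WaveFormat Decode(byte[] raw)
    {
        if (raw == null || raw.Length < 16)
            throw new ArgumentException("Expected at least 16 bytes of wave format data.", nameof(raw));
        return new WaveFormat
        {
            FormatTag = U16(raw, 0),
            Channels = U16(raw, 2),
            SamplesPerSec = U32(raw, 4),
            AvgBytesPerSec = U32(raw, 8),
            BlockAlign = U16(raw, 12),
            BitsPerSample = U16(raw, 14),
        };
    }

    // For PCM: blockAlign = channels * bits / 8, avgBytesPerSec = rate * blockAlign.
    public static bool IsConsistentPcm(WaveFormat f) =>
        f.BlockAlign == f.Channels * f.BitsPerSample / 8 &&
        f.AvgBytesPerSec == f.SamplesPerSec * f.BlockAlign;

    public static void Main()
    {
        // Hand-built sample: PCM, 1 channel, 16000 Hz, 32000 bytes/s, block align 2, 16 bits.
        var raw = new byte[] { 1, 0, 1, 0, 0x80, 0x3E, 0, 0, 0x00, 0x7D, 0, 0, 2, 0, 16, 0, 0, 0 };
        var f = Decode(raw);
        Console.WriteLine($"{f.Channels} channel {f.SamplesPerSec} Hz {f.BitsPerSample} bit, consistent: {IsConsistentPcm(f)}");
    }
}
```

Once the sample rate, bit depth, and channel count are known, they can presumably be passed to the SpeechAudioFormatInfo(int, AudioBitsPerSample, AudioChannel) constructor and then to SetOutputToWaveFile, which covers the else branch in the question.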

Answer 2

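You can't get this information from code. You can only listen to all the formats (from a poor format such as 8 kHz up to a high-quality format such as 48 kHz) and observe where it stops getting better, which is what you did, I think.

Internally, the speech engine "asks" the voice for its original audio format only once. I believe this value is used only inside the speech engine, and the speech engine does not expose it in any way.

For further information:

Let's say you are a voice company. You have recorded your computer voice at 16 kHz, 16-bit, mono.

The user can have your voice speak at 48 kHz, 32-bit, stereo. The speech engine does this conversion. The speech engine does not care whether it really sounds better; it simply does the format conversion.

Let's say the user wants your voice to speak something. He requests that the file be saved as 48 kHz, 16-bit, stereo.

SAPI / System.Speech calls your voice with this method: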

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    //Here we need to return which format our audio data will be that we pass to the speech engine.
    //Our format (16 kHz, 16 bit, mono) will be converted to the format that the user requested. This will be done by the SAPI engine.

    //Here you tell the speech engine which format the data that you pass back will have.
    //This way the engine knows whether it should upsample or downsample your voice data to match the format that the user requested.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

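This is the only place where you have to "reveal" the format in which your voice was recorded.

All the "available formats" rather tell you which conversions your sound card / Windows can do.

I hope I explained it well? As a voice vendor, you don't support any formats. You just tell the speech engine what format your audio data is in so that it can do the further conversions.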
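To illustrate why a higher output rate does not add quality, here is a deliberately naive sketch of upsampling by sample repetition. This is not how SAPI's converter is implemented (that is not documented in this answer); NaiveResampler and its methods are hypothetical names used only to show that the converted stream carries the same content at a higher rate.

```csharp
using System;

public static class NaiveResampler
{
    // Repeats every input sample `factor` times (sample-and-hold upsampling).
    public static short[] Upsample(short[] input, int factor)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (factor < 1) throw new ArgumentOutOfRangeException(nameof(factor));
        var output = new short[input.Length * factor];
        for (int i = 0; i < output.Length; i++)
            output[i] = input[i / factor];
        return output;
    }

    // Playback duration of a mono stream at the given sample rate.
    public static double DurationSeconds(int sampleCount, int sampleRate) =>
        (double)sampleCount / sampleRate;

    public static void Main()
    {
        var recorded = new short[16000];              // one second at 16 kHz, mono
        var converted = Upsample(recorded, 3);        // 16 kHz -> 48 kHz
        Console.WriteLine(converted.Length);          // three times as many samples
        Console.WriteLine(DurationSeconds(recorded.Length, 16000));
        Console.WriteLine(DurationSeconds(converted.Length, 48000));
    }
}
```

Both durations come out the same: the file gets three times larger, but every new sample is derived from the original 16 kHz recording.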
