I need to implement continuous, real-time speech-to-text that can use WebRTC as the audio source. I would love to use the speech_recognition library (here), because its wonderful .listen() method works perfectly as a VAD, and it also makes it easy to create a wav file that I can later feed to the STT model of my choice. In short, I want to structure it like this (credit to Ashutosh Dongare):
import os
import re
from glob import glob

import speech_recognition as sr

audio_input_file = "micAudioInput.wav"
audio_output_file = "ttsAudioOutput.wav"

while True:
    try:
        with sr.Microphone() as SR_AudioSource:
            print("Say something...")
            mic_audio = r_audio.listen(SR_AudioSource, 1, 4)  # timeout=1, phrase_time_limit=4
    except Exception:  # required if there is no audio input or some other error
        print("Could not capture mic input or no audio...")
        continue  # continue with the next mic input loop if there was any error

    with open(audio_input_file, "wb") as file:
        file.write(mic_audio.get_wav_data())
        file.flush()

    # Check that the input audio file was saved, otherwise continue listening
    if not os.path.exists(audio_input_file):
        print("no input file exists")
        continue

    # Read the audio file into batches and create the input pipeline for STT
    batches = split_into_batches(glob(audio_input_file), batch_size=10)
    model_input = prepare_model_input(read_batch(batches[0]), device=device)

    # Feed it to the STT model and get the text output
    output = stt_model(model_input)
    you_said = decoder(output[0].cpu())
    print(you_said)

    if you_said == "":
        print("No speech recognized...")
        continue

    # Check whether the user wants to stop
    if re.search("exit", you_said) or re.search("stop", you_said) or re.search("quit", you_said):
        break
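The loop above also relies on names that are not defined here (r_audio, device, stt_model, decoder, and the batching helpers). In the example this is based on, they come from SpeechRecognition plus a Silero STT model loaded through torch.hub; a minimal setup sketch, assuming the snakers4/silero-models hub entry point (check its README for the exact call):

import torch
import speech_recognition as sr

# Recognizer whose .listen() acts as the VAD
r_audio = sr.Recognizer()

# Silero STT model via torch.hub (assumed entry point); it returns the model,
# a decoder and a tuple of helper utilities
device = torch.device('cpu')
stt_model, decoder, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-models',
    model='silero_stt',
    language='en',
    device=device,
)
(read_batch, split_into_batches, read_audio, prepare_model_input) = utils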
It works perfectly locally with a physical microphone. The only problem is that I am fairly sure speech_recognition only accepts the microphone of the device it is running on as an AudioSource (and I know PyAudio only works that way), because the documentation states that the Microphone class

[c]reates a new Microphone instance, which represents a physical microphone on the computer.

Does this mean there is no way to use it with WebRTC as the source? If not, what would be a good substitute for its .listen()? Also: is an infinite loop like this generally viable in the browser, or is it a bad idea? I plan to implement this in Django and handle the WebSocket server with Django Channels.
A bit of a late answer, in case someone stumbles upon this. Yes, SpeechRecognition can be used with WebRTC as the audio source.

Here is how I did it:

- aiortc to implement a custom MediaStreamTrack
- the MediaStreamTrack has a custom SpeechRecognition AudioSource as a member
- an AudioFifo lives inside the AudioSource
- the AudioSource is just a pass-through to the Fifo, with a little logic added

First, here is the code for the custom MediaStreamTrack, inspired by aiortc's VideoTransformTrack example:
import av
from aiortc import MediaStreamTrack


class AudioTransformTrack(MediaStreamTrack):
    kind = "audio"

    def __init__(self, track):
        super().__init__()
        self.track = track
        rate = 16_000  # Whisper has a sample rate of 16000
        audio_format = 's16p'
        sample_width = av.AudioFormat(audio_format).bytes
        self.resampler = av.AudioResampler(format=audio_format, layout='mono', rate=rate)
        self.source = WebRTCSource(sample_rate=rate, sample_width=sample_width)

    async def recv(self):
        # Pull the next frame from the incoming WebRTC track, resample it to
        # 16 kHz mono s16 and push it into the SpeechRecognition source
        out_frame: av.AudioFrame = await self.track.recv()
        out_frames = self.resampler.resample(out_frame)
        for frame in out_frames:
            self.source.stream.write(frame)
        return out_frame
Next is the implementation of the custom AudioSource, inspired by the Microphone class:
import threading

import av
import numpy as np
from speech_recognition import AudioSource


class WebRTCSource(AudioSource):
    def __init__(self, sample_rate=None, chunk_size=1024, sample_width=4):
        # These are the only 4 properties required by the recognizer.listen method
        self.stream = WebRTCSource.MicrophoneStream()
        self.SAMPLE_RATE = sample_rate    # sampling rate in Hertz
        self.CHUNK = chunk_size           # number of frames stored in each buffer
        self.SAMPLE_WIDTH = sample_width  # size of each sample

    class MicrophoneStream(object):
        def __init__(self):
            self.stream = av.AudioFifo()
            self.event = threading.Event()

        def write(self, frame: av.AudioFrame):
            assert type(frame) is av.AudioFrame, "Tried to write something that is not an AudioFrame"
            self.stream.write(frame=frame)
            self.event.set()

        def read(self, size) -> bytes:
            frames: av.AudioFrame = self.stream.read(size)
            # While there is no frame yet, wait until one is written, using an event
            while frames is None:
                self.event.wait()
                self.event.clear()
                frames = self.stream.read(size)
            # Convert the frame to bytes
            data: np.ndarray = frames.to_ndarray()
            return data.tobytes()
When listening to the source, make sure to do it asynchronously. In my case, I started a new thread that does the listening.
import speech_recognition as sr


def listen(source: AudioSource):
    recognizer = sr.Recognizer()
    while True:
        audio = recognizer.listen(source)
        # do something with the audio...


# On receiving a WebRTC track:
t = AudioTransformTrack(relay.subscribe(track))
thread = threading.Thread(target=listen, args=(t.source,))
thread.start()
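For completeness, relay and track above come from the usual aiortc plumbing. A rough sketch of how it can be wired up, assuming the offer/answer signalling is handled elsewhere (for example over a Django Channels WebSocket); note that recv() on the transform track is only driven while something consumes the track, which is why it is added back to the peer connection here:

import threading

from aiortc import RTCPeerConnection
from aiortc.contrib.media import MediaRelay

relay = MediaRelay()
pc = RTCPeerConnection()


@pc.on("track")
def on_track(track):
    if track.kind == "audio":
        # Wrap the incoming browser audio in the transform track
        t = AudioTransformTrack(relay.subscribe(track))
        # recv() is only called while the track is consumed, so send it back
        # (a MediaRecorder or MediaBlackhole would work just as well)
        pc.addTrack(t)
        # Run the blocking recognizer.listen loop off the event loop
        threading.Thread(target=listen, args=(t.source,), daemon=True).start()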