Split speech audio file on words in Python

Question description Votes: 15 Answers: 4

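I feel like this is a fairly common problem, but I haven't found a suitable answer yet. I have many audio files of human speech that I would like to break on words. This can be done heuristically by looking for pauses in the waveform, but can anyone point me to a function or library in Python that does it automatically?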

python audio speech-recognition speech heuristics
4 Answers
21
votes

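An easier way to do this is to use the pydub module. Its recently added silence utilities do all the heavy lifting, such as setting up the silence threshold and the silence length, and they simplify the code significantly compared with the other methods mentioned.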

Here is a demo implementation, adapted from an existing example.

Setup:

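I had an audio file, "a-z.wav", containing the spoken English letters from A to Z. A subdirectory splitAudio was created in the current working directory beforehand. Upon executing the demo code, the file was split into 26 separate files, with each audio file storing one syllable.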

Observation: some of the syllables were cut off, so the following parameters may need adjusting: min_silence_len=500, silence_thresh=-16

You may want to tune these to your own requirements.
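To get a feel for what a threshold such as -16 dBFS means before tuning it, the sketch below converts between an RMS amplitude and dBFS for signed PCM audio. It uses only the standard library; the helper names `rms_to_dbfs` and `dbfs_to_rms` are made up for this illustration and are not part of pydub, and the default of 2 bytes per sample (16-bit audio) is an assumption.

```python
import math


def rms_to_dbfs(rms, sample_width=2):
    """Convert an RMS amplitude to dBFS for signed PCM with `sample_width` bytes per sample."""
    full_scale = 2 ** (8 * sample_width - 1)  # 32768 for 16-bit audio
    if rms <= 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / full_scale)


def dbfs_to_rms(dbfs, sample_width=2):
    """Inverse conversion: the RMS amplitude that corresponds to a dBFS level."""
    full_scale = 2 ** (8 * sample_width - 1)
    return full_scale * 10 ** (dbfs / 20)


if __name__ == "__main__":
    # RMS amplitude below which a 16-bit chunk counts as "silent" at -16 dBFS
    print(dbfs_to_rms(-16))
    # Loudness of a fairly quiet 16-bit signal, for comparison
    print(rms_to_dbfs(1000))
```

On this scale -16 dBFS is a fairly high level, so a quiet recording may fall entirely below it and be treated as silence. One adjustment to consider is choosing the threshold relative to the recording's own loudness, which pydub exposes as the `dBFS` property of an `AudioSegment`.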

Demo code:

from pydub import AudioSegment
from pydub.silence import split_on_silence

sound_file = AudioSegment.from_wav("a-z.wav")
audio_chunks = split_on_silence(sound_file, 
    # must be silent for at least half a second
    min_silence_len=500,

    # consider it silent if quieter than -16 dBFS
    silence_thresh=-16
)

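for i, chunk in enumerate(audio_chunks):
    # the splitAudio subdirectory must already exist
    out_file = ".//splitAudio//chunk{0}.wav".format(i)
    # this form of print works on both Python 2 and Python 3
    print("exporting " + out_file)
    chunk.export(out_file, format="wav")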

Output:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>> 
exporting .//splitAudio//chunk0.wav
exporting .//splitAudio//chunk1.wav
exporting .//splitAudio//chunk2.wav
exporting .//splitAudio//chunk3.wav
exporting .//splitAudio//chunk4.wav
exporting .//splitAudio//chunk5.wav
exporting .//splitAudio//chunk6.wav
exporting .//splitAudio//chunk7.wav
exporting .//splitAudio//chunk8.wav
exporting .//splitAudio//chunk9.wav
exporting .//splitAudio//chunk10.wav
exporting .//splitAudio//chunk11.wav
exporting .//splitAudio//chunk12.wav
exporting .//splitAudio//chunk13.wav
exporting .//splitAudio//chunk14.wav
exporting .//splitAudio//chunk15.wav
exporting .//splitAudio//chunk16.wav
exporting .//splitAudio//chunk17.wav
exporting .//splitAudio//chunk18.wav
exporting .//splitAudio//chunk19.wav
exporting .//splitAudio//chunk20.wav
exporting .//splitAudio//chunk21.wav
exporting .//splitAudio//chunk22.wav
exporting .//splitAudio//chunk23.wav
exporting .//splitAudio//chunk24.wav
exporting .//splitAudio//chunk25.wav
exporting .//splitAudio//chunk26.wav
>>> 

3
votes

Use IBM STT. With timestamps=true you get the word breakdown along with the times at which the system detected each word being spoken.

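There are a lot of other cool features, such as word_alternatives_threshold, which returns other possibilities for each word, and word_confidence, which returns the confidence with which the system predicts the word. Set word_alternatives_threshold to a value between 0.1 and 0.01 to get a real idea.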

This requires signing up, after which you can use the generated username and password.

IBM STT is already part of the speechrecognition module mentioned, but to get the word timestamps you will need to modify the function.

An extracted and modified form looks like this:

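# Imports the function relies on (Python 3 module paths)
import base64
import json
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

import speech_recognition as sr

# Placeholders: replace with the credentials generated when you sign up
IBM_USERNAME = "YOUR_IBM_USERNAME"
IBM_PASSWORD = "YOUR_IBM_PASSWORD"

def extracted_from_sr_recognize_ibm(audio_data, username=IBM_USERNAME, password=IBM_PASSWORD, language="en-US", show_all=False, timestamps=False,
                                word_confidence=False, word_alternatives_threshold=0.1):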
    assert isinstance(username, str), "``username`` must be a string"
    assert isinstance(password, str), "``password`` must be a string"

    flac_data = audio_data.get_flac_data(
        convert_rate=None if audio_data.sample_rate >= 16000 else 16000,  # audio samples should be at least 16 kHz
        convert_width=None if audio_data.sample_width >= 2 else 2  # audio samples should be at least 16-bit
    )
    url = "https://stream-fra.watsonplatform.net/speech-to-text/api/v1/recognize?{}".format(urlencode({
        "profanity_filter": "false",
        "continuous": "true",
        "model": "{}_BroadbandModel".format(language),
        "timestamps": "{}".format(str(timestamps).lower()),
        "word_confidence": "{}".format(str(word_confidence).lower()),
        "word_alternatives_threshold": "{}".format(word_alternatives_threshold)
    }))
    request = Request(url, data=flac_data, headers={
        "Content-Type": "audio/x-flac",
        "X-Watson-Learning-Opt-Out": "true",  # prevent requests from being logged, for improved privacy
    })
    authorization_value = base64.standard_b64encode("{}:{}".format(username, password).encode("utf-8")).decode("utf-8")
    request.add_header("Authorization", "Basic {}".format(authorization_value))

    try:
        response = urlopen(request, timeout=None)
    except HTTPError as e:
        raise sr.RequestError("recognition request failed: {}".format(e.reason))
    except URLError as e:
        raise sr.RequestError("recognition connection failed: {}".format(e.reason))
    response_text = response.read().decode("utf-8")
    result = json.loads(response_text)

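    # return results; with show_all=True the raw JSON is returned, which is
    # where the requested timestamps and word confidences are found
    if show_all:
        return result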
    if "results" not in result or len(result["results"]) < 1 or "alternatives" not in result["results"][0]:
        raise Exception("Unknown Value Exception")

    transcription = []
    for utterance in result["results"]:
        if "alternatives" not in utterance:
            raise Exception("Unknown Value Exception. No Alternatives returned")
        for hypothesis in utterance["alternatives"]:
            if "transcript" in hypothesis:
                transcription.append(hypothesis["transcript"])
    return "\n".join(transcription)
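Since the function above only joins the transcripts unless show_all=True, the word timings have to be read out of the raw result. The sketch below shows one way to do that. It assumes each hypothesis carries a "timestamps" list of [word, start_seconds, end_seconds] entries; the helper name `extract_word_timestamps` and the sample response are hypothetical and hand-written for illustration, not captured from the service.

```python
def extract_word_timestamps(result):
    """Collect (word, start_seconds, end_seconds) from the top hypothesis of each utterance."""
    words = []
    for utterance in result.get("results", []):
        alternatives = utterance.get("alternatives", [])
        if not alternatives:
            continue
        # The first alternative is taken as the best hypothesis (an assumption)
        for word, start, end in alternatives[0].get("timestamps", []):
            words.append((word, float(start), float(end)))
    return words


# Hypothetical response, written by hand to show the assumed shape
sample_result = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "hello world ",
                    "timestamps": [["hello", 0.0, 0.4], ["world", 0.5, 0.9]],
                }
            ]
        }
    ]
}

if __name__ == "__main__":
    for word, start, end in extract_word_timestamps(sample_result):
        print("{0}: {1:.2f}s to {2:.2f}s".format(word, start, end))
```

The (start, end) pairs can then be used as cut points, for example by converting them to milliseconds and slicing the audio.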

2
votes

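You could take a look at Audiolab. It provides a decent API for converting voice samples into numpy arrays. The Audiolab module uses the libsndfile C library to do the heavy lifting.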

You can then parse the arrays, looking for the lower values, to find the pauses.
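The idea of scanning for low values can be sketched as follows. This is a minimal illustration, not Audiolab's API: the function names, the amplitude threshold, and the minimum pause length are all assumptions to be tuned per recording. It works on any sequence of samples, including a one-dimensional numpy array.

```python
def find_pauses(samples, sample_rate, threshold, min_pause_sec=0.3):
    """Return (start, end) index pairs for runs of samples whose absolute
    value stays below `threshold` for at least `min_pause_sec` seconds."""
    min_len = int(min_pause_sec * sample_rate)
    pauses = []
    run_start = None
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i  # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_len:
                pauses.append((run_start, i))
            run_start = None
    # A quiet run may extend to the end of the recording
    if run_start is not None and len(samples) - run_start >= min_len:
        pauses.append((run_start, len(samples)))
    return pauses


def split_at_pauses(samples, pauses):
    """Cut the sample sequence at the midpoint of each pause."""
    cuts = [(start + end) // 2 for start, end in pauses]
    bounds = [0] + cuts + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


if __name__ == "__main__":
    # Toy signal at 10 samples per second: sound, a 0.5 s pause, sound
    toy = [5] * 10 + [0] * 5 + [5] * 10
    pauses = find_pauses(toy, sample_rate=10, threshold=1, min_pause_sec=0.5)
    print(pauses)
    print([len(part) for part in split_at_pauses(toy, pauses)])
```

Thresholding individual samples is crude. Averaging the energy over short frames before comparing against the threshold is usually more robust to noise, at the cost of slightly coarser cut points.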


1
vote

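pyAudioAnalysis can segment an audio file if the words are clearly separated (which is rarely the case in natural speech). The package is relatively easy to use: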

python pyAudioAnalysis/pyAudioAnalysis/audioAnalysis.py silenceRemoval -i SPEECH_AUDIO_FILE_TO_SPLIT.mp3 --smoothing 1.0 --weight 0.3

More details are on the blog.
