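I'm using the code below to extract subtitles from a YouTube video, but it only works for videos in English. I have some videos in Spanish, so how can I modify the code to extract Spanish subtitles as well?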
from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi
# Define the video URL or ID of the YouTube video you want to extract text from
video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'
# Download the video using pytube
youtube = YouTube(video_url)
video = youtube.streams.get_highest_resolution()
video.download()
# Get the downloaded video file path
video_path = video.default_filename
# Get the video ID from the URL
video_id = video_url.split('v=')[-1]
# Get the transcript for the specified video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)
# Extract the text from the transcript
captions_text = ''
for segment in transcript:
    caption = segment['text']
    captions_text += caption + ' '
# Print the extracted text
print(captions_text)
Use list_transcripts to get the list of available languages:
Example:
video_id = 'xYgoNiSo-kY'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
Then loop over the transcript_list variable to see which languages are available:
Example:
for tr in transcript_list:
    print(tr.language_code)
In this case, the result is:
es
Modify the code to loop over the languages available for the video and download the generated captions:
Example:
# Variables for storing the downloaded captions:
all_captions = []
caption = None
captions_text = ''
# Loop over all languages available for this video and download the generated captions:
for tr in transcript_list:
    print("Downloading captions in " + tr.language + "...")
    transcript_obtained_in_language = transcript_list.find_transcript([tr.language_code]).fetch()
    for segment in transcript_obtained_in_language:
        caption = segment['text']
        captions_text += caption + ' '
    all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
    # Reset the accumulators before processing the next language
    caption = None
    captions_text = ''
    print("="*20)
print("Done")
The all_captions variable will then hold the captions and languages obtained for the given video ID.
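The loop above can also be packaged as two small helpers. This is only a sketch: the names segments_to_text and collect_captions are made up for illustration, and it assumes a library version where each fetched segment is a dict with a 'text' key (as in the code above) and where each transcript object exposes language_code, language and fetch().

```python
def segments_to_text(segments):
    """Join the 'text' field of every transcript segment into one string."""
    return " ".join(segment["text"] for segment in segments)


def collect_captions(transcripts):
    """Build one {"language", "captions"} dict per available transcript.

    `transcripts` is any iterable of objects with `language_code`,
    `language` and `fetch()` (assumed to match what list_transcripts yields).
    """
    all_captions = []
    for tr in transcripts:
        all_captions.append({
            "language": tr.language_code + " - " + tr.language,
            "captions": segments_to_text(tr.fetch()),
        })
    return all_captions
```

With the variables from the example above, this would be called as all_captions = collect_captions(transcript_list).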
You can add a languages parameter to the call. Try this:
Change from:
transcript = YouTubeTranscriptApi.get_transcript(video_id)
To:
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['es'])
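The languages list is treated as a priority order, so a fallback language can be listed after 'es'. The sketch below wraps that in a helper. The function name fetch_transcript_text and its fetch parameter are hypothetical (the latter is there only so the helper can be tried without network access), and it assumes a library version that still provides get_transcript and returns segments as dicts.

```python
def fetch_transcript_text(video_id, languages=("es", "en"), fetch=None):
    """Return the transcript text in the first language that is available.

    `fetch` is a hypothetical injection point for testing; when omitted it
    defaults to YouTubeTranscriptApi.get_transcript, as in the answer above.
    """
    if fetch is None:
        # Imported lazily so the helper can be exercised without the library.
        from youtube_transcript_api import YouTubeTranscriptApi
        fetch = YouTubeTranscriptApi.get_transcript
    # Languages are tried in the order given: Spanish first, then English.
    segments = fetch(video_id, languages=list(languages))
    return " ".join(segment["text"] for segment in segments)
```

For the video in the question, this would be called as print(fetch_transcript_text('xYgoNiSo-kY')).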