How to extract subtitles from YouTube videos in different languages

Question  Votes: 0  Answers: 2

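I am using the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish, so how can I modify the code to extract Spanish subtitles?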

from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi

# Define the video URL or ID of the YouTube video you want to extract text from
video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'

# Download the video using pytube
youtube = YouTube(video_url)
video = youtube.streams.get_highest_resolution()
video.download()

# Get the downloaded video file path
video_path = video.default_filename

# Get the video ID from the URL
video_id = video_url.split('v=')[-1]

# Get the transcript for the specified video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Extract the text from the transcript
captions_text = ''
for segment in transcript:
    caption = segment['text']
    captions_text += caption + ' '

# Print the extracted text
print(captions_text)
python web-scraping nlp youtube video-streaming
2 Answers
1
vote

Use list_transcripts to get the list of languages available for the video:

Example:

video_id = 'xYgoNiSo-kY'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

Then loop over the transcript_list variable to see which languages are available:

Example:

for tr in transcript_list:
  print(tr.language_code)

In this case, the result is:

Modify the code to loop over the languages available for the video and download the captions for each one:

Example:

# List for storing the downloaded captions, one entry per language:
all_captions = []

# Loop over all languages available for this video and download the captions:
for tr in transcript_list:
  print("Downloading captions in " + tr.language + "...")
  # Each item in transcript_list is a transcript object that can be fetched directly.
  segments = tr.fetch()
  # Join the text of every segment into a single string.
  captions_text = ' '.join(segment['text'] for segment in segments)
  all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
  print("="*20)
print("Done")

The all_captions variable will then hold the captions, together with their languages, obtained for the given video ID.
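If only some of the languages are needed (Spanish, in the question), the same loop can be narrowed down. The following is a minimal sketch, not part of the original answer: captions_by_language and its wanted parameter are hypothetical names, and it assumes each transcript object exposes language_code and fetch(), with fetch() returning dict segments that have a 'text' key, as in the code above. Stand-in objects replace the real transcript list so the helper can be tried without network access.

```python
from types import SimpleNamespace


def captions_by_language(transcripts, wanted=None):
    """Return {language_code: caption text} for the given transcript objects.

    `wanted` is an optional collection of language codes; when it is None,
    every transcript is fetched.
    """
    result = {}
    for tr in transcripts:
        if wanted is not None and tr.language_code not in wanted:
            continue  # skip languages we did not ask for
        segments = tr.fetch()
        result[tr.language_code] = " ".join(seg["text"] for seg in segments)
    return result


# Offline stand-ins that mimic the two attributes the helper relies on.
fake_transcripts = [
    SimpleNamespace(language_code="en",
                    fetch=lambda: [{"text": "hello"}, {"text": "world"}]),
    SimpleNamespace(language_code="es",
                    fetch=lambda: [{"text": "hola"}, {"text": "mundo"}]),
]

print(captions_by_language(fake_transcripts, wanted={"es"}))
# prints {'es': 'hola mundo'}
```

With the real library, transcript_list from the example above would be passed in place of fake_transcripts.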


0
votes

You can add a languages parameter to the call. Try this:

Change from:

transcript = YouTubeTranscriptApi.get_transcript(video_id)

To:

transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['es'])
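Putting this together with the question's code, a rough sketch might look like the following. The helper names (extract_video_id, segments_to_text, fetch_caption_text) are hypothetical, and the sketch assumes a version of youtube-transcript-api that still provides the static get_transcript used above and returns dict segments. The languages list is treated as a priority order, so ['es', 'en'] tries Spanish first and falls back to English. Note also that the question's video_url.split('v=')[-1] keeps any trailing parameters such as &t=10s in the ID; parsing the query string avoids that.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a watch URL or a youtu.be short URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    # For watch URLs the ID is the value of the "v" query parameter.
    return parse_qs(parsed.query)["v"][0]


def segments_to_text(segments):
    """Join the 'text' field of each transcript segment with spaces."""
    return " ".join(segment["text"] for segment in segments)


def fetch_caption_text(video_url, languages=("es", "en")):
    """Fetch captions in the first available language and return plain text."""
    # Imported here so the helpers above work without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    video_id = extract_video_id(video_url)
    transcript = YouTubeTranscriptApi.get_transcript(
        video_id, languages=list(languages)
    )
    return segments_to_text(transcript)
```

Calling fetch_caption_text('https://www.youtube.com/watch?v=xYgoNiSo-kY') would then return the Spanish captions if the video has them. Downloading the video with pytube is not required for this step.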