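If you go to YouTube and click the transcript button, YouTube provides a transcript for some videos, and you can see the full text of the video on the right. If you download it to a text file, it comes in the format below. I want to separate the timestamps from the text.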
0:12
well good morning everybody thank you for joining us here at the National Shrine of the
Divine Mercy it is
0:18
Vietnamese day and uh we're glad that you could join us uh I have a strong tie
0:24
to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31
time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38
today's topic really is so important and I'm coming from a aspect of a personal
I want a format like this, so that I can put it into an Excel sheet and split off the timestamps as needed.
0:12, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
0:18, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
0:24, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38, today's topic really is so important and I'm coming from a aspect of a personal
The timestamps go up to 1:35:35.
#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";

my $timestamp = '';
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp
        $timestamp = $line;
    } elsif ($line =~ /\S/) {
        # Line is text and is not empty
        print $out "$timestamp, $line\n";
    }
}
close($in);
close($out);
print "Formatting complete. Output written to $output_file.\n";
I wrote the script above, but the output file looks like this, which is not what it should be:
, 0:12
, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
, 0:18
, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
, 0:24
, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
, 0:31
, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
, 0:38
, today's topic really is so important and I'm coming from a aspect of a personal
I also tried this:
sed 's/^\([0-9:]*\) \(.*\)$/\1\:\2/'
Depending on what your actual input is, the solution may be as simple as:
perl -pe'chomp if /\d\s*$/' input.txt > output.txt
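This simply checks whether a line ends with a digit (followed by optional whitespace) and, if so, removes the newline. Every line is then printed. Perl reads the input file, and the shell redirection sends the output to output.txt. Note that chomp removes only the newline, so the timestamp and the text that follows are joined without a separator.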
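You may, however, have something more complicated that you have not shown us. If you print the input file with Data::Dumper's Useqq option, you may discover something new: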
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper <$inputfile>;
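To illustrate what such a dump can reveal, here is a minimal sketch. The helper name show_hidden and the sample strings are made up for the example; they assume the file contains Windows line endings or trailing spaces, which the question does not confirm.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # show control characters as escapes such as \r and \n
$Data::Dumper::Terse = 1;   # print the bare value without the "$VAR1 = " prefix

# Hypothetical helper: return the escaped, double-quoted form of one line.
sub show_hidden {
    my ($line) = @_;
    my $dump = Dumper($line);
    chomp $dump;            # drop the newline that Dumper appends
    return $dump;
}

# Assumed sample lines: one with CRLF, one with a trailing space, one clean.
for my $line ("0:12\r\n", "0:18 \n", "0:24\n") {
    print show_hidden($line), "\n";
}
```

Any \r or stray space that appears before the \n in the dump would explain why a pattern anchored with $ does not match the timestamp lines.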
You could also, for example, make the output more readable by inserting a comma and a space:
perl -pe's/\s*\n/, / if /\d\s*$/'
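As a rough illustration of what that substitution does, the sketch below applies it to a few in-memory lines instead of a file. The sub name join_after_digit and the sample data are assumptions for the example. It also shows a caveat: any line ending in a digit is joined to the next one, not only timestamps.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the one-liner's substitution to each line
# and return the concatenated result.
sub join_after_digit {
    my @lines = @_;          # work on copies, not on the caller's data
    my $out = '';
    for my $line (@lines) {
        # Same logic as: perl -pe 's/\s*\n/, / if /\d\s*$/'
        $line =~ s/\s*\n/, / if $line =~ /\d\s*$/;
        $out .= $line;
    }
    return $out;
}

# A timestamp with a trailing space and a Windows line ending.
print join_after_digit("0:12 \r\n", "well good morning everybody\n");

# A text line that happens to end in a digit is joined as well.
print join_after_digit("we met in 1975\n", "0:18\n", "Vietnamese day\n");
```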
When a line matches \d+:\d+ followed by a newline, just replace the newline with a comma and a space (", "). Then print the possibly modified line.
As a one-liner:
perl -pe's/^\d+:\d+\K\n/, /'
As a script:
while ( <> ) {
    s/^\d+:\d+\K\n/, /;
    print;
}
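If the real file has trailing whitespace, Windows line endings, or timestamps with an hours part such as 1:35:35, the substitution above will not match those lines. A possible variant, offered as a sketch under those assumptions (the sub name join_timestamp and the sample data are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: turn "timestamp<line ending>" into "timestamp, ".
sub join_timestamp {
    my ($line) = @_;
    # ^\d+(?::\d+)+  m:ss or h:mm:ss at the start of the line
    # \K             keep what was matched so far; replace only what follows
    # \s*\z          trailing blanks, an optional \r, and the final \n
    $line =~ s/^\d+(?::\d+)+\K\s*\z/, /;
    return $line;
}

print join_timestamp("1:35:35 \r\n");             # timestamp line, joined
print join_timestamp("that is all for today\n");  # text line, left unchanged
```

Because the rest of the line must be whitespace only, a text line that merely starts with something like 12:30 is left alone.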
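The short analysis is that the test
if ($line =~ /^[0-9:]+$/) {
fails. This may be caused by Windows line endings (if the file has CRLF endings and is read on a Unix-like system, chomp removes the \n but leaves the \r behind), or there may be spaces or tabs after the timestamp. You could choose a simpler pattern
if ($line =~ /[0-9:]+/) {
at the risk that text lines may also match, or a more specific pattern: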
if ($line =~ /^\s*[0-9]+:[0-9]+\s*\r?\n$/) {
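^        start of line
\s*      any amount of whitespace
[0-9]+   at least one digit
:        a colon
[0-9]+   at least one digit
\s*      any amount of whitespace
\r?      a possible Windows line ending
\n       a newline
$        end of line
Note that this pattern expects the trailing newline, so it has to be tested before chomp is called, and it matches m:ss timestamps only, not h:mm:ss ones such as 1:35:35.
The problem seems to be whitespace at the end of the timestamps. Try this: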
#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";

my $result = '';
while (my $line = <$in>) {
    # Trim leading and trailing whitespace, including \r and \n
    $line =~ s/^\s+|\s+$//g;
    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp: flush the previous record, start a new one
        print $out "$result\n" if $result;
        $result = "$line,";
    } elsif ($line !~ /^$/) {
        # Line is text and is not empty: append it to the current record
        $result .= " $line";
    }
}
# Flush the last record, if there is one
print $out "$result\n" if $result;
close($in);
close($out);
print "Formatting complete. Output written to $output_file.\n";
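For experimenting without files, the same accumulate-and-flush idea can be sketched as a function over a list of lines. The sub name format_transcript is hypothetical, and the stricter timestamp pattern (digits separated by colons, so that 1:35:35 also qualifies) is an assumption rather than part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: return one "timestamp, text" record per timestamp.
sub format_transcript {
    my @records;
    for my $line (@_) {
        (my $trimmed = $line) =~ s/^\s+|\s+$//g;   # also strips \r and \n
        next if $trimmed eq '';                    # skip blank lines
        if ($trimmed =~ /^\d+(?::\d+)+$/) {
            push @records, "$trimmed,";            # start a new record
        } elsif (@records) {
            $records[-1] .= " $trimmed";           # append text to the current record
        }
        # Text before the first timestamp is ignored in this sketch.
    }
    return @records;
}

# Assumed sample input with Windows line endings and wrapped text.
my @sample = (
    "0:12 \r\n",
    "well good morning everybody thank you for joining us here at the National Shrine of the\r\n",
    "Divine Mercy it is\r\n",
    "1:35:35\r\n",
    "thank you\r\n",
);
print "$_\n" for format_transcript(@sample);
```

Returning a list instead of printing directly keeps the joining logic separate from file handling, which makes it easier to check against a few sample lines before running it on a full transcript.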