Perl: format YouTube transcript text with commas

Question  Votes: 0  Answers: 4

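If you go to YouTube and click the transcript button, YouTube offers a transcript for some videos, and you can see the full text of the video on the right. If you download it to a text file, it comes in the format below. I want to separate the timestamps from the text.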

0:12 
well good morning everybody thank you for joining us here at the National Shrine of the 
Divine Mercy it is
0:18 
Vietnamese day and uh we're glad that you could join us uh I have a strong tie 
0:24
to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31 
time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38 
today's topic really is so important and I'm coming from a aspect of a personal

I want it in the following format, so that I can put it into an Excel sheet and split off the timestamps as needed.

0:12, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
0:18, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
0:24, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38, today's topic really is so important and I'm coming from a aspect of a personal

The timestamps run up to 1:35:35.

#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";
my $timestamp = '';
while (my $line = <$in>) {
    chomp $line;

    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp
        $timestamp = $line;
    } elsif ($line =~ /\S/) {
        # Line is text and is not empty
        print $out "$timestamp, $line\n";
    }
}
close($in);
close($out);

print "Formatting complete. Output written to $output_file.\n";

I wrote the script above, but the output file looks like this, which is not what it should be:

, 0:12
, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
, 0:18
, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
, 0:24
, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
, 0:31
, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
, 0:38
, today's topic really is so important and I'm coming from a aspect of a personal

I also tried this:

sed 's/^\([0-9:]*\) \(.*\)$/\1\:\2/'
bash perl format
4 Answers
3
votes

Depending on what your actual input is, the solution can be as simple as:

perl -pe'chomp if /\d\s*$/' input.txt > output.txt

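This simply checks whether a line ends with a digit (optionally followed by whitespace) and, if so, removes the newline. Every line is then printed. Perl reads the input file, and the shell redirection sends the output to the new file.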

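Of course, you may have something more complicated that you have not shown us. If you print the input file using Data::Dumper with the Useqq option, you may discover something new: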

use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper <$inputfile>;
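To illustrate what such a dump can reveal, here is a minimal sketch. The helper name show_raw and the sample line (a timestamp followed by a space and a Windows-style CRLF) are assumptions for illustration; they are not taken from the actual file.

```perl
use strict;
use warnings;
use Data::Dumper;

# Return an escaped, double-quoted dump of one line so that
# invisible characters (spaces, \r, \t) become visible.
# show_raw is a hypothetical helper name.
sub show_raw {
    my ($line) = @_;
    local $Data::Dumper::Useqq  = 1;  # escape control characters
    local $Data::Dumper::Terse  = 1;  # omit the "$VAR1 = " prefix
    local $Data::Dumper::Indent = 0;  # keep the dump on one line
    return Dumper($line);
}

# Assumed sample: trailing space plus CRLF line ending.
print show_raw("0:12 \r\n"), "\n";
```

If the dump shows a space or \r before \n, the timestamp line contains more than digits and colons, which would explain why a strict match fails.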

You can also make the output more readable, for example by inserting a comma and a space:

perl -pe's/\s*\n/, / if /\d\s*$/'
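The same idea can be sketched as a small function, which makes it easier to try on sample data. The function name join_timestamps and the sample lines are assumptions for illustration; the two regular expressions are the ones from the one-liner.

```perl
use strict;
use warnings;

# Join each timestamp line with the text line that follows it.
# Mirrors the one-liner: a line ending in a digit (plus optional
# whitespace) has its line ending replaced by ", ".
# join_timestamps is a hypothetical name.
sub join_timestamps {
    my (@lines) = @_;
    my $out = '';
    for my $line (@lines) {
        my $copy = $line;
        $copy =~ s/\s*\n/, / if $copy =~ /\d\s*$/;
        $out .= $copy;
    }
    return $out;
}

# Assumed sample: one timestamp with a trailing space, one with CRLF.
print join_timestamps(
    "0:12 \n", "well good morning everybody\n",
    "0:18\r\n", "Vietnamese day\n",
);
```

One caveat with this test: a text line that happens to end in a digit would also be joined with the line after it.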

1
vote

When a line matches \d+:\d+ followed by a newline, simply replace that newline with a comma and a space (", "). Then print the possibly modified line.

As a one-liner:

perl -pe's/^\d+:\d+\K\n/, /'

As a script:

while ( <> ) {
   s/^\d+:\d+\K\n/, /;
   print;
}
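As a hedged variant of this substitution, the sketch below adds two assumed extensions that are not part of the answer: an optional third field so that timestamps such as 1:35:35 match, and tolerance for trailing whitespace or a carriage return before the newline. The function name join_line is hypothetical, and \K requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Variant of the substitution, with two assumed extensions:
#   (?::\d+)?  allows an optional extra field, e.g. 1:35:35
#   \s*        swallows trailing spaces, tabs, or a \r before \n
# join_line is a hypothetical name.
sub join_line {
    my ($line) = @_;
    $line =~ s/^\d+:\d+(?::\d+)?\K\s*\n/, /;
    return $line;
}

# Assumed sample lines.
print join_line($_) for "1:35:35 \r\n", "so praise be to God\n";
```

Because the pattern is anchored at the start of the line and must reach the line ending, a text line that merely contains a time such as "0:12" is left unchanged.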

0
votes

The short analysis is that the test

if ($line =~ /^[0-9:]+$/) {

fails. This may be caused by Windows line endings, or there may be spaces or tabs after the timestamp. You could use a simpler pattern

if ($line =~ /[0-9:]+/) {

at the risk that text lines might also match, or a more specific pattern:

if ($line =~ /^\s*[0-9]+:[0-9]+\s*\r?\n$/) {
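  • ^
    start of the line
  • \s*
    any amount of whitespace
  • [0-9]+
    at least one digit
  • :
    a colon
  • [0-9]+
    at least one digit
  • \s*
    any amount of whitespace
  • \r?
    an optional Windows carriage return
  • \n
    a newline
  • $
    end of the line

Because this pattern requires the trailing \n, it has to be applied before the line is chomped. Timestamps with a third field, such as 1:35:35, contain a second colon and would need the pattern extended accordingly.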
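The difference between the original test and the more specific pattern can be checked with a short sketch. The function names strict_match and lenient_match and the sample lines (including the trailing space and the \r) are assumptions for illustration.

```perl
use strict;
use warnings;

# The original test from the question: only digits and colons allowed.
sub strict_match  { return $_[0] =~ /^[0-9:]+$/ ? 1 : 0 }

# The more specific pattern from this answer; it expects a line
# that still has its line ending (not chomped).
sub lenient_match { return $_[0] =~ /^\s*[0-9]+:[0-9]+\s*\r?\n$/ ? 1 : 0 }

# Hypothetical sample lines.
for my $line ("0:12\n", "0:12 \n", "0:12\r\n", "well good morning\n") {
    # Make the line ending visible in the printed label.
    (my $shown = $line) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    printf "%-24s strict=%d lenient=%d\n",
        qq("$shown"), strict_match($line), lenient_match($line);
}
```

With these samples, the strict test accepts only the first line, while the more specific pattern accepts the first three and still rejects the text line.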

-1
votes

The problem seems to be whitespace at the end of the timestamp lines. Try this:

#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";
my $result = ''; 
while (my $line = <$in>) {
    $line =~ s/^\s+|\s+$//g;
    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp
        print $out "$result\n" if $result;
        $result = "$line,";
    } elsif ($line !~ /^$/) {
        # Line is text and is not empty
        $result .= " $line";
    }
}
print $out "$result\n" if $result;  # flush the last record, if any
close($in);
close($out);

print "Formatting complete. Output written to $output_file.\n";