perl 使用逗号格式化文本 youtube 转录服务

Question

如果您访问 YouTube 并点击转录按钮，对于某些视频，YouTube 有转录服务，您可以在右侧看到整个视频文本。如果您将其下载到文本文件，它就会采用这种格式。我想将时间戳与文本分开。

0:12 
well good morning everybody thank you for joining us here at the National Shrine of the 
Divine Mercy it is
0:18 
Vietnamese day and uh we're glad that you could join us uh I have a strong tie 
0:24
to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31 
time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38 
today's topic really is so important and I'm coming from a aspect of a personal

我想要这样的格式，这样我就可以将其放入 Excel 工作表中，并根据需要分隔时间戳。

0:12, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
0:18, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
0:24, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
0:31, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
0:38, today's topic really is so important and I'm coming from a aspect of a personal

时间戳将输出到此 1:35:35

#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";
my $timestamp = '';
while (my $line = <$in>) {
    chomp $line;

    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp
        $timestamp = $line;
    } elsif ($line =~ /\S/) {
        # Line is text and is not empty
        print $out "$timestamp, $line\n";
    }
}
close($in);
close($out);

print "Formatting complete. Output written to $output_file.\n";

我写了上面的脚本，但是文件是这样的。它不应该是

, 0:12
, well good morning everybody thank you for joining us here at the National Shrine of the Divine Mercy it is
, 0:18
, Vietnamese day and uh we're glad that you could join us uh I have a strong tie
, 0:24
, to the Vietnamese people my father obviously serving in Southeast Asia being in Vietnam and uh my uh Seminary
, 0:31
, time I went to Seminary with a lot of the Vietnamese sisters so praise be to God uh we're glad you're with us and
, 0:38
, today's topic really is so important and I'm coming from a aspect of a personal

我也尝试过这个

sed 's/^\([0-9:]*\) \(.*\)$/\1\:\2/'

Answer 1

根据您的实际输入是什么，解决方案可能很简单：

perl -pe'chomp if /\d\s*$/' input.txt > output.txt

简单地检查一行是否以数字（和可选的空格）结尾，如果是，则删除换行符。并打印所有内容。 Perl 将读取输入文件，shell 重定向将定向输出。

现在您可能有一些尚未向我们展示的更复杂的东西。如果使用 Data::Dumper 的 useqq 选项打印输入文件，您可能会发现新的东西：

use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper <$inputfile>;

例如，您还可以通过插入空格和逗号来使输入更具可读性

perl -pe's/\s*\n/, / if /\d\s*$/'

Answer 2

当该行与

\d+:\d+

后跟换行符匹配时，只需将换行符替换为逗号和空格 (

)。然后打印可能修改的行。

作为一句：

perl -pe's/^\d+:\d+\K\n/, /'

作为脚本：

while ( <> ) {
   s/^\d+:\d+\K\n/, /;
   print;
}

Answer 3

简短的分析是，测试

if ($line =~ /^[0-9:]+$/) {

失败了。这可能是由 Windows 终止方式引起的，或者时间戳后面可能有空格或制表符。您可以选择更简单的图案

if ($line =~ /[0-9:]+/) {

存在文本行也可能匹配的风险，或者更具体的模式：

if ($line =~ /^\s*[0-9]+:[0-9]+\s*\r?\n$/) {

```
^
```
行首
```
\s*
```
任意数量的空格
```
[0-9]+
```
至少一个数字
```
:
```
冒号
```
[0-9]+
```
至少一个数字
```
\s*
```
任意数量的空格
```
\r?
```
可能的窗口行尾
```
\n
```
a
```
$
```
完结了

Answer 4

问题似乎是时间戳末尾的空格，试试这个：

#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV != 2) {
    die "Usage: $0 input_file output_file\n";
}
my ($input_file, $output_file) = @ARGV;
open(my $in, '<', $input_file) or die "Cannot open input file '$input_file': $!";
open(my $out, '>', $output_file) or die "Cannot open output file '$output_file': $!";
my $result = ''; 
while (my $line = <$in>) {
    $line =~ s/^\s+|\s+$//g;
    if ($line =~ /^[0-9:]+$/) {
        # Line is a timestamp
        print $out "$result\n" if $result;
        $result = "$line,";
    } elsif ($line !~ /^$/) {
        # Line is text and is not empty
        $result .= " $line";
    }
}
print $out "$result\n";
close($in);
close($out);

print "Formatting complete. Output written to $output_file.\n";

perl 使用逗号格式化文本 youtube 转录服务

问题描述投票：0回答：4

4个回答

最新问题

perl 使用逗号格式化文本 youtube 转录服务

问题描述 投票：0回答：4

4个回答

最新问题

问题描述投票：0回答：4