Suppose I have a sentence of text:
$body = 'the quick brown fox jumps over the lazy dog';
I want to break that sentence into a hash of "keywords", but I want to allow multi-word keywords. I have the following to get the single-word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;
Once that is done, I have a hash that looks like this:
'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1
The next step, so that I can get two-word keywords, is the following:
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
But that only gets every "other" pair, because each match consumes both words. The result looks like this:
'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1
I also need the pairs that are offset by one word:
'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1
Is there an easier way to do this than the following?
my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
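While the described task might be fun to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.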
You can do something a bit funky with lookaheads.
If I do this:
$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;
That expression says: look ahead for two words (and capture them), but consume only one.
I get:
%words: {
'brown fox' => 1,
'fox jumps' => 1,
'jumps over' => 1,
'lazy dog' => 1,
'over the' => 1,
'quick brown' => 1,
'the lazy' => 1,
'the quick' => 1
}
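It seems I can generalize this by putting in a variable for the count. Note that the quantifier counts the words after the first one, so $n = 4 captures five-word phrases: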
my $n = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
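A minimal sketch of how that lookahead could be wrapped up so the argument is the phrase length itself. The sub name count_ngrams and the variable $extra are made up for illustration, and the sketch assumes words are separated by exactly one space, as in the example sentence:

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of $n single-space-separated words.
sub count_ngrams {
    my ($text, $n) = @_;
    my $extra = $n - 1;    # words that follow the first word of each phrase
    my %count;
    # Look ahead and capture $n words, but consume only the first one,
    # so the next match starts at the following word.
    $count{$_}++ for $text =~ /(?=(\w+(?: \w+){$extra}))\w+/g;
    return \%count;
}

my $body     = 'the quick brown fox jumps over the lazy dog';
my $unigrams = count_ngrams($body, 1);
my $trigrams = count_ngrams($body, 3);
print "$_ => $trigrams->{$_}\n" for sort keys %$trigrams;
```

Calling it once per phrase length and merging the returned hashes would give the combined keyword hash the question asks for.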
I would use a look-ahead to collect everything but the first word. That way, the match position automatically advances correctly:
my $body = 'the quick brown fox jumps over the lazy dog';
my %words;
++$words{$1} while $body =~ m/(\w+)/g;
++$words{"$1 $2"} while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;
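If you want to stick with single spaces instead of \s+, you can simplify this a bit (don't forget to remove the /x modifier if you do), because you can then collect any number of words in $2 rather than using one group per word.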
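The single-space simplification described above might look like the following. This is only a sketch of one reading of that suggestion, assuming every pair of words is separated by exactly one space; everything after the first word is gathered in $2:

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# One word.
++$words{$1} while $body =~ /(\w+)/g;
# Two words: consume the first word and the space, look ahead for the second.
++$words{"$1 $2"} while $body =~ /(\w+) (?=(\w+))/g;
# Three words: $2 now holds both trailing words in a single group.
++$words{"$1 $2"} while $body =~ /(\w+) (?=(\w+ \w+))/g;

print "'$_' => $words{$_}\n" for sort keys %words;
```

Because the look-ahead consumes nothing, each match ends right before the second word, which is where the next match begins.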
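Is there any particular reason for doing this with regexes alone? To me, the obvious approach would be to split the text into an array, then use a pair of nested loops to extract the counts from it. Something along these lines: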
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;
my @words = split / /, $text;
my %counts;
for my $pos (0 .. $#words) {
for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
$counts{$phrase}++;
}
}
use Data::Dumper;
print Dumper(\%counts);
Output:
$VAR1 = {
'over the lazy' => 1,
'the' => 2,
'over' => 1,
'brown fox jumps' => 1,
'brown fox' => 1,
'the lazy dog' => 1,
'jumps over' => 1,
'the lazy' => 1,
'the quick brown' => 1,
'fox jumps' => 1,
'over the' => 1,
'brown' => 1,
'fox jumps over' => 1,
'quick brown' => 1,
'jumps' => 1,
'lazy' => 1,
'jumps over the' => 1,
'lazy dog' => 1,
'dog' => 1,
'quick brown fox' => 1,
'fox' => 1,
'the quick' => 1,
'quick' => 1
};
Edit: Fixed the $phrase_len loop to prevent the use of negative indexes, which caused incorrect results, per cjm's comment.
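As a variation on the same split-and-loop idea, the loops can be arranged so that negative indexes never come up: iterate over the phrase length first, then over every start position that leaves room for a phrase of that length. This is a sketch rather than a replacement for the program above; the names $len and $start are illustrative, and split ' ' (which splits on any run of whitespace) is used here instead of split / /:

```perl
use strict;
use warnings;

my $text      = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;
my @words     = split ' ', $text;
my %counts;

for my $len (1 .. $max_words) {
    # The last valid start index is @words - $len; if the text has fewer
    # than $len words the range is empty and the loop body never runs.
    for my $start (0 .. @words - $len) {
        my $phrase = join ' ', @words[$start .. $start + $len - 1];
        $counts{$phrase}++;
    }
}

print "'$_' => $counts{$_}\n" for sort keys %counts;
```

For the example sentence this should produce the same set of phrases and counts as the program above.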
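The pos operator
pos SCALAR
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).
The @- special array
@LAST_MATCH_START
@-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if the subpattern did not match.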
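For example, the program below grabs the second word of each pair in its own capture and rewinds the match position to the start of that capture, so the second word becomes the first word of the next pair: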
#! /usr/bin/perl
use warnings;
use strict;
my $body = 'the quick brown fox jumps over the lazy dog';
my %words;
while ($body =~ /(\w+ (\w+))/g) {
++$words{$1};
pos($body) = $-[2];
}
for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
print "'$_' => $words{$_}\n";
}
Output:
'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
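To make the rewinding easier to follow, here is a small sketch that records the offsets involved on a shorter string. The string and the @seen array are made up for illustration; the loop itself uses the same pos and @- technique as the program above:

```perl
use strict;
use warnings;

my $s = 'the quick brown';
my @seen;

while ($s =~ /(\w+) (\w+)/g) {
    # $-[0] is where the whole match starts, $-[2] is where the second
    # word starts, and pos($s) is where the match left off.
    push @seen, sprintf "pair='%s %s' start=%d second=%d end=%d",
        $1, $2, $-[0], $-[2], pos($s);
    # Rewind so the next search begins at the second word.
    pos($s) = $-[2];
}

print "$_\n" for @seen;
# prints:
# pair='the quick' start=0 second=4 end=9
# pair='quick brown' start=4 second=10 end=15
```

Without the assignment to pos($s), the second search would start at offset 9 and the pair 'quick brown' would be skipped.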