Sentence-based preg_split


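I have the following script to split text into sentences. Besides punctuation, there are a few phrases I also want to treat as the end of a sentence. This works fine when the marker is a single character, but not when the phrase contains spaces.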

Here is the code I have that works:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n]              # or end of sentence punct and quote.
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",    
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\.              # or "Sr.",
| T\.V\.A\.         # or "T.V.A.",
| a\.m\.            # or "a.m.",
| p\.m\.            # or "p.m.",
| a€¢\.
| :\.

                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.

/ix';

Here is an example phrase I am trying to add: "Total Gross Income"

I have tried formatting it in these ways, but none of them work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n]              # or end of sentence punct and quote.
| HYPERLINK
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)  

For example, if I have the following code:

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    echo $i . " - " . $sentences[$i] . "<BR>";
}

The result I get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold 

What I want to get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold 

What am I doing wrong?

php regex
2 Answers
1
vote

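Your problem is the whitespace requirement that follows your lookbehind: the pattern needs at least one whitespace character to split on, but if you remove that requirement you end up capturing the preceding letters and breaking the whole thing.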

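So, as far as I can tell, you can't do this entirely with lookarounds. You still need lookarounds for some of the expressions (whitespace preceded by punctuation, and so on), but for the specific phrases you can't use them.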
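To illustrate the point about the trailing whitespace, here is a minimal sketch. The pattern is a simplified stand-in for the asker's lookbehind (only the one phrase, an assumption made for brevity), and the two sample strings are invented for the demonstration.

```php
<?php
// Simplified stand-in pattern: split on whitespace that follows the phrase.
$re = '/(?<=Total\sGross\sIncome)\s+/i';

// Whitespace follows the phrase, so there is something to split on.
$withSpace = preg_split($re, 'Total Gross Income Total Resources');

// No whitespace follows the phrase, so \s+ can never match at that point.
$withoutSpace = preg_split($re, 'Total Gross IncomeTotal Resources');

print_r($withSpace);    // two elements
print_r($withoutSpace); // one element: the string comes back unsplit
```

The second call shows why a lookbehind followed by `\s+` cannot separate a phrase that runs straight into the next word.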

You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture what you are splitting on. Something like this should get you started:

$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    if (!ctype_space($sentences[$i])) {
        echo $i . " - " . $sentences[$i] . "<br>";
    }
}

Output:

0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
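The indexes in the output above skip numbers because the captured whitespace delimiters occupy array slots of their own. If consecutive numbering is wanted, one possible approach (a sketch, not part of the original answer) is to drop the whitespace-only chunks and reindex before printing. The sample text here is shortened and invented for the demonstration.

```php
<?php
$re = '/((?<=[.?!])\s+|Total\sGross\sIncome)/i';
$text = 'First one. Second one? Total Gross IncomeTotal Resources.';

// Captured delimiters (whitespace runs and the phrase) are returned as pieces too.
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Drop the whitespace-only delimiters, then reindex from 0.
$sentences = array_values(array_filter($parts, static function ($part) {
    return trim($part) !== '';
}));

foreach ($sentences as $i => $sentence) {
    echo $i . ' - ' . $sentence . PHP_EOL;
}
```

Filtering first keeps the loop free of the `ctype_space()` check and makes the printed index match the sentence count.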

0
votes

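Let's shift perspective from the original split-then-loop approach to injecting the strings directly at the appropriate places. All of the processing can be done inside preg_replace_callback().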

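  • Match and discard non-sentence-endings with (*SKIP)(*FAIL).
  • Match the start of the string.
  • Match any vertical whitespace or tabs.
  • Match whitespace before the whitelisted word Total.
  • Match whitelisted sentence endings, forget them with \K, then match the whitespace.
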
$regex = <<<'REGEX'
/                     # Split sentences on whitespace between them.
(?:
  (?:                 # skip-fail words
    Mr\.              # "Mr."
    | Mr?s\.          # or "Mrs." or "Ms.",
    | [DJS]r\.        # or "Jr." or "Dr." or "Sr.",
    | Prof\.          # or "Prof.",
    | U\.S\.(?:A\.)?
    | T\.V\.A\.       # or "T.V.A.",
    | [ap]\.m\.       # or "a.m." or "p.m.",
    | a€¢\.
    | :\.
  )
  (*SKIP)(*FAIL)
)
|
^
|
[\t\v]+
|
\s+(?=Total)
|
(?:
  (?:
    [.!?:#*]
    | [.!?:]['"]
    | HYPERLINK
    | \.org
    | \.gov
    | \.aspx
    | \.com
    | Date
    | Dear  
  )
  \K\s+    # Split on whitespace between sentences.
)
/ix
REGEX;

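In the replacement callback, initialize a static counter, then conditionally prepend the line break and add the numbered prefix at each split point.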

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

echo preg_replace_callback(
         $regex,
         function ($m) {
             static $i = 77;
             return ($i === 77 ? '' : '<br>' . PHP_EOL) . $i++ . ' - ';
         },
         $block_o_text
     );

Output:

77 - You could receive the wrong amount.<br>
78 - If you receive more benefits than you    should, you must pay them back.<br>
79 - When will we review your case?<br>
80 - An eligibility review form will be sent before your benefits stop.<br>
81 - Total Gross Income<br>
82 - Total ResourcesMedical ProgramsHousehold.
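
For readers unfamiliar with the two constructs this answer relies on, here is a reduced sketch of how (*SKIP)(*FAIL) and \K interact. The pattern and sample text are cut down and invented for illustration, and plain preg_replace() with a newline replacement is used here instead of the callback.

```php
<?php
// Branch 1: consume an abbreviation, then discard it and resume after it.
// Branch 2: match end punctuation, forget it with \K, match the whitespace.
$pattern = '/(?:Mr|Dr)\.(*SKIP)(*FAIL)|[.!?]\K\s+/';
$text = 'Dr. Smith arrived. Mr. Jones left! Done?';

// Only the whitespace after \K is replaced; the punctuation itself is kept.
$out = preg_replace($pattern, "\n", $text);

echo $out;
```

Because (*SKIP) moves the next match attempt past the abbreviation, the period in "Dr." and "Mr." is never offered to the second branch, so no negative lookbehind is needed.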