Regular expression to match HTML tags with specific attributes


I'm trying to match all HTML tags that do not have a "term" or "range" attribute.

Here is a sample of the HTML:

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

My regex is:

<(.*?)((?!\bterm\b).)>

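Unfortunately, this matches all tags... It would be nice if the inner text were not matched either, because I need to filter out every tag except the ones that have that specific attribute.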
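The behaviour described above can be reproduced with a short sketch. The question does not name a language, so Python is an assumption here, and the variable names are illustrative only. The likely cause is that the negative lookahead guards only the single character just before `>`, while `(.*?)` is free to consume `term` earlier in the tag.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'<(.*?)((?!\bterm\b).)>')

# The MEDICATIONS line from the sample HTML.
line = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# group(0) is the whole match; the lookahead is only tested at the last
# character before '>', so the tag carrying term= is matched as well.
matches = [m.group(0) for m in pattern.finditer(line)]
for tag in matches:
    print(tag)
```

Every tag on the line is printed, including `<span term="Advil" range="true">` and both closing tags.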

regex pattern-matching string-matching
5 Answers
16 votes

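If you are fond of regexes, this works for me. (Note: it does not filter out comments, doctypes, or other entities.
Another caveat: tags can be embedded in scripts, comments, and other content.)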

span tags (w/ attr) with no term|range attribute

'<span
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

Any tag (w/ attr) with no term|range attribute

'<[A-Za-z_:][\w:.-]*
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

Any tag (w/ or w/o attr) with no term|range attribute

'<
  (?:
    [A-Za-z_:][\w:.-]*
    (?=\s)
    (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
    \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
  |
    /?[A-Za-z_:][\w:.-]*\s*/?
  )
>'

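Update

An alternative to using the (?>) construct.
The regexes below match tags that have no "term|range" attribute.
Flags = (g) global and (s) dotall

span tags with attributes
Link: http://regexr.com?2vrjr
Regex: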

<span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

Any tag with attributes
Link: http://regexr.com?2vrju
Regex:

<[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

Any tag, with or without attributes
Link: http://regexr.com?2vrk1
Regex:

<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>

'Match all tags except those with term="occasionally"'

Link: http://regexr.com?2vrka

<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
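As a rough check of the "any tag with attributes" pattern from the update, here is a minimal Python sketch. Python, the constant name `ANY_TAG_WITH_ATTRS`, and the shortened two-line sample are assumptions made for illustration; the pattern text itself is copied from the answer. This variant is used because it contains no `(?>)` atomic group, which older versions of Python's `re` module do not accept.

```python
import re

# "Any tag with attributes" pattern from the update, copied as-is.
ANY_TAG_WITH_ATTRS = re.compile(
    r"""<[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>""",
    re.DOTALL,  # the (s) flag; (g) corresponds to findall/finditer
)

# Two lines taken from the sample HTML in the question.
sample = (
    '<span class="inline prewrap strong">DATE:</span>    12/01/10\n'
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# No capturing groups are present, so findall returns whole matches.
results = ANY_TAG_WITH_ATTRS.findall(sample)
for tag in results:
    print(tag)
```

Only the two `class` spans are printed. Closing tags are not matched by this variant, because it requires whitespace after the tag name; the "with or without attributes" variant would also return every `</span>`, including the one that closes the `term` span.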


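2 votes

I think you should use an HTML parser for this problem. Writing your own regular expression is possible, but it is bound to be error-prone. Imagine your code contains an expression like this: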

< span      class = "a"              >b< / span         >

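It is also valid, but accounting for every possible space and tab in a regular expression is not easy, and it takes testing to be sure the pattern works as expected.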
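The answer does not name a parser, so the following is only one possible sketch of the parser-based approach, using `html.parser` from the Python standard library. The class name `TagFilter` and its `kept` attribute are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Collect start tags that carry neither a term nor a range attribute."""

    EXCLUDED = {"term", "range"}

    def __init__(self):
        super().__init__()
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if not names & self.EXCLUDED:
            # get_starttag_text() returns the raw text of the current start tag.
            self.kept.append(self.get_starttag_text())


html = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

parser = TagFilter()
parser.feed(html)
parser.close()
print(parser.kept)
```

Because the parser tokenizes the attributes itself, the filter is a set comparison on attribute names and does not depend on spacing or quoting style inside the tag.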


2 votes

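This will do what you want. It is written for a Perl program, so the syntax may differ depending on the language you use.

/(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /igx

The code below demonstrates the pattern in a Perl program.

use strict;
use warnings;

# Reject any position from which term= or range= can be reached before the
# next '>', then capture an opening tag.
my $pattern = qr/ (?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /ix;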

my $str = <<'END';

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

END

print "$_\n" foreach $str =~ /$pattern/g;

Output

<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
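Since the answer notes that the syntax varies by language, here is a hedged sketch of the same idea ported to Python. The port is an assumption, not part of the answer: `re.IGNORECASE` and `re.VERBOSE` stand in for Perl's `/i` and `/x`, `findall` stands in for `/g`, and the attribute names are taken from the question (`term` and `range`).

```python
import re

# Perl's /x and /i become re.VERBOSE and re.IGNORECASE.
pattern = re.compile(
    r"""
    (?! [^>]+ \b(?:term|range)= )   # no term= or range= before the next '>'
    (<[a-z]+.*?>)                   # capture an opening tag
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Two lines taken from the sample HTML in the question.
text = (
    '<span class="inline prewrap strong">DATE:</span>    12/01/10\n'
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# With a single capturing group, findall returns the captured tags.
tags = pattern.findall(text)
for tag in tags:
    print(tag)
```

One design point to keep in mind: `\b` also matches after a hyphen, so an attribute such as `data-term=` would exclude its tag under this pattern as well.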

0 votes

<\w+\s+(?!term).*?>(.*?)</.*?>

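0 votes

I think this regex works fine.

This regex selects any HTML tag that has a style attribute (it expects style to come right after the tag name).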

<\s*\w*\s*style.*?>

You can try it out at https://regex101.com