Regex filter in Opensearch (or Elasticsearch)

Question

This is a sample JSON document indexed in Opensearch:

{
  "_index": "filebeat-7.12.1-2024.08.28",
  "_type": "_doc",
  "_id": "RF64mZEBFMf-66jeR0WD",
  "_version": 1,
  "_score": null,
  "_source": {
    "cloud": {},
    "message": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s",
    "event": {
      "created": "2024-08-28T18:01:15.557Z"
    }
  },
  "fields": {
    "event.created": [
      "2024-08-28T18:01:15.557Z"
    ]
  },
  "highlight": {
    "logger.type": [
      "@opensearch-dashboards-highlighted-field@WLS@/opensearch-dashboards-highlighted-field@"
    ],
    "message": [
      "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:@opensearch-dashboards-highlighted-field@1.5s@/opensearch-dashboards-highlighted-field@"
    ]
  },
  "sort": [
    1,
    1724868075557
  ]
}

I want to filter on the message field with a regexp query. Here is its mapping:

        "message" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }

This regexp DSL filter matches the time part of the message field:

{
  "query": {
    "regexp": {
      "message": {
        "value": "[0-9]\\.?[0-9]*s"
      }
    }
  }
}

This regexp DSL filter, which tries to match the whole text part of the message field, fails:

{
  "query": {
    "regexp": {
      "message": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

This DSL filter also fails:

{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

The text value of the message field being matched in the examples above:

"%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"

The difference between the two regex patterns:

"value": "Q.*[0-9]\\.?[0-9]*s"
"value":    "[0-9]\\.?[0-9]*s"

Please suggest a DSL filter that uses a regex pattern such as

"Query from ES took:[0-9]\\.?[0-9]*s"

to match text such as

Query from ES took:12.553s

The time value can range from 0 to 999.999.

Answer 1

You are using this mapping for the message field:

{
  "message": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

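If you are using the standard tokenizer, the message field is tokenized and the regexp is matched against the individual tokens, not against the whole text. One of those tokens is 1.5s, so the following query finds a match: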

{
  "query": {
    "regexp": {
      "message": {
        "value": "[0-9]\\.?[0-9]*s"
      }
    }
  }
}
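The token-level behaviour can be imitated outside the cluster. The following Python sketch is only an illustration: the token list is an assumed tail of the standard analyzer's output for the sample message (lowercased, split at the colon), the helper name is hypothetical, and re.fullmatch stands in for the regexp query, which has to match a whole term.

```python
import re

# Assumed tail of the standard analyzer output for the sample message.
# The standard analyzer lowercases tokens, so "Query" is indexed as "query".
ASSUMED_TOKENS = ["query", "from", "es", "took", "1.5s"]


def matching_tokens(pattern, tokens):
    """Return the tokens that the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [tok for tok in tokens if compiled.fullmatch(tok)]


# Pattern that targets only the time part.
time_only = matching_tokens(r"[0-9]\.?[0-9]*s", ASSUMED_TOKENS)
# Pattern that tries to span several words.
whole_text = matching_tokens(r"Q.*[0-9]\.?[0-9]*s", ASSUMED_TOKENS)

print(time_only)
print(whole_text)
```

Under these assumptions only the 1.5s token satisfies the first pattern, and no single token can satisfy the second one, because the phrase it describes is spread over several tokens.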

If you use this query:

{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

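you are searching the keyword field, which is not analyzed, so the pattern has to match the entire field value. The value does not start with Q, so update the regexp to cover the whole field: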

{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": ".*Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}

If there is more text after the final s character, you can match the rest of the line with:

"value": ".*Q.*[0-9]\\.?[0-9]*s.*"
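As a rough check of the whole-value rule, and of a pattern for the 0 to 999.999 range asked about in the question, here is a Python sketch. The DURATION pattern and the keyword_regexp_query helper are assumptions made for illustration, not part of the original answer, and re.fullmatch only approximates the whole-value matching of the regexp query.

```python
import json
import re

MESSAGE = (
    "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] "
    "c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"
)

# Hypothetical pattern: 1 to 3 integer digits, optionally a dot and 1 to 3 decimals.
DURATION = r"[0-9]{1,3}(\.[0-9]{1,3})?s"
# Leading ".*" absorbs everything before the phrase, as in the answer above.
PATTERN = r".*Query from ES took:" + DURATION


def keyword_regexp_query(pattern, field="message.keyword"):
    """Build a regexp query body for the given field (hypothetical helper)."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


# The pattern without a leading ".*" cannot cover the whole value.
unanchored_fails = re.fullmatch(r"Q.*[0-9]\.?[0-9]*s", MESSAGE) is None
# The prefixed pattern covers the sample message and a multi-digit duration.
prefixed_matches = re.fullmatch(PATTERN, MESSAGE) is not None
long_value_matches = re.fullmatch(PATTERN, "x Query from ES took:12.553s") is not None

print(json.dumps(keyword_regexp_query(PATTERN), indent=2))
```

The printed body has the same shape as the queries above. Note that with "ignore_above": 256 in the mapping, values longer than 256 characters are not indexed in message.keyword, so longer log lines would not be found through this field.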

Note that you can check what the tokens look like by sending a POST request to the _analyze API with this payload:

{
  "analyzer": "standard",
  "text": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl            : Query from ES took:1.5s"
}

You will then see that one of the tokens is:

"token": "1.5s"
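A minimal Python sketch of preparing that request with the standard library is shown here. The URL http://localhost:9200 is an assumption about where the cluster listens, and build_analyze_request is a hypothetical helper name. The request is only built, not sent.

```python
import json
import urllib.request

# Assumed local endpoint; replace with the address of your own cluster.
ANALYZE_URL = "http://localhost:9200/_analyze"


def build_analyze_request(text, analyzer="standard", url=ANALYZE_URL):
    """Prepare (but do not send) a POST request for the _analyze API."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("Query from ES took:1.5s")
# To send it against a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     tokens = [t["token"] for t in json.load(response)["tokens"]]
```

Authentication and TLS settings are left out and would have to be added for a secured cluster.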

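The documentation states:

The standard tokenizer provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

There is a section on "Word Boundary Rules" at https://unicode.org/reports/tr29/#Word_Boundary_Rules which mentions:

Do not break within sequences, such as "3.2" or "3,456.789".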

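That is why the initial regexp on the message field

[0-9]\\.?[0-9]*s

matches the token

1.5s

whereas no single token contains the whole phrase "Query from ES took:1.5s".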
