Elasticsearch: searching for a list of keywords that may contain misspellings

Question

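What is the best way to search for a set of keywords in a field that holds a large text in an Elasticsearch index?

I want to search for some words in a field named `my_field`, with the following constraints:

- I can pass the words either as separate elements or together as a single string with a delimiter such as a space. What matters is that every word is searched.
- The words may contain misspellings or may be written in different ways. For example, OpenAI may be written as `open ai` or `openai` (lowercase). I want to search for all of these variants, but rank exact matches first.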

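Let's take an example. My words are `cto`, `open`, and `ai`.

So I can pass them separately or treat them as one string, "cto open ai", the way a Google search works. The words could also be `cto` and `openai`, because they come from an algorithm that extracts keywords from text, and that algorithm may or may not split a single keyword into two "common" words.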

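The document I want as the first result has a `my_field` containing a long text such as ".....cto.....open ai...". So I tried a `match` query, because I read that it has a `fuzziness` parameter to control the Levenshtein distance.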
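As a side note on what `fuzziness` actually permits, here is a minimal Python sketch, not Elasticsearch's implementation. It assumes the documented defaults for `AUTO` (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit. The helper names `edit_distance`, `auto_max_edits`, and `fuzzy_hit` are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Allowed edits for a query term, assuming the default AUTO thresholds (3 and 6)."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


def fuzzy_hit(query_term: str, indexed_term: str) -> bool:
    """True if a single indexed term is within the AUTO edit budget of the query term."""
    return edit_distance(query_term, indexed_term) <= auto_max_edits(query_term)
```

Under these assumptions, `fuzzy_hit("openai", "open")` is true (two deletions), `fuzzy_hit("openai", "ai")` is false, and `fuzzy_hit("cto", "to")` is true. Fuzziness is applied per term, so it can loosen `openai` into `open` but cannot join the two indexed terms `open` and `ai`. With `AUTO`, short terms such as `cto` may also match unrelated words, which could be one reason fuzzy rankings look noisy.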

With these 2 queries, the result is found:

Working query 1 (fuzziness `0` with 3 terms): ✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "0"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

Working query 2 (fuzziness `0` with 1 string): ✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

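(This works even if I change the order of the words in `query`.)

But I want to get the same result even when the text contains `open ai` and my query has `openai`, because that is only a small variation/misspelling.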

So I tried:

Failing query 3 (fuzziness `AUTO` with 2 terms and the misspelling): ❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

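But it returns other results ahead of the expected document. Strangely, if I use the same query as in case 1 but with `AUTO` instead of `0`, it also ranks other documents first, documents that may contain only one of the three words in `my_field` rather than all three. I know that one document contains exactly all 3 words, so I don't understand why it is not ranked first:

Failing query 4 (fuzziness `AUTO` with the 3 original terms previously used with `0`): ❌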

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

I also tried a mixed approach, giving a `boost` to the matches without "fuzziness"="AUTO", but with no luck:

Failing query 5 (2 terms with the misspelling, mixed fuzziness): ❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}}, 
        { "match": { "my_field": { "query": "openai", "boost": 10  }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

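So how can I make the query flexible enough to tolerate all these misspellings/small variations, while still ranking first the documents that exactly contain one of the possible combinations?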

Answer 1

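I would index my_field twice: once as is, and a second time where I first split words on case changes and then use a shingle filter to combine adjacent words into tuples (two-word pairs). At search time I would query both the original field and the tuples field, giving the original field a higher boost.

There are different ways to do this, depending on how many words you want to match, the boost levels, and so on, but hopefully this example gets you started: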

DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "tuples_index": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "token_separator": ""
        },
        "tuples_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      }, 
      "analyzer": {
        "standard_shingle_index": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_index" ]
        },
        "standard_shingle_search": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_search" ]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tuples": {
            "type": "text",
            "analyzer": "standard_shingle_index",
            "search_analyzer": "standard_shingle_search"
          }
        }
      }
    }
  }
}
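To make the analyzer settings above more concrete, here is a rough Python approximation of what the two analyzers emit. It is a sketch under simplifying assumptions, not the real Lucene behavior: the regular expressions only mimic the standard tokenizer and `word_delimiter` for plain ASCII text, and the function names (`tokenize`, `shingles`, `index_terms`, `search_terms`, `matches`) are invented for illustration.

```python
import re

# Split points inside a word: lower->upper case change, or a letter/digit boundary.
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)


def tokenize(text: str) -> list[str]:
    """Rough stand-in for standard tokenizer + word_delimiter + lowercase."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        tokens.extend(_BOUNDARY.sub(" ", word).split())
    return [t.lower() for t in tokens]


def shingles(tokens: list[str], output_unigrams: bool) -> list[str]:
    """Two-word shingles joined with an empty separator, as in the filters above."""
    pairs = [a + b for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs if output_unigrams else pairs


def index_terms(text: str) -> set[str]:
    """Terms stored in my_field.tuples (shingles only)."""
    return set(shingles(tokenize(text), output_unigrams=False))


def search_terms(text: str) -> set[str]:
    """Terms produced from the query string (unigrams plus shingles)."""
    return set(shingles(tokenize(text), output_unigrams=True))


def matches(query: str, document: str) -> bool:
    """True if at least one query term equals a term indexed for the document."""
    return bool(search_terms(query) & index_terms(document))
```

In this approximation, both "officer of OpenAI" and "CTO of Open AI" index the term `openai` in the tuples field, so `matches("openai", ...)`, `matches("OpenAI", ...)`, and `matches("open ai", ...)` all succeed against either text. The real token stream can be checked with the `_analyze` API on the index.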

PUT my_index/_bulk?refresh
{"index": {}}
{"my_field": "Mira Murati (born 1988) is a United States-based, Albanian-born engineer, researcher and business executive. She is currently the chief technology officer of OpenAI, the artificial intelligence research company that develops ChatGPT." }
{"index": {}}
{"my_field": "Women You Should Know: Mira Murati, CTO of Open A.I." }

GET my_index/_validate/query?explain

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "OpenAI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "OpenAI"
            }
          }
        }
      ]
    }
  }
}

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "Open AI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "Open AI"
            }
          }
        }
      ]
    }
  }
}
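Since the question starts from a list of keywords, the two searches above can be generated from that list. The following Python sketch only builds the request body as a dictionary. `build_keyword_query` and its parameters are hypothetical names, the default boost of 2 simply mirrors the example queries, and sending the body to Elasticsearch is left to whichever client you use.

```python
import json


def build_keyword_query(keywords, field="my_field", exact_boost=2.0):
    """Build a bool/should body that searches the original field (boosted)
    and its `.tuples` sub-field with the same space-joined keyword string."""
    text = " ".join(keywords)
    return {
        "query": {
            "bool": {
                "should": [
                    # Original field: plain matches get the higher boost.
                    {"match": {field: {"query": text, "boost": exact_boost}}},
                    # Shingled sub-field: catches "openai" vs "open ai".
                    {"match": {f"{field}.tuples": {"query": text}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_keyword_query(["cto", "openai"]), indent=2))
```

Joining the keywords into one string lets the search analyzer form shingles across neighbouring keywords, so `["cto", "open", "ai"]` and `["cto", "openai"]` both produce the term `openai` for the tuples field.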
