What is the best way to search for certain keywords in a field containing a large text in an ElasticSearch index?
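I want to search for some words in a field named my_field, with the following constraint: a word may appear in small variations, such as open ai or openai (lowercase). I want to search for all of these combinations, but prioritize results with an exact match. Let's take an example. My words are: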
cto
open
ai
So I can either split them or treat them as a single string, "cto open ai", the way the Google search engine does. The words could also be:
cto
openai
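because they come from an algorithm that extracts keywords from text, and it may or may not split a single keyword into 2 "common" words.
The document I want as the first result has a my_field containing a long text: ".....cto.....open ai...". So I tried the match query, because I read that it has a fuzziness parameter to control the Levenshtein distance.
With these 2 queries, the result is found: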
Working query 1 (fuzziness 0 with 3 terms): ✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "0" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
Working query 2 (fuzziness 0 with 1 string): ✅
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
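(even if I change the order of the words in query).
But I would like to find the same result even when open ai is written as openai, since that is only a small variation/typo. So I tried: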
Failing query 3 (fuzziness AUTO with 2 terms and the typo): ❌
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
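But it returns other results ahead of the expected one. Strangely, if I use the same query as in case 1 but with AUTO instead of 0, it also ranks other documents first, some of which may contain only 1 of the 3 words in my_field rather than all 3. I know that 1 document contains all 3 words exactly, so I don't understand why it is not prioritized:
Failing query 4 (fuzziness AUTO with the 3 original terms that worked with 0): ❌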
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
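I also tried a mixed approach, giving a boost to the match clauses that have no "fuzziness": "AUTO", but with no luck:
Failing query 5 (2 terms and the typo, mixing boosted exact clauses with fuzzy ones): ❌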
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}},
        { "match": { "my_field": { "query": "openai", "boost": 10 }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}},
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match": 1
    }
  }
}
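So how can I make the query flexible enough to tolerate all of these typos/small variations, while ranking first the documents that contain the possible combinations exactly?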
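I would index my_field twice: once as-is, and a second time where I first split words on case changes and then use a shingle filter to combine adjacent words into bigrams. At search time, I would query both the original field and the bigram field, giving the original field a higher boost.
There are different ways to do this, depending on how many words you want to match, the boost levels, and so on, but hopefully this example will help you get started: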
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "tuples_index": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "token_separator": ""
        },
        "tuples_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      },
      "analyzer": {
        "standard_shingle_index": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_index" ]
        },
        "standard_shingle_search": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_search" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tuples": {
            "type": "text",
            "analyzer": "standard_shingle_index",
            "search_analyzer": "standard_shingle_search"
          }
        }
      }
    }
  }
}
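To make the effect of the two analyzers easier to picture, here is a rough Python simulation of what they are meant to emit. It is only a sketch: split_words, bigrams, index_tokens and search_tokens are hypothetical helpers, and the splitting rule (break on non-alphanumerics and on lower-to-upper case changes) is a simplification of the standard tokenizer plus word_delimiter, not Elasticsearch's actual implementation.

```python
import re


def split_words(text):
    """Approximate 'standard' tokenizer + 'word_delimiter' + 'lowercase'.

    Simplification: break on lower-to-upper case changes and on any
    non-alphanumeric character, then lowercase every piece.
    """
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", spaced)]


def bigrams(words):
    """Join each pair of adjacent words with an empty separator,
    like a shingle filter of size 2 with token_separator ""."""
    return [a + b for a, b in zip(words, words[1:])]


def index_tokens(text):
    """Mimics tuples_index: bigrams only (output_unigrams: false)."""
    return bigrams(split_words(text))


def search_tokens(text):
    """Mimics tuples_search: single words plus bigrams (output_unigrams: true)."""
    words = split_words(text)
    return words + bigrams(words)


print(index_tokens("chief technology officer of OpenAI"))
print(search_tokens("OpenAI"))
print(search_tokens("Open AI"))
```

In this simulation both "OpenAI" and "Open AI" produce the search tokens open, ai and openai, and the indexed text contains the bigram openai, which is what lets either spelling hit my_field.tuples.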
PUT my_index/_bulk?refresh
{"index": {}}
{"my_field": "Mira Murati (born 1988) is a United States-based, Albanian-born engineer, researcher and business executive. She is currently the chief technology officer of OpenAI, the artificial intelligence research company that develops ChatGPT." }
{"index": {}}
{"my_field": "Women You Should Know: Mira Murati, CTO of Open A.I." }
GET my_index/_validate/query?explain
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "OpenAI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "OpenAI"
            }
          }
        }
      ]
    }
  }
}
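Since the request body has the same shape for every keyword string, it can be generated instead of written by hand. The following Python sketch shows one way to do that; build_query is a hypothetical helper, and the default boost of 2 simply mirrors the example request.

```python
import json


def build_query(keywords, field="my_field", boost=2):
    """Build a bool/should body that searches the original field with a
    boost and the '<field>.tuples' bigram sub-field without one."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": keywords, "boost": boost}}},
                    {"match": {f"{field}.tuples": {"query": keywords}}},
                ]
            }
        }
    }


# Print the body so it can be pasted after "GET my_index/_search".
print(json.dumps(build_query("cto open ai"), indent=2))
```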
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "Open AI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "Open AI"
            }
          }
        }
      ]
    }
  }
}
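As a side note on why the AUTO queries in the question rank other documents first, here is a minimal Python sketch. It assumes the documented default thresholds for "fuzziness": "AUTO" (terms of 1 to 2 characters must match exactly, terms of 3 to 5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit. The helper names levenshtein, auto_max_edits and fuzzy_matches are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def auto_max_edits(term):
    """Edits allowed by AUTO with its default thresholds (AUTO:3,6)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(query_term, indexed_term):
    """True if indexed_term is within the edit budget of query_term."""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)


for query_term, indexed_term in [("cto", "to"), ("openai", "open"),
                                 ("openai", "opened"), ("ai", "a"),
                                 ("open", "openai")]:
    print(query_term, indexed_term, fuzzy_matches(query_term, indexed_term))
```

Under these assumptions cto is within one edit of the very common word to, and openai is within two edits of open and opened, so many unrelated documents can satisfy the fuzzy clauses. By contrast, ai is too short to get any fuzziness, and open never reaches openai. This may explain part of the ranking, although the actual scores also depend on term statistics.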