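I feel this should be a very simple problem, but for some reason I can't solve it.
I want to build a product search engine using ElasticSearch. I'm having trouble with joined words; for example, I want to search for "smart watch". I run two different queries: (1) "smartwatch" and (2) "smart watch".
In (1), I get results that contain either "smartwatch" or "smart watch" in the product title. In (2), however, I only get products with "smart watch"; I don't get any variant without the space between "smart" and "watch".
Here is my index configuration: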
config = {
    "settings": {
        "analysis": {
            "analyzer": {
                "nGram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": ["html_strip", "custom_char_filter", "space_maker_2", "space_maker_3"],
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "nGram_filter"
                    ]
                },
                "whitespace_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "char_filter": ["space_maker_2", "space_maker_3"],
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "synonym_apply",
                        "special_stopwards"
                    ]
                }
            },
            "char_filter": {
                "custom_char_filter": {
                    "type": "mapping",
                    "mappings": [
                        "$ => dollar"
                    ]
                },
                "space_maker_1": {
                    "type": "pattern_replace",
                    "pattern": "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[a-z])",
                    "replacement": " "
                },
                "space_maker_2": {
                    "type": "pattern_replace",
                    "pattern": "(?<=\\p{Digit})(?=\\p{Alpha})|(?<=\\p{Alpha})(?=\\p{Digit})",
                    "replacement": " "
                },
                "space_maker_3": {
                    "type": "pattern_replace",
                    "pattern": "(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])",
                    "replacement": " "
                }
            },
            "filter": {
                "nGram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20,
                    "token_chars": [
                        "letter",
                        "digit",
                        "punctuation",
                        "symbol"
                    ]
                },
                "synonym_apply": {
                    "type": "synonym",
                    "lenient": "true",
                    "synonyms": [
                        "kilo, kilogram => kg",
                        "buck, dollar => usd"
                    ]
                },
                "special_stopwards": {
                    "type": "stop",
                    "stopwords": ["ass", "butt"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "brand": {
                "type": "keyword"
            },
            "category": {
                "type": "keyword"
            },
            "tags": {
                "type": "keyword"
            },
            "domain": {
                "type": "keyword"
            },
            "image": {
                "type": "text"
            },
            "purchases": {
                "type": "double"
            },
            "views": {
                "type": "double"
            },
            "price": {
                "type": "double"
            },
            "product_id": {
                "type": "text"
            },
            "product_url": {
                "type": "text"
            },
            "title": {
                "type": "text",
                "analyzer": "nGram_analyzer",
                "search_analyzer": "nGram_analyzer",
            },
            "description": {
                "type": "text"
            },
            "country": {
                "type": "integer"
            },
            "last_seen_date": {
                "type": "text"
            }
        }
    }
}
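At the moment I am only running a simple match query against the product title.
How can I change the query or the index to solve this? Or is it even solvable?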
Your problem is the nGram_analyzer. It is used at both index time and query time.
Suppose a document has the title "smart watch". The nGram_filter splits the title into these tokens:
sm, sma, smar, smart, wa, wat, watc, watch
The query text "smartwatch" is also split by the nGram_filter, into these tokens:
sm, sma, smar, smart, smartw, smartwa, smartwat, smartwatc, smartwatch
Elasticsearch finds 4 matching tokens (termFreq): sm, sma, smar, smart, and so it adds the document titled "smart watch" to the hits.
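The token overlap described above can be reproduced outside Elasticsearch with a rough simulation. This is only a sketch: the helpers `edge_ngrams` and `analyze` are hypothetical names, and they approximate nGram_analyzer as "split on whitespace, lowercase, take edge n-grams of length 2 to 20", ignoring the char filters and asciifolding from the config.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` that are min_gram..max_gram characters long."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Rough stand-in for nGram_analyzer: whitespace split, lowercase, edge n-grams."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word))
    return tokens


indexed = analyze("smart watch")  # tokens stored for the document title
queried = analyze("smartwatch")   # tokens produced from the query text
shared = [t for t in queried if t in indexed]

print(indexed)
print(queried)
print(shared)
```

The `shared` list holds the four prefixes that both sides have in common, which is why the "smart watch" document becomes a hit for the query "smartwatch".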
You can verify this with the query below. Look for termFreq=4.0 in the response.
POST /<your index>/_explain/<id of document with title "smart watch">
{
    "query": {
        "match": {
            "title": "smartwatch"
        }
    }
}
Yes, this can be solved by replacing the search_analyzer. For example, override the analyzer in the query itself:
POST /<your index>/_search
{
    "query": {
        "match": {
            "title": {
                "query": "smartwatch",
                "analyzer": "keyword"
            }
        }
    }
}
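Instead of overriding the analyzer in every query, the replacement could also be made in the mapping. The sketch below is an assumption rather than a tested fix: it reuses the whitespace_analyzer already defined in the question's config as the search analyzer for title, so query text is no longer split into edge n-grams. The variable name `title_mapping` is hypothetical.

```python
# Hypothetical variant of the "title" mapping from the question's config.
title_mapping = {
    "title": {
        "type": "text",
        # Index time: edge n-grams, unchanged from the question.
        "analyzer": "nGram_analyzer",
        # Query time: whole words only (assumption: whitespace_analyzer suits queries).
        "search_analyzer": "whitespace_analyzer",
    }
}

# Merge into the existing config before creating the index, for example:
# config["mappings"]["properties"].update(title_mapping)
```

One difference from the keyword analyzer is worth keeping in mind: keyword passes the whole query string through as a single unchanged token, so it neither lowercases nor splits on spaces, whereas whitespace_analyzer does both.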