Elasticsearch: add synonyms at index time based on token content


What I want to do is index French words in several forms as synonyms. For example, l'ami should be indexed as-is, plus two synonyms, "lami" and "l ami", so the synonym graph for this word would look like this:

---l---ami--
|          |
---l'ami----
|          |
---lami-----

A conditional token filter could check whether a word contains an apostrophe (I will normalize all apostrophe variants beforehand with a character filter) and, if it does, apply a synonym filter or some other filter.
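
A minimal sketch of that normalization step, assuming a mapping character filter; the index name, filter name, and the list of apostrophe variants below are illustrative, not something prescribed here:

PUT /apostrophe_normalization_example
{
    "settings": {
        "analysis": {
            "char_filter": {
                "normalize_apostrophes": {
                    "type": "mapping",
                    "mappings": [
                        "\u2019 => '",
                        "\u02BC => '",
                        "` => '"
                    ]
                }
            },
            "analyzer": {
                "apostrophe_normalized_analyzer": {
                    "tokenizer": "whitespace",
                    "char_filter": [
                        "normalize_apostrophes"
                    ],
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    }
}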

Is there a way to dynamically add synonyms at index/query time, based on the condition that a certain character is found in the string?

elasticsearch lucene
1 Answer

Your solution is the multiplexer token filter: it runs each token through several filter chains and emits all the resulting tokens at the same position.

Mapping using the condition filter and the multiplexer:
PUT /dynamic_synonyms
{
    "settings": {
        "analysis": {
            "analyzer": {
                "dynamic_synonym_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "elision_detect_filter"
                    ]
                }
            },
            "filter": {
                "dynamic_synonym_filter": {
                    "type": "multiplexer",
                    "filters": [
                        "apostroph_remove_filter",
                        "lowercase",
                        "apostroph_space_replace_filter"
                    ]
                },
                "apostroph_space_replace_filter": {
                    "type": "pattern_replace",
                    "pattern": "'",
                    "replacement": " "
                },
                "apostroph_remove_filter": {
                    "type": "pattern_replace",
                    "pattern": "'",
                    "replacement": ""
                },
                "elision_detect_filter": {
                    "type": "condition",
                    "filter": [
                        "dynamic_synonym_filter"
                    ],
                    "script": {
                        "source": """token.term.toString().startsWith('l\'')"""
                    }
                }
            }
        }
    }
}

The lowercase filter inside dynamic_synonym_filter is effectively a no-op: the tokens have already been lowercased by the analyzer, so that branch simply re-emits the original token (l'ami).

Analyzing a string:

POST /dynamic_synonyms/_analyze
{
    "analyzer" : "dynamic_synonym_analyzer",
    "text" : "l'ami bon"
}

Response:

{
    "tokens" : [
        {
            "token" : "l'ami",
            "start_offset" : 0,
            "end_offset" : 5,
            "type" : "word",
            "position" : 0
        },
        {
            "token" : "lami",
            "start_offset" : 0,
            "end_offset" : 5,
            "type" : "word",
            "position" : 0
        },
        {
            "token" : "l ami",
            "start_offset" : 0,
            "end_offset" : 5,
            "type" : "word",
            "position" : 0
        },
        {
            "token" : "bon",
            "start_offset" : 6,
            "end_offset" : 9,
            "type" : "word",
            "position" : 1
        }
    ]
}
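
To use this at index and query time, the analyzer can be attached to a text field in the index mapping and queried with an ordinary match query. A rough sketch, assuming a field named content (not part of the original answer):

PUT /dynamic_synonyms/_mapping
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "dynamic_synonym_analyzer"
        }
    }
}

POST /dynamic_synonyms/_search
{
    "query": {
        "match": {
            "content": "lami"
        }
    }
}

A document indexed with "content": "l'ami" produces the tokens l'ami, lami, and l ami at the same position, so a query for either lami or l'ami will match it.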