MongoDB full-text search score: "What does the score mean?"

Question

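I'm working on a MongoDB project for my school. I have a collection of sentences, and I run a normal text search to find the most similar sentence in the collection, based on the score.

I run this query: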

db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

Look at these results when I query sentences:

"that kicking a dog causes it pain"
----Matched with
"that kicking a dog causes it pain – is not very controversial."
----Gives a result of:
*Score: 2.4*


"This sentence have nothing to do with any other"
----Matched with
"Who is the “He” in this sentence?"
----Gives a result of:
*Score: 1.0*

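What is the score? What does it mean? What if I want to show only results with a similarity of 70% or higher?

How can I interpret the score so that I can display a similarity percentage? I'm using C# for this, but don't worry about the implementation. I don't mind a pseudocode solution!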

Answer 1

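When you use a MongoDB text index, it generates a score for every matching document. This score indicates how well your search string matches the document. The higher the score, the more likely the document is similar to the search text. The score is calculated as follows: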

Step 1: Let the search text = S
Step 2: Break S into tokens (unless you are doing a phrase search), say T1, T2, ..., Tn. Apply stemming to each token.
Step 3: For every search token, calculate the score per indexed field of the text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where:
weight = user-defined weight for the field. Default is 1 when no weight is specified
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of tokens in the text that match the search token
numTokens = total number of tokens in the text
adjustment = 1 by default. If the search token is exactly equal to the document field, then adjustment = 1.1
Step 4: The final score of the document is the sum of all token scores per text index field
Total Score = score(T1) + score(T2) + ... + score(Tn)

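As we can see above, the score is influenced by the following factors:

The number of terms that match the actual search text: the more matches, the higher the score
The number of tokens in the document field
Whether the searched text matches the document field exactly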
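To make the first two factors concrete, here is a minimal Python sketch of the data.freq and coeff terms exactly as defined in the formula above. The function names are made up for illustration; this is not MongoDB's API.

```python
def term_frequency(occurrences: int) -> float:
    """data.freq: the i-th occurrence (counting from 0) adds 1 / 2**i."""
    return sum(1 / 2**i for i in range(occurrences))


def coefficient(count: int, num_tokens: int) -> float:
    """coeff = (0.5 * data.count / numTokens) + 0.5"""
    return 0.5 * count / num_tokens + 0.5


# Repeating a term in the field gives diminishing returns.
print([term_frequency(n) for n in (1, 2, 3)])

# A single match is worth less in a long field than in a short one.
print([coefficient(1, n) for n in (1, 5, 100)])
```

Under this formula, each extra occurrence of a term adds half as much as the previous one, and the coefficient for a single match falls from 1.0 toward 0.5 as the field gets longer.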

Here is the derivation for one of your documents:

Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?

Score Calculation:
Step 1: Tokenize the search string. Apply stemming and remove stop words.
    Token 1: "sentence"
    Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
        
      Step 3: Take Sample Document and Remove Stop Words
            Input Document:  Who is the “He” in this sentence?
            Document after stop word removal: "sentence"
      Step 4: Apply Stemming 
        Document in Step 3: "sentence"
        After Stemming : "sentence"
      Step 5: Calculate data.count per search token
              data.count(sentence) = 1
              data.count(nothing) = 0 (the token does not occur in the document)
      Step 6: Calculate the total number of tokens in the document
              numTokens = 1
      Step 7: Calculate the coefficient per search token
              coeff = (0.5 * data.count / numTokens) + 0.5
              coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
              coeff(nothing) = 0.5*(0/1) + 0.5 = 0.5
      Step 8: Calculate the adjustment per search token (the adjustment is 1 by default; it is 1.1 only if the search text exactly matches the raw document field)
              adjustment(sentence) = 1
              adjustment(nothing) = 1
      Step 9: Weight of the field (1 is the default weight)
              weight = 1
      Step 10: Calculate the frequency of the search token in the document (data.freq)
           For every ith occurrence (counting from 0), the data frequency = 1/(2^i). All occurrences are summed.
            a. data.freq(sentence) = 1/(2^0) = 1
            b. data.freq(nothing) = 0
      Step 11: Calculate the score per search token per field:
         score = (weight * data.freq * coeff * adjustment);
         score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
         score(nothing) = (1 * 0 * 0.5 * 1.0) = 0
Step 12: Add the individual scores of every search token to get the total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0

The other result can be derived in the same way.
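As a rough check, the formula can be sketched in Python and applied to both results from the question. This is an illustration of the steps described in this answer, not MongoDB's actual implementation. The token lists are hand-written assumptions standing in for MongoDB's stop-word removal and stemming (only which tokens match and how many there are affects the result), and text_score is a made-up name. Counting each search token once is also an assumption.

```python
from collections import Counter


def text_score(search_tokens, field_tokens, weight=1.0, raw_field=None):
    """Approximate the text score of one indexed field.

    Both token lists are assumed to be stemmed already and free of stop words.
    """
    num_tokens = len(field_tokens)
    if num_tokens == 0:
        return 0.0
    counts = Counter(field_tokens)
    total = 0.0
    # Assumption: each distinct search token is scored once.
    for term in dict.fromkeys(search_tokens):
        count = counts[term]
        if count == 0:
            continue  # a token absent from the field contributes nothing
        freq = sum(1 / 2**i for i in range(count))
        coeff = 0.5 * count / num_tokens + 0.5
        # Small boost when the raw field text is exactly the token.
        exact = raw_field is not None and raw_field.lower() == term.lower()
        adjustment = 1.1 if exact else 1.0
        total += weight * freq * coeff * adjustment
    return total


# "that kicking a dog causes it pain" against the longer sentence:
# 4 matching tokens, 5 tokens left in the document after stop-word removal.
first = text_score(
    ["kick", "dog", "caus", "pain"],
    ["kick", "dog", "caus", "pain", "controversi"],
)
print(round(first, 2))  # 2.4, the score reported in the question

# The derivation above: one matching token in a one-token document.
second = text_score(["sentence", "nothing"], ["sentence"])
print(round(second, 2))  # 1.0
```

In the first case every matching token gets coeff = 0.5 * 1/5 + 0.5 = 0.6, and four such tokens add up to 2.4.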

For a more detailed analysis of MongoDB scoring, see: Mongo Scoring Algorithm Blog

Answer 2

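Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.

The default weight of an indexed field is 1.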
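Weights are set when the text index is created. A minimal PyMongo-style sketch, assuming a collection with hypothetical title and body fields (the question does not say which fields are indexed) and an arbitrary weight of 10:

```python
def create_weighted_text_index(collection):
    """Create a text index where a match in `title` counts ten times as much as in `body`.

    `collection` is expected to be a pymongo Collection; the field names
    and the weight of 10 are illustrative assumptions.
    """
    return collection.create_index(
        [("title", "text"), ("body", "text")],
        weights={"title": 10, "body": 1},
    )


# Hypothetical usage, given a connected pymongo database object `db`:
# create_weighted_text_index(db["sentences"])
```

A field left out of the weights option keeps the default weight of 1.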

https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/

Answer 3

You can normalize the score to a range of 0 to 1 in later stages of an aggregation pipeline.

For example:

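# user_ids (a list of IDs) and keywords (the search string) are supplied by the caller.
pipeline = [
    # A $match stage that contains $text must be the first stage of the pipeline.
    {
        "$match": {
            "$and": [
                {"userId": {"$in": user_ids}},
                {
                    "$text": {
                        "$search": keywords,
                        "$caseSensitive": False,
                        "$diacriticSensitive": False,
                    },
                },
            ]
        }
    },
    # Expose the text-search relevance score as a regular field.
    {"$addFields": {"score": {"$meta": "textScore"}}},
    # With no partitionBy, the window spans all matched documents (requires MongoDB 5.0+).
    {"$setWindowFields": {"output": {"maxScore": {"$max": "$score"}}}},
    # Scale the scores so that the best hit gets 1.0.
    {"$addFields": {"normalizedScore": {"$divide": ["$score", "$maxScore"]}}},
    {"$match": {"normalizedScore": {"$gte": 0.7}}},
    {"$sort": {"normalizedScore": -1}},
]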

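In the example above, I needed similar functionality:

Create an aggregation pipeline that searches my collection and filters it by ID
Add a score field that holds each document's text-search score
Compute the highest score across all search results
Add a normalizedScore field to store the value normalized to the range 0 to 1
Finally, use the normalized score to filter and sort the results.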

I based this on the following MongoDB documentation: Normalize the Score
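To see what the normalization stages compute without a running server, here is a small pure-Python sketch of the same arithmetic. The documents and their scores are made-up illustrative values, not real query output, and normalize_and_filter is a hypothetical helper name.

```python
def normalize_and_filter(docs, threshold=0.7):
    """Divide every score by the highest score, keep documents at or above
    the threshold, and sort them from best to worst."""
    if not docs:
        return []
    max_score = max(d["score"] for d in docs)
    if max_score <= 0:
        return []
    normalized = [{**d, "normalizedScore": d["score"] / max_score} for d in docs]
    kept = [d for d in normalized if d["normalizedScore"] >= threshold]
    return sorted(kept, key=lambda d: d["normalizedScore"], reverse=True)


# Illustrative scores only.
hits = [
    {"_id": "a", "score": 1.0},
    {"_id": "b", "score": 2.4},
    {"_id": "c", "score": 1.8},
]

for doc in normalize_and_filter(hits):
    print(doc["_id"], round(doc["normalizedScore"], 2))
```

Note that the best hit always gets a normalized score of 1.0, so the 0.7 threshold means "at least 70% of the top score" rather than an absolute similarity to the search text.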
