Why does the Elasticsearch total score become inaccurate with large numbers?


If I have an index that looks like this:

[{:index=>
   {:_index=>"candidates",
    :_id=>"a1786607-e095-4621-bdf9-de2706475614",
    :data=>
     {:name=>"Carli Stark",
      :is_verified=>true, :has_work_permit=>true}}},
 {:index=>
   {:_index=>"candidates",
    :_id=>"57f78d3f-392e-4cdf-a5ff-6d10e7c89d5b",
    :data=>
     {:name=>"Gayla Keeling",
      :is_verified=>false, :has_work_permit=>true}}}]

I query with score_mode sum and boost_mode replace (because I only want my relevance scores to be taken into account):

GET candidates/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "filter": {
            "term": {
              "is_verified": true
            }
          },
          "weight": 1000
        },
        {
          "filter": {
            "term": {
              "has_work_permit": true
            }
          },
          "weight": 100000000000
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  },
  "_source": ["is_verified"],
  "size": 50
}

So why does Elasticsearch return exactly the same score for both documents? (Note also that the order is wrong.)

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 9.9999998E10,
    "hits" : [
      {
        "_index" : "candidates_production_20240828114811152",
        "_type" : "_doc",
        "_id" : "cbd1b70b-f889-4136-a43e-f6782955f58e",
        "_score" : 9.9999998E10,
        "_source" : {
          "is_verified" : false,
          "has_work_permit" : true
        }
      },
      {
        "_index" : "candidates_production_20240828114811152",
        "_type" : "_doc",
        "_id" : "d644a5e5-09e0-496e-8830-c1a772c46611",
        "_score" : 9.9999998E10,
        "_source" : {
          "is_verified" : true
          "has_work_permit" : true
        }
      }
    ]
  }
}

If I use a larger weight (e.g. 10000 instead of 1000), then the scores do differ, as expected:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.00000006E11,
    "hits" : [
      {
        "_index" : "candidates_production_20240828114811152",
        "_type" : "_doc",
        "_id" : "d644a5e5-09e0-496e-8830-c1a772c46611",
        "_score" : 1.00000006E11,
        "_source" : {
          "is_verified" : true,
          "has_work_permit" : true
        }
      },
      {
        "_index" : "candidates_production_20240828114811152",
        "_type" : "_doc",
        "_id" : "cbd1b70b-f889-4136-a43e-f6782955f58e",
        "_score" : 9.9999998E10,
        "_source" : {
          "is_verified" : false
          "has_work_permit" : true
        }
      }
    ]
  }
}

But how can I make this accurate? I need the smaller weights to be reflected in the score, no matter how large the score gets.

My Elasticsearch version is 7.10.1 (AWS ES).

elasticsearch

1 Answer

This is because scores are stored as floats, which can only represent integers exactly up to 2^24.
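You can reproduce the rounding from the responses above with a quick sketch (Ruby here, round-tripping a value through a native 32-bit float via pack/unpack):

# Round-trip a number through a single-precision (32-bit) float,
# the type Lucene uses to store scores.
to_f32 = ->(x) { [x].pack('f').unpack1('f') }

to_f32.call(100_000_000_000)           # => 99999997952.0  (9.9999998E10)
to_f32.call(100_000_000_000 + 1_000)   # => 99999997952.0  (the +1000 is rounded away)
to_f32.call(100_000_000_000 + 10_000)  # => 100000006144.0 (1.00000006E11)

At 1e11 the adjacent representable floats are 8192 apart, so any added weight below about 4096 disappears entirely when the sum is rounded.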

There is a related question here.

My solution was to apply a sqrt function to the weights, so that they become smaller numbers with a similar distribution.
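For example, here is a sketch of the same query with square-rooted weights substituted in (sqrt(1000) ≈ 32 and sqrt(100000000000) ≈ 316228; the rounding here is my own choice, not from the original answer):

GET candidates/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "filter": {
            "term": {
              "is_verified": true
            }
          },
          "weight": 32
        },
        {
          "filter": {
            "term": {
              "has_work_permit": true
            }
          },
          "weight": 316228
        }
      ],
      "score_mode": "sum",
      "boost_mode": "replace"
    }
  }
}

316228 + 32 = 316260, and at that magnitude adjacent floats are only 1/32 apart, so the smaller weight survives the sum and the two documents get distinct scores.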
