如何根据Elasticsearch中最近的和最高优先级的文档获取唯一的字段值?

问题描述 投票:0回答:1

我有一个 Elasticsearch 索引,用于存储具有以下字段的文档:

  • 时间戳(日期)
  • 优先级(整数)
  • user_id(字符串)

我需要从最新且优先级最高的文档中找到 5 个唯一的 user_id 值。如果我使用 SQL,我会通过以下方式解决这个问题:

SELECT DISTINCT user_id
FROM (
    SELECT user_id
    FROM documents
    ORDER BY timestamp DESC, priority DESC
) AS sorted_docs
LIMIT 5;
  1. 使用时间戳和优先级字段按降序对数据进行排序。
  2. 选择不同的user_id,从排序结果中选取前5个。

但是,我不确定如何在 Elasticsearch 中执行类似的操作。我已经研究过使用术语和复合聚合,但我不清楚:

  1. 在选择唯一的 user_id 之前,确保数据按时间戳和优先级排序。
  2. 根据以下条件将结果限制为前 5 个唯一的 user_id 值 这个排序。

有人可以指导我如何构建此查询或提供示例吗?任何帮助或参考将不胜感激!

sorting elasticsearch distinct aggregation
1个回答
0
投票

您尚未共享任何示例数据,但根据您提供的信息,我创建了示例数据并提供了示例查询。

您可以使用 Elasticsearch 的 Collapse 功能来获取 user_id 的不同值。

让我们考虑以下是您的索引映射:

PUT test
{
  "mappings": {
    "properties": {
      "user_id":{
        "type": "keyword"
      },
      "timestamp":{
        "type": "date"
      },
     "priority":{
        "type": "integer"
      }
    }
  }
}

以下是示例数据:

{ "index" : { "_index" : "test", "_id" : "1" } }
{ "user_id" : "101", "timestamp" : "2024-11-05T12:10:30Z","priority" : "1"}
{ "index" : { "_index" : "test", "_id" : "2" } }
{ "user_id" : "101", "timestamp" : "2024-11-05T16:10:30Z","priority" : "2"}
{ "index" : { "_index" : "test", "_id" : "3" } }
{ "user_id" : "102", "timestamp" : "2024-11-04T12:10:30Z","priority" : "3"}
{ "index" : { "_index" : "test", "_id" : "4" } }
{ "user_id" : "102", "timestamp" : "2024-11-05T08:10:30Z","priority" : "4"}
{ "index" : { "_index" : "test", "_id" : "5" } }
{ "user_id" : "103", "timestamp" : "2024-11-05T12:10:30Z","priority" : "5"}
{ "index" : { "_index" : "test", "_id" : "6" } }
{ "user_id" : "103", "timestamp" : "2024-11-03T12:10:30Z","priority" : "6"}
{ "index" : { "_index" : "test", "_id" : "7" } }
{ "user_id" : "104", "timestamp" : "2024-11-05T12:10:30Z","priority" : "7"}
{ "index" : { "_index" : "test", "_id" : "8" } }
{ "user_id" : "105", "timestamp" : "2024-11-04T12:10:30Z","priority" : "8"}
{ "index" : { "_index" : "test", "_id" : "9" } }
{ "user_id" : "105", "timestamp" : "2024-11-01T12:10:30Z","priority" : "9"}
{ "index" : { "_index" : "test", "_id" : "10" } }
{ "user_id" : "106", "timestamp" : "2024-11-02T12:10:30Z","priority" : "10"}

您可以使用下面的查询,它将按时间戳和优先级排序,并按 user_id 折叠,这将确保您获得所有唯一的用户 ID:

POST test/_search
{
  "size": 5, 
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    },
    {
      "priority": {
        "order": "desc"
      }
    }
  ], 
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "user_id"
  }
}

以下将回复上述询问:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "test",
        "_id": "2",
        "_score": null,
        "_source": {
          "user_id": "101",
          "timestamp": "2024-11-05T16:10:30Z",
          "priority": "2"
        },
        "fields": {
          "user_id": [
            "101"
          ]
        },
        "sort": [
          1730823030000,
          2
        ]
      },
      {
        "_index": "test",
        "_id": "7",
        "_score": null,
        "_source": {
          "user_id": "104",
          "timestamp": "2024-11-05T12:10:30Z",
          "priority": "7"
        },
        "fields": {
          "user_id": [
            "104"
          ]
        },
        "sort": [
          1730808630000,
          7
        ]
      },
      {
        "_index": "test",
        "_id": "5",
        "_score": null,
        "_source": {
          "user_id": "103",
          "timestamp": "2024-11-05T12:10:30Z",
          "priority": "5"
        },
        "fields": {
          "user_id": [
            "103"
          ]
        },
        "sort": [
          1730808630000,
          5
        ]
      },
      {
        "_index": "test",
        "_id": "4",
        "_score": null,
        "_source": {
          "user_id": "102",
          "timestamp": "2024-11-05T08:10:30Z",
          "priority": "4"
        },
        "fields": {
          "user_id": [
            "102"
          ]
        },
        "sort": [
          1730794230000,
          4
        ]
      },
      {
        "_index": "test",
        "_id": "8",
        "_score": null,
        "_source": {
          "user_id": "105",
          "timestamp": "2024-11-04T12:10:30Z",
          "priority": "8"
        },
        "fields": {
          "user_id": [
            "105"
          ]
        },
        "sort": [
          1730722230000,
          8
        ]
      }
    ]
  }
}

我认为 10 是最高优先级,因为您没有在问题中澄清。如果您希望 1 作为最高优先级,请将排序中的优先级顺序从

asc
更改为
desc

© www.soinside.com 2019 - 2024. All rights reserved.