Why does num_entities in a Milvus collection increase when existing entities are overwritten?

Question

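I thought num_entities would indicate the number of records (or whatever the correct term is) in a Milvus collection. However, I created one file, test_milvus.py, to create a simple collection, as follows: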

import numpy as np
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(alias='default',host='localhost', port='19530')

# Define the schema
schema = CollectionSchema([FieldSchema("id", DataType.INT64, is_primary=True, max_length=100),
                           FieldSchema("vector", DataType.FLOAT_VECTOR, dim=2)])

# Create a collection
collection = Collection("test_collection", schema)

# Insert data
data = [{"id":i, "vector": np.array([i, i],dtype=np.float32)} for i in range(10)]
collection.insert(data)

# Flush data
collection.flush()

# Disconnect from the server
connections.disconnect(alias='default')

Another file, milvus_info.py, retrieves information about the collections in the Milvus database, like this:

from pymilvus import Collection, connections, db, utility

def get_info(host: str = "localhost", port: str = "19530"):

    # Connect to Milvus (replace with your connection details)
    connections.connect(alias="default", host=host, port=port)

    # Print the list of databases and collections
    db_list = db.list_database()
    for db_name in db_list:
        print(f"Database: {db_name}")
        # Switch the connection to this database; `using` expects a connection alias, not a database name
        db.using_database(db_name)
        collection_list = utility.list_collections()
        if len(collection_list) == 0:
            print("  No collections")
        for collection_name in collection_list:
            print(f"  Collection: {collection_name}")
            temp_collection = Collection(name=collection_name)
            description = temp_collection.describe()  # fetch the description once, not once per key
            for key, value in description.items():
                print(f"    {key}: {value}")
            temp_collection.flush() #Note: Adding this line does not fix problem.
            print(f"   Number of entities: {temp_collection.num_entities}")
          
    # Disconnect from Milvus
    connections.disconnect(alias='default')


if __name__ == "__main__":
    get_info()

The first time I ran test_milvus.py and then milvus_info.py, I got the following output:

  $ python test_milvus.py 
  $ python milvus_info.py 
    Database: default
      Collection: test_collection
        collection_name: test_collection
        auto_id: False
        num_shards: 1
        description: 
        fields: [{'field_id': 100, 'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 2}}]
        aliases: []
        collection_id: 450687678279804785
        consistency_level: 2
        properties: {}
        num_partitions: 1
        enable_dynamic_field: False
       Number of entities: 10

This struck me as odd, since there are only 2 vectors in the database.

However, if I run test_milvus.py again, the number of entities increases to 20, even though no new vectors were added:

$ python test_milvus.py 
$ python milvus_info.py 
Database: default
  Collection: test_collection
    collection_name: test_collection
    auto_id: False
    num_shards: 1
    description: 
    fields: [{'field_id': 100, 'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 2}}]
    aliases: []
    collection_id: 450687678279804785
    consistency_level: 2
    properties: {}
    num_partitions: 1
    enable_dynamic_field: False
   Number of entities: 20

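This happens even though I am only trying to add records that already exist. I expected num_entities to be 10 no matter how many times I run these files. The documentation says it returns the number of rows, but I can inflate it arbitrarily while still having only 10 rows. Is num_entities supposed to track every row that has ever existed?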

Note: this also happens when I replace insert with upsert.

Answer 1

In Python, a quick way to get the number of entities is: print(collection.num_entities)

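However, this method is not accurate, because it only sums the row counts of persisted segments, which it reads quickly from etcd. Each time a segment is persisted, its basic information, including its row count, is recorded in etcd. collection.num_entities adds up the row counts of all persisted segments.

This number does not reflect deletions. Suppose a segment has 1000 rows and you call collection.delete() to delete 50 rows from it: collection.num_entities still shows 1000. Nor does collection.num_entities know which entities have been overwritten. Milvus storage is column-based, and all new data is appended to new segments. If you use upsert() to overwrite an existing entity, the new entity is likewise appended to a new segment and a delete operation is created at the same time; that delete is executed asynchronously.

The delete operation does not change the segment's original row count recorded in etcd, because we do not intend to update etcd frequently (a large number of update operations on etcd degrades the performance of the whole system). So collection.num_entities cannot tell which entities have been deleted, since the original number in etcd is never updated. In addition, collection.num_entities does not count non-persisted segments.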
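To make the bookkeeping described above concrete, here is a minimal pure-Python sketch. It is a toy model, not Milvus code: the class ToyCollection, its attributes, and the count_star method are all hypothetical names invented for illustration. It assumes only what the answer states: segments are append-only, a row count is recorded once per segment, and deletes (including the delete half of an upsert) are tracked separately without touching that recorded count.

```python
class ToyCollection:
    """Toy stand-in for a collection; NOT the Milvus implementation."""

    def __init__(self):
        self.segments = []         # each segment is a list of primary keys, append-only
        self.recorded_counts = []  # row count written once per segment (stand-in for etcd)
        self.tombstones = set()    # (segment_index, primary_key) pairs marked as deleted

    def insert(self, primary_keys):
        # New data always goes into a new segment; its row count is recorded once.
        rows = list(primary_keys)
        self.segments.append(rows)
        self.recorded_counts.append(len(rows))

    def delete(self, primary_keys):
        # Mark existing rows as deleted; the recorded counts are left untouched.
        wanted = set(primary_keys)
        for index, rows in enumerate(self.segments):
            for pk in rows:
                if pk in wanted:
                    self.tombstones.add((index, pk))

    def upsert(self, primary_keys):
        # Overwrite = delete the old copies, then append the new ones to a new segment.
        rows = list(primary_keys)
        self.delete(rows)
        self.insert(rows)

    @property
    def num_entities(self):
        # Fast path: just sum the recorded per-segment row counts.
        return sum(self.recorded_counts)

    def count_star(self):
        # Slow path: walk every segment and skip rows that are marked deleted.
        return sum(
            1
            for index, rows in enumerate(self.segments)
            for pk in rows
            if (index, pk) not in self.tombstones
        )


upserted = ToyCollection()
upserted.upsert(range(10))
upserted.upsert(range(10))  # same ten primary keys again
print(upserted.num_entities, upserted.count_star())  # 20 10

deleted = ToyCollection()
deleted.insert(range(1000))
deleted.delete(range(50))
print(deleted.num_entities, deleted.count_star())  # 1000 950
```

In this model the fast counter keeps growing with every upsert because it only ever adds recorded segment sizes, while the slow counter stays at the number of live rows. The model leaves out everything the answer does not describe, including non-persisted segments.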

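collection.query(expr="", output_fields=["count(*)"]) is a query request that is executed by the query nodes. It takes deleted items into account and covers all segments, including non-persisted ones. The trade-off is that collection.query() is slower than collection.num_entities.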

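If you have no delete or upsert operations that remove or overwrite existing entities in the collection, then checking collection.num_entities is a quick way to get the row count. Otherwise, you should use collection.query(expr="", output_fields=["count(*)"]) to get an accurate row count.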
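As a rough sketch of the advice above, the two counts can be read side by side. The helper names exact_row_count and compare_counts are hypothetical. The code assumes that the object passed in is a pymilvus Collection that has already been indexed and loaded (queries run only against loaded collections), that an empty filter expression is accepted for a count(*) query, and that the result is a one-element list holding a dict keyed by "count(*)".

```python
def exact_row_count(collection):
    """Return the row count reported by a count(*) query.

    Assumption: `collection` behaves like a loaded pymilvus Collection.
    """
    # An empty expression means "no filter"; count(*) is requested as the only output field.
    result = collection.query(expr="", output_fields=["count(*)"])
    return result[0]["count(*)"]


def compare_counts(collection):
    """Return both counts so that a mismatch is easy to spot."""
    return {
        "num_entities": collection.num_entities,  # fast, summed from persisted segment metadata
        "count(*)": exact_row_count(collection),  # slower, executed by the query nodes
    }
```

With a live connection, something like compare_counts(Collection("test_collection")) would be expected to show the two numbers diverging after repeated upserts, which is the situation described in the question.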
