Retrieving all entries in a Milvus vector database just to view them?


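I'm new to database management, and my project needs a vector database to store vector data. I chose Milvus for this. I plan to implement a delete feature so users can remove entries, but users may forget the name or ID of the entry they want to delete. To handle that, I also built a "list all" feature that lets users view every entry in the database.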

This is the structure of my database (as a dictionary):

DB = {
    "id": id,
    "vector": vector,
    "file_name": name,
}

Currently, I load all the names like this:

from pymilvus import MilvusClient

client = MilvusClient(r'/my.db')
output = client.query('milvus', filter="id >= 0", output_fields=["file_name"])

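This approach technically works, but it seems inefficient: loading everything each time a user wants to view the entries does not feel scalable. I'm worried that as the dataset grows it could cause performance problems or even crash the server.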

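So, my questions are:

  1. Is this approach sound and scalable?
  2. Is there a better way to retrieve all file_name entries from Milvus without loading everything?

Any insight into an efficient way to handle this in Milvus would be greatly appreciated.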

python database milvus
1 Answer

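If you are looking for a way to list all ids in a Milvus collection, see "Milvus 2 - get list of ids in a collection".

Milvus has a query() method that fetches entities from a collection. Suppose there is a collection named aaa with a field named id, and every id value is greater than or equal to 0: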

collection = Collection("aaa")
result = collection.query(expr="id >= 0")
print(result)

The result is a list with one entry per entity, so every id in the collection appears in it. A complete script:

import random
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)
from random import choice
from string import ascii_uppercase



print("start connecting to Milvus")
connections.connect("default", host="localhost", port="19530")

collection_name = "aaa"
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=100)  # VARCHAR takes max_length, not dim
]

schema = CollectionSchema(fields, "aaa")

print("Create collection", collection_name)
collection = Collection(collection_name, schema)

print("Start inserting entities")
num_entities = 10000
for k in range(50):
    print('No.', k)
    entities = [
        # [i for i in range(num_entities)], # duplicate id, the query will get 10000 ids
        [i + num_entities*k for i in range(num_entities)],  # unique id, the query will get 500000 ids
        [[random.random() for _ in range(128)] for _ in range(num_entities)],
        [''.join(choice(ascii_uppercase) for _ in range(100)) for _ in range(num_entities)],  # one string per row
    ]
    insert_result = collection.insert(entities)

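collection.flush()  # seal the inserted data so num_entities is accurate
print(f"Number of entities: {collection.num_entities}")

# Recent Milvus releases refuse to load a collection whose vector field has no index
collection.create_index(
    "vector",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
print("Start loading")
collection.load()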

result = collection.query(expr="id >= 0")
print("query result count:", len(result))
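
Pulling 500,000 rows in one call is exactly the scaling worry raised in the question. Below is a minimal sketch of serving the list one page at a time instead. It assumes the question's schema (id, file_name) and a client that behaves like MilvusClient, whose query() accepts limit and offset. The helper name fetch_file_name_page is hypothetical. Milvus documents an upper bound on offset + limit (16,384 in the releases I have seen), so treat this as a sketch for modest page depths, not a complete solution.

```python
from typing import Any, List, Tuple


def fetch_file_name_page(
    client: Any,
    collection_name: str,
    page: int,
    page_size: int = 100,
) -> List[Tuple[int, str]]:
    """Return one page of (id, file_name) pairs.

    `client` is assumed to expose query() like pymilvus.MilvusClient.
    Only scalar fields are requested, so no vectors are transferred.
    """
    rows = client.query(
        collection_name,
        filter="id >= 0",
        output_fields=["id", "file_name"],
        limit=page_size,
        offset=page * page_size,
    )
    return [(row["id"], row["file_name"]) for row in rows]


# Usage (assumes a real client and the question's collection name):
#   client = MilvusClient(r'/my.db')
#   first_page = fetch_file_name_page(client, "milvus", page=0)
```

The UI then asks for page 0, 1, 2, ... as the user scrolls, so each request moves at most page_size rows no matter how large the collection gets.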

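But if you are looking for a way to retrieve the embeddings from a Milvus collection, see "Retrieve data from a Milvus collection".

In Milvus, search() performs ANN search and query() retrieves data. Because Milvus is optimized for ANN search, it loads the index data into memory while the raw embedding data stays on disk. Retrieving embeddings is therefore a heavy operation and not fast. The following script is a simple example of using query():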

import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

_HOST = '127.0.0.1'
_PORT = '19530'

if __name__ == '__main__':
    connections.connect(host=_HOST, port=_PORT)

    collection_name = "demo"
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    # create a collection with these fields: id, tag and vector
    dim = 8
    field1 = FieldSchema(name="id_field", dtype=DataType.INT64, is_primary=True)
    field2 = FieldSchema(name="tag_field", dtype=DataType.VARCHAR, max_length=64)
    field3 = FieldSchema(name="vector_field", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[field1, field2, field3])
    collection = Collection(name="demo", schema=schema)
    print("collection created")

    # each vector field must have an index
    index_param = {
        "index_type": "HNSW",
        "params": {"M": 48, "efConstruction": 500},
        "metric_type": "L2"}
    collection.create_index("vector_field", index_param)

    # insert 1000 rows, each row has an id , tag and a vector
    count = 1000
    data = [
        [i for i in range(count)],
        [f"tag_{i%100}" for i in range(count)],
        [[random.random() for _ in range(dim)] for _ in range(count)],
    ]
    collection.insert(data)
    print(f"insert {count} rows")

    # must load the collection before any search or query operations
    collection.load()

    # method to retrieve vectors from the collection by filter expression
    def retrieve(expr: str):
        print("===============================================")
        result = collection.query(expr=expr, output_fields=["id_field", "tag_field", "vector_field"])
        print("query result with expression:", expr)
        for hit in result:
            print(f"id: {hit['id_field']}, tag: {hit['tag_field']}, vector: {hit['vector_field']}")

    # get items whose id = 10 or 50
    retrieve("id_field in [10, 50]")

    # get items whose id <= 3
    retrieve("id_field <= 3")

    # get items whose tag = "tag_25"
    retrieve("tag_field in [\"tag_25\"]")

    # drop the collection
    collection.drop()

Output of the script:

collection created
insert 1000 rows
===============================================
query result with expression: id_field in [10, 50]
id: 10, tag: tag_10, vector: [0.053770524, 0.83849007, 0.04007046, 0.16028273, 0.2640955, 0.5588169, 0.93378043, 0.031373363]
id: 50, tag: tag_50, vector: [0.082208894, 0.09554817, 0.8288978, 0.984166, 0.0028912988, 0.18656737, 0.26864904, 0.20859942]
===============================================
query result with expression: id_field <= 3
id: 0, tag: tag_0, vector: [0.60005647, 0.5609647, 0.36438486, 0.10851263, 0.65043026, 0.82504696, 0.8862855, 0.79214275]
id: 1, tag: tag_1, vector: [0.3711398, 0.0068489416, 0.004352187, 0.36848867, 0.9881858, 0.9160333, 0.5137728, 0.16045558]
id: 2, tag: tag_2, vector: [0.10995998, 0.24792045, 0.75946856, 0.6824144, 0.5848432, 0.10871549, 0.81346315, 0.5030568]
id: 3, tag: tag_3, vector: [0.38349515, 0.9714319, 0.81812894, 0.387037, 0.8180231, 0.030460497, 0.411488, 0.5743198]
===============================================
query result with expression: tag_field in ["tag_25"]
id: 25, tag: tag_25, vector: [0.8417967, 0.07186894, 0.64750504, 0.5146622, 0.68041337, 0.80861133, 0.6490419, 0.013803678]
id: 125, tag: tag_25, vector: [0.41458654, 0.13030894, 0.21482174, 0.062191084, 0.86997706, 0.4915581, 0.0478688, 0.59728557]
id: 225, tag: tag_25, vector: [0.4143869, 0.26847556, 0.14965168, 0.9563254, 0.7308634, 0.5715891, 0.37524575, 0.19693129]
id: 325, tag: tag_25, vector: [0.07538631, 0.2896633, 0.8130047, 0.9486398, 0.35597774, 0.41200536, 0.76178575, 0.63848394]
id: 425, tag: tag_25, vector: [0.3203018, 0.8246632, 0.28427872, 0.3969012, 0.94882655, 0.7670139, 0.43087512, 0.36356103]
id: 525, tag: tag_25, vector: [0.52027494, 0.2197635, 0.14136001, 0.081981435, 0.10024931, 0.40981093, 0.92328817, 0.32509744]
id: 625, tag: tag_25, vector: [0.2729753, 0.85121, 0.028014379, 0.32854447, 0.5946417, 0.2831049, 0.6444559, 0.57294136]
id: 725, tag: tag_25, vector: [0.98359156, 0.90887356, 0.26763296, 0.33788496, 0.9277225, 0.4743232, 0.5850919, 0.5116082]
id: 825, tag: tag_25, vector: [0.90271956, 0.31777886, 0.8150854, 0.37264413, 0.756029, 0.75934476, 0.07602229, 0.21065433]
id: 925, tag: tag_25, vector: [0.009773289, 0.352051, 0.8339834, 0.4277803, 0.53999937, 0.2620487, 0.4906858, 0.77002776]

Process finished with exit code 0
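
Because returning vectors is the expensive part, a listing feature can leave the vector field out of output_fields and stream the remaining fields in batches. The sketch below assumes pymilvus 2.3 or later, where Collection exposes query_iterator(). iter_rows is a hypothetical helper name, and the field names are the ones from the demo script above.

```python
from typing import Any, Dict, Iterator, List


def iter_rows(
    collection: Any,
    output_fields: List[str],
    expr: str = "id_field >= 0",
    batch_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield rows one at a time while fetching them in batches.

    `collection` is assumed to behave like pymilvus.Collection and to
    provide query_iterator(); the iterator's next() returns an empty
    list once all rows have been consumed.
    """
    iterator = collection.query_iterator(
        batch_size=batch_size,
        expr=expr,
        output_fields=output_fields,
    )
    try:
        while True:
            batch = iterator.next()
            if not batch:
                break
            yield from batch
    finally:
        iterator.close()  # release the iterator even if the caller stops early


# Usage (run before collection.drop() in the demo script):
#   for row in iter_rows(collection, ["id_field", "tag_field"]):
#       print(row["id_field"], row["tag_field"])
```

Only batch_size rows are held in memory at a time, and no embeddings are read unless vector_field is added to output_fields.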
