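I am new to database administration, and my project needs a vector database to store vector data. I chose Milvus for this. I plan to implement a delete feature so that users can remove entries, but users may forget the name or ID of the entry they want to delete. To address this, I also built a "list all" feature that lets users see every entry in the database.
This is the structure of my database (as a dictionary):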
DB = {
    "id": id,
    "vector": vector,
    "file_name": name,
}
Currently, I load all the names like this:
client = MilvusClient(r'/my.db')
output = client.query('milvus', filter="id >= 0", output_fields=["file_name"])
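This approach technically works, but it seems inefficient: loading everything each time a user wants to view the entries does not feel scalable. I am worried that, as the dataset grows, it could cause performance problems or even crash the server.
So, my question is:
Any insight into efficient ways of handling this in Milvus would be greatly appreciated.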
If you are looking for "Is there any other way to list all the ids in a collection in Milvus?", see: Milvus 2 - get list of ids in a collection
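Milvus has a method, query(), that fetches entities from a collection.
Suppose there is a collection named aaa with a field named id, and assume all id values are greater than or equal to 0.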
collection = Collection("aaa")
result = collection.query(expr="id >= 0")
print(result)
The result is a list, and you will see that all the ids are in it.
import random
from random import choice
from string import ascii_uppercase

from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

print("start connecting to Milvus")
connections.connect("default", host="localhost", port="19530")

collection_name = "aaa"
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),
    # VARCHAR fields take max_length, not dim
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=100),
]
schema = CollectionSchema(fields, "aaa")

print("Create collection", collection_name)
collection = Collection(collection_name, schema)

print("Start inserting entities")
num_entities = 10000
for k in range(50):
    print('No.', k)
    entities = [
        # [i for i in range(num_entities)],  # duplicate id, the query will get 10000 ids
        [i + num_entities*k for i in range(num_entities)],  # unique id, the query will get 500000 ids
        [[random.random() for _ in range(128)] for _ in range(num_entities)],
        # one 100-character string per entity (a flat list of str, not a list of lists)
        [''.join(choice(ascii_uppercase) for _ in range(100)) for _ in range(num_entities)],
    ]
    insert_result = collection.insert(entities)

# flush so that num_entities reflects the rows inserted above
collection.flush()
print(f"Number of entities: {collection.num_entities}")

# recent Milvus versions refuse to load a collection whose vector field has no index;
# the index parameters here are only an example
collection.create_index(
    "vector",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)

print("Start loading")
collection.load()

# fetch every entity whose id is >= 0, i.e. all of them
result = collection.query(expr="id >= 0")
print("query result count:", len(result))
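For the "list all" feature in the question, returning every row in a single call gets more expensive as the collection grows. Below is a minimal sketch of paging through the rows instead. It assumes that query() accepts limit and offset keyword arguments, as recent pymilvus releases do. Milvus also puts an upper bound on offset + limit, so very deep paging may need another approach, such as the query iterator in newer pymilvus versions. The names iter_pages and fake_query are hypothetical; fake_query only imitates the shape of a query call so that the sketch runs without a server.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


def iter_pages(query_fn: Callable[..., List[Row]], page_size: int = 1000) -> Iterator[List[Row]]:
    """Yield successive pages by calling query_fn(limit=..., offset=...)."""
    offset = 0
    while True:
        page = query_fn(limit=page_size, offset=offset)
        if not page:
            return
        yield page
        if len(page) < page_size:
            # a short page means there is nothing left to fetch
            return
        offset += len(page)


# Stand-in data and query function (hypothetical, no Milvus involved).
_ROWS: List[Row] = [{"id": i, "file_name": f"file_{i}.txt"} for i in range(25)]


def fake_query(limit: int, offset: int) -> List[Row]:
    return _ROWS[offset:offset + limit]


pages = list(iter_pages(fake_query, page_size=10))
print([len(p) for p in pages])

# With a real collection, query_fn could be built like this (assumption, untested here):
#   from functools import partial
#   query_fn = partial(collection.query, expr="id >= 0", output_fields=["file_name"])
```

The user interface can then show one page at a time and only ask for the next page when the user scrolls or clicks "next", so the server never has to return the whole collection at once.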
But if you are looking for "Is there any way to retrieve these embeddings from a Milvus collection?", see: Retrieve data from Milvus collection
In Milvus, search() performs an ANN search, while query() retrieves data.
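Because Milvus is optimized for ANN search, it loads the index data into memory but keeps the original embedding data on disk. Retrieving embeddings is therefore a heavy operation, and it is not fast.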
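Since fetching the raw embeddings is the expensive part, a listing feature like the one in the question can ask only for the scalar fields it needs. The sketch below assumes the field names id and file_name from the question and a collection object exposing query(expr=..., output_fields=..., limit=...). The helper list_file_names and the StubCollection class are hypothetical; the stub only stands in for a real collection so the example runs without a server.

```python
from typing import Any, Dict, List


def list_file_names(collection: Any, limit: int = 100) -> List[str]:
    """Return up to `limit` file names without requesting the vector field."""
    rows = collection.query(
        expr="id >= 0",
        output_fields=["file_name"],  # the vector field is deliberately left out
        limit=limit,
    )
    return [row["file_name"] for row in rows]


class StubCollection:
    """Hypothetical stand-in that mimics the shape of a query() call."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows
        self.last_call: Dict[str, Any] = {}

    def query(self, expr, output_fields=None, limit=None, **kwargs):
        # remember how we were called so the example can inspect it
        self.last_call = {"expr": expr, "output_fields": output_fields, "limit": limit}
        rows = self._rows if limit is None else self._rows[:limit]
        wanted = output_fields or []
        return [{"id": r["id"], **{f: r[f] for f in wanted}} for r in rows]


stub = StubCollection(
    [{"id": i, "file_name": f"doc_{i}.pdf", "vector": [0.0] * 4} for i in range(5)]
)
names = list_file_names(stub, limit=3)
print(names)
```

Only the names come back, and the vectors are never asked for, so a real server should not have to read them from disk for this call.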
The following script is a simple example of how to use query():
import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

_HOST = '127.0.0.1'
_PORT = '19530'

if __name__ == '__main__':
    connections.connect(host=_HOST, port=_PORT)

    collection_name = "demo"
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    # create a collection with these fields: id, tag and vector
    dim = 8
    field1 = FieldSchema(name="id_field", dtype=DataType.INT64, is_primary=True)
    field2 = FieldSchema(name="tag_field", dtype=DataType.VARCHAR, max_length=64)
    field3 = FieldSchema(name="vector_field", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[field1, field2, field3])
    collection = Collection(name="demo", schema=schema)
    print("collection created")

    # each vector field must have an index
    index_param = {
        "index_type": "HNSW",
        "params": {"M": 48, "efConstruction": 500},
        "metric_type": "L2"}
    collection.create_index("vector_field", index_param)

    # insert 1000 rows, each row has an id, a tag and a vector
    count = 1000
    data = [
        [i for i in range(count)],
        [f"tag_{i%100}" for i in range(count)],
        [[random.random() for _ in range(dim)] for _ in range(count)],
    ]
    collection.insert(data)
    print(f"insert {count} rows")

    # must load the collection before any search or query operations
    collection.load()

    # method to retrieve vectors from the collection by filter expression
    def retrieve(expr: str):
        print("===============================================")
        result = collection.query(expr=expr, output_fields=["id_field", "tag_field", "vector_field"])
        print("query result with expression:", expr)
        for hit in result:
            print(f"id: {hit['id_field']}, tag: {hit['tag_field']}, vector: {hit['vector_field']}")

    # get items whose id = 10 or 50
    retrieve("id_field in [10, 50]")

    # get items whose id <= 3
    retrieve("id_field <= 3")

    # get items whose tag = "tag_25"
    retrieve("tag_field in [\"tag_25\"]")

    # drop the collection
    collection.drop()
Output of the script:
collection created
insert 1000 rows
===============================================
query result with expression: id_field in [10, 50]
id: 10, tag: tag_10, vector: [0.053770524, 0.83849007, 0.04007046, 0.16028273, 0.2640955, 0.5588169, 0.93378043, 0.031373363]
id: 50, tag: tag_50, vector: [0.082208894, 0.09554817, 0.8288978, 0.984166, 0.0028912988, 0.18656737, 0.26864904, 0.20859942]
===============================================
query result with expression: id_field <= 3
id: 0, tag: tag_0, vector: [0.60005647, 0.5609647, 0.36438486, 0.10851263, 0.65043026, 0.82504696, 0.8862855, 0.79214275]
id: 1, tag: tag_1, vector: [0.3711398, 0.0068489416, 0.004352187, 0.36848867, 0.9881858, 0.9160333, 0.5137728, 0.16045558]
id: 2, tag: tag_2, vector: [0.10995998, 0.24792045, 0.75946856, 0.6824144, 0.5848432, 0.10871549, 0.81346315, 0.5030568]
id: 3, tag: tag_3, vector: [0.38349515, 0.9714319, 0.81812894, 0.387037, 0.8180231, 0.030460497, 0.411488, 0.5743198]
===============================================
query result with expression: tag_field in ["tag_25"]
id: 25, tag: tag_25, vector: [0.8417967, 0.07186894, 0.64750504, 0.5146622, 0.68041337, 0.80861133, 0.6490419, 0.013803678]
id: 125, tag: tag_25, vector: [0.41458654, 0.13030894, 0.21482174, 0.062191084, 0.86997706, 0.4915581, 0.0478688, 0.59728557]
id: 225, tag: tag_25, vector: [0.4143869, 0.26847556, 0.14965168, 0.9563254, 0.7308634, 0.5715891, 0.37524575, 0.19693129]
id: 325, tag: tag_25, vector: [0.07538631, 0.2896633, 0.8130047, 0.9486398, 0.35597774, 0.41200536, 0.76178575, 0.63848394]
id: 425, tag: tag_25, vector: [0.3203018, 0.8246632, 0.28427872, 0.3969012, 0.94882655, 0.7670139, 0.43087512, 0.36356103]
id: 525, tag: tag_25, vector: [0.52027494, 0.2197635, 0.14136001, 0.081981435, 0.10024931, 0.40981093, 0.92328817, 0.32509744]
id: 625, tag: tag_25, vector: [0.2729753, 0.85121, 0.028014379, 0.32854447, 0.5946417, 0.2831049, 0.6444559, 0.57294136]
id: 725, tag: tag_25, vector: [0.98359156, 0.90887356, 0.26763296, 0.33788496, 0.9277225, 0.4743232, 0.5850919, 0.5116082]
id: 825, tag: tag_25, vector: [0.90271956, 0.31777886, 0.8150854, 0.37264413, 0.756029, 0.75934476, 0.07602229, 0.21065433]
id: 925, tag: tag_25, vector: [0.009773289, 0.352051, 0.8339834, 0.4277803, 0.53999937, 0.2620487, 0.4906858, 0.77002776]
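The filter expressions in the script above are written by hand. When the ids or tags come from user input, it may be easier to build the "field in [...]" form with a small helper. The sketch below is an illustration under assumptions: in_expr is a hypothetical name, and it relies on JSON-style double-quoted strings being acceptable as string literals in Milvus expressions, which matches the tag_field in ["tag_25"] expression used in the script.

```python
import json
from typing import Iterable, Union


def in_expr(field: str, values: Iterable[Union[int, str]]) -> str:
    """Build a 'field in [...]' filter expression from ints or strings."""
    items = list(values)
    if not items:
        raise ValueError("values must not be empty")
    for v in items:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(v, bool) or not isinstance(v, (int, str)):
            raise TypeError(f"unsupported value type: {type(v).__name__}")
    # json.dumps double-quotes strings and escapes embedded quotes
    return f"{field} in {json.dumps(items, ensure_ascii=False)}"


print(in_expr("id_field", [10, 50]))
print(in_expr("tag_field", ["tag_25"]))
```

The two calls reproduce the expressions passed to retrieve() in the script. An expression of the first form is also the kind of filter one would hand to a delete call once the user has picked ids from the list, though which delete expressions are accepted depends on the Milvus version.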