Unable to bring my own vectors into Weaviate: it defaults to the sentence transformer specified in the yml used to create the local server


The yml I use to create my local server:

version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    restart: on-failure:0
    ports:
    - 8080:8080
    - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0

I created a collection:

client.collections.create(name = "legal_sections", 
                          properties = [wvc.config.Property(name = "content",
                                                           description = "The actual section chunk that the answer is to be extracted from",
                                                           data_type = wvc.config.DataType.TEXT,
                                                           index_searchable = True,
                                                           index_filterable = True,
                                                           skip_vectorization = True,
                                                           vectorize_property_name = False)])

I build the data to upload, then upload it:

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = vector
    ))

client.collections.get("Legal_sections").data.insert_many(upserts)

My custom vectors have length 1024:

upserts[0].vector.shape
output:
(1024,)

I grab an arbitrary uuid from the collection:

coll = client.collections.get("legal_sections")

for i in coll.iterator():
    print(i.uuid)
    break
output:
386be699-71de-4bad-9022-31173b9df8d2

I check the length of the vector stored for the object with this uuid:

coll.query.fetch_object_by_id('386be699-71de-4bad-9022-31173b9df8d2', include_vector=True).vector['default'].__len__()
output:
384

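This should be 1024, but 384 is the output size of the multi-qa-MiniLM-L6-cos-v1 model configured in the yml. What am I doing wrong?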

weaviate vector-database
1 Answer

This is most likely a bug in Weaviate (someone from Weaviate can confirm). Each element of the embedding model's output has dtype np.float32.

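This causes two problems:

  1. collections.data.insert raises an error because float32 is not JSON serializable
  2. collections.data.insert_many simply suppresses this error and encodes the text with the model given in the yml used to create the server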
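
The serialization failure in point 1 can be reproduced without Weaviate. This is a minimal sketch, assuming NumPy is installed; the three-element vector and the `is_json_serializable` helper are illustrative only. It shows how Python's json module treats np.float32 values and says nothing about what the Weaviate client does internally.

```python
import json

import numpy as np

# Stand-in for one embedding; real model output would be much longer.
vector = np.array([0.1, 0.2, 0.3], dtype=np.float32)


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False on TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# list(vector) keeps np.float32 scalars, which the json module rejects.
raw_ok = is_json_serializable(list(vector))

# tolist() converts every element to a built-in Python float.
plain_ok = is_json_serializable(vector.tolist())

print(raw_ok, plain_ok)  # False True
```

Either `vector.tolist()` or a list comprehension with `float(i)` yields built-in floats.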

The code above works fine if I convert the embeddings with

vector = [float(i) for i in vector]

That is:

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = vector
    ))

becomes

upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content':content
        },
        vector = [float(i) for i in vector]
    ))
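
To catch this kind of mismatch before uploading, the conversion can be wrapped in a small guard that also checks the dimension. This is only a sketch: `to_plain_vector` and `expected_dim` are hypothetical names, not part of the Weaviate client.

```python
from typing import Iterable, List, Optional


def to_plain_vector(vector: Iterable, expected_dim: Optional[int] = None) -> List[float]:
    """Convert an iterable of numbers into a list of built-in floats.

    Hypothetical helper for illustration; not part of the Weaviate client.
    """
    plain = [float(x) for x in vector]
    # Optionally verify the length so a wrong-sized vector fails early.
    if expected_dim is not None and len(plain) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(plain)}")
    return plain


# Small stand-in input; a real embedding would be passed the same way.
example = to_plain_vector([1, 2.5, 3], expected_dim=3)
print(example)  # [1.0, 2.5, 3.0]
```

In the loop above, `vector = to_plain_vector(vector, expected_dim=1024)` could take the place of the list comprehension.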