Why are my dimensions different when using OpenAI embeddings in Python?

Question

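I have a Python function that I use to embed JSON objects of varying lengths. The problem is that, somehow, the vectors have different sizes when I compare them, and I don't know why. First, here is my embedding function: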

def get_embeddings(json_object: json) -> list:
    json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
    json_docs = json_splitter.split_json(json_object, True)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=3072)
    total_embeddings = []
    for json_doc in json_docs:
        vector_results = embeddings.embed_query(json.dumps(json_doc))
        if vector_results is not None:
            for vector in vector_results:
                total_embeddings.append(vector)
    return total_embeddings

Then I save these embeddings in the JSON object with the following call:

json_object["embeddings"] = get_embeddings(input_json)

I wrote a similarity method using numpy, as follows:

def get_similarity_score(vector_set1, vector_set2) -> float:
    # Convert the vector sets to numpy arrays
    vector_set1 = np.array(vector_set1)
    vector_set2 = np.array(vector_set2)

    # Calculate the cosine similarity between the two vector sets
    dot_product = np.dot(vector_set1, vector_set2)
    norm1 = np.linalg.norm(vector_set1)
    norm2 = np.linalg.norm(vector_set2)
    similarity_score = dot_product / (norm1 * norm2)

    # Map the similarity score to a range of 0 to 10
    similarity_score = (similarity_score + 1) * 5

    # Round the similarity score to two decimal places
    similarity_score = round(similarity_score, 2)

    return similarity_score

I call the method like this:

this_score = get_similarity_score(json_object1["embeddings"], json_object2["embeddings"])

This gives me the error:

ValueError: shapes (30720,) and (21504,) not aligned: 30720 (dim 0) != 21504 (dim 0)

My JSON objects are long and complex, so I tried creating my own JSON that is simpler but follows the pattern list[dict[str, dict]]. That didn't work.

I have tried vector stores such as ChromaDB and Weaviate, but the problem persists.

I'm fairly sure I'm messing up the embeddings somehow, and that this is causing the size difference, but I don't know how to fix it.

Does anyone have any ideas?

Thanks!

Here is a link to the list of topics:

https://www.dropbox.com/scl/fi/6bcsu1t10o8zj1f8mz4y5/Topics.txt?rlkey=xfznwo7pwtrwixcs2cnwcqx1b&st=hwvhptnq&dl=0

I first run each one through the embedding function, and then run the get_similarity function.

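I tried np.reshape but got an error saying the array could not be resized. This article, "cannot reshape array of size into shape", explains the error and why reshaping is not an option. I think the vector array in get_embeddings is causing the problem, which means I need to somehow coerce it into a uniform array. Any ideas? Thank you!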

Answer 1

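It turns out the answer was very simple: I just used NumPy's resize. First I created all the embeddings, then I went through them and found the average size, and used resize to bring every embedding to that size.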
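
The answer does not show its code, so the following is only a minimal sketch of what that step might look like. The helper name resize_to_common_length and the toy inputs are hypothetical; np.resize and np.mean are the only library calls used.

```python
import numpy as np

def resize_to_common_length(flat_embeddings):
    """Resize every flattened embedding to the mean length of the set."""
    # Target length: the average number of values per embedding, truncated to an int
    target = int(np.mean([len(e) for e in flat_embeddings]))
    # np.resize cuts longer arrays short and pads shorter ones by repeating
    # their values from the start
    return [np.resize(np.asarray(e, dtype=float), target) for e in flat_embeddings]

# Toy stand-ins for two flattened embeddings of different lengths
a = list(range(6))
b = list(range(4))
resized = resize_to_common_length([a, b])
print([r.shape for r in resized])
```

Because np.resize drops values from longer arrays and repeats values in shorter ones, the shapes line up for np.dot, but the resized vectors are no longer exactly what the model returned, so the resulting scores are probably best treated as approximate.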

Problem solved!
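
As a side note on where the sizes come from: the two lengths in the error are exact multiples of the requested dimension (30720 = 10 x 3072 and 21504 = 7 x 3072). That is consistent with get_embeddings appending every float of every chunk vector to one flat list, so the length depends on how many chunks the splitter produced. The sketch below is an alternative to resizing, not the method used in this answer: it reshapes each flat list back into one row per chunk and averages the rows. The helper name pool_flat_embedding and the random test data are assumptions for illustration.

```python
import numpy as np

DIM = 3072  # dimensions requested in the question's OpenAIEmbeddings call

def pool_flat_embedding(flat, dim=DIM):
    """Average the per-chunk vectors packed into one flat list."""
    arr = np.asarray(flat, dtype=float)
    if arr.size == 0 or arr.size % dim != 0:
        raise ValueError(f"length {arr.size} is not a positive multiple of {dim}")
    # One row per chunk, then average over the chunks -> shape (dim,)
    return arr.reshape(-1, dim).mean(axis=0)

# Random stand-ins with the two lengths from the error message
rng = np.random.default_rng(0)
doc1 = rng.normal(size=30720)  # 10 chunks of 3072 values
doc2 = rng.normal(size=21504)  # 7 chunks of 3072 values
v1, v2 = pool_flat_embedding(doc1), pool_flat_embedding(doc2)
cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(v1.shape, v2.shape)
```

Both pooled vectors have 3072 entries regardless of chunk count, so they can be passed to get_similarity_score as-is. Averaging chunks is only one possible choice; whether it preserves enough detail for a given set of JSON documents would need to be checked on real data.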
