Problem storing and loading an index with Timescale Vector in LlamaIndex


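I am currently using the llama_index Python package with the llama-index-vector-stores-timescalevector extension to manage my vectors in Timescale. However, I cannot store the index for later use, so I have to recreate it every time I run my code. This is very inefficient and not suitable for my use case.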

I followed this tutorial: TimescaleVector example, but it does not mention how to store the index and load it later.

Here is a snippet of my setup. The CSV is available at this link.

pip install llama_index llama-index-vector-stores-postgres llama-index-embeddings-openai llama-index-vector-stores-timescalevector

import llama_index
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
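from llama_index.vector_stores.timescalevector import TimescaleVectorStore
from timescale_vector import client as timescale_client  # provides uuid_from_time, used in create_uuid2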
from llama_index.embeddings.openai import OpenAIEmbedding
import pandas as pd
import os
import time
from datetime import datetime, timedelta

# API keys and paths hidden for security
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key'
os.environ["TIMESCALE_SERVICE_URL"] = 'your_timescale_service_url'

# Load and process data
reuters = pd.read_csv('your_file_path')
reuters.columns = ["title", "date", "description"]

# Function to take in a date string in the past and return a uuid v1
def create_uuid2(date_string: str):
    if date_string is None:
        return None
    time_format = '%b %d %Y'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = timescale_client.uuid_from_time(datetime_obj)
    return str(uuid)

def create_date2(input_string: str) -> str:
    if input_string is None:
        return None
    # Convert the string to a datetime object using strptime
    date_object = datetime.strptime(input_string, '%b %d %Y')

    # Define the time as midnight and the desired timezone offset
    time = "00:00:00"
    timezone_hours = 8
    timezone_minutes = 50

    # Create the formatted string
    timestamp_tz_str = f"{date_object.year}-{date_object.month:02}-{date_object.day:02} {time}+{timezone_hours:02}{timezone_minutes:02}"
    return timestamp_tz_str



# Create a Node object from a single row of data
def create_node2(row):
    record = row.to_dict()
    record_content = (
        record["date"]
        + " "
        + record["title"]
        + " "
        + record["description"]
    )
    # Can change to TextNode as needed
    node = TextNode(
        id_=create_uuid2(str(record["date"])),
        text=record_content,
        metadata={
            "title": record["title"],
            "date": create_date2(str(record["date"])),
        },
    )

    return node


# Create nodes and embeddings
nodes = [create_node2(row) for _, row in reuters.iterrows()]
embedding_model = OpenAIEmbedding()

# Add nodes to Timescale Vector Store
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
_ = ts_vector_store.add(nodes[:100])

# Tried with this function. It runs but I don't know where the index is saved
ts_vector_store.create_index("aaa")
# Also, attempt to store the index (currently not working as expected)
storage_context = StorageContext.from_defaults(persist_dir="your_persist_dir")
index.storage_context.persist(persist_dir="your_persist_dir") #not clear how to retrieve the index variable

from llama_index.core import load_index_from_storage

# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context)

This is the error I get when calling load_index_from_storage:

KeyError                                  Traceback (most recent call last)
in ()
----> 6 index = load_index_from_storage(storage_context)

4 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/storage/storage_context.py in vector_store(self)
--> 264         return self.vector_stores[DEFAULT_VECTOR_STORE]

KeyError: 'default'

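Does anyone have experience with the llama-index-vector-stores-timescalevector package? How do I correctly store and reload the index so that I don't have to recreate it each time? Any guidance on the right approach, or pointers to relevant documentation, would be greatly appreciated.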

I want to be able to store the index and reload it later without rebuilding it from scratch.

storage timescaledb llama-index vectorstore llm-sql-generation
1 Answer
You can query an existing index as described at this link.

In short, VectorStoreIndex is the missing piece.

ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
index = VectorStoreIndex.from_vector_store(vector_store=ts_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("My question here")
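
To spell out the idea: with an external vector store the embeddings live as rows in the Timescale table, so there is nothing to persist to a local directory. The failing line in the traceback is a lookup of a default vector store in the local storage context, which suggests the persist_dir route is simply the wrong tool here. Below is a minimal sketch of the "write once, reconnect later" pattern. The function names build_index_once and reconnect_index are made up for illustration, and the sketch assumes TIMESCALE_SERVICE_URL and OPENAI_API_KEY are set and that the default embedding model is the one used when the table was filled.

```python
import os


def build_index_once(nodes, table_name="reuters_test"):
    """Embed `nodes` and write them to the Timescale table (run this once)."""
    # Imports are kept inside the functions so that loading this module
    # does not require the packages or a reachable database.
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.timescalevector import TimescaleVectorStore

    vector_store = TimescaleVectorStore.from_params(
        service_url=os.environ["TIMESCALE_SERVICE_URL"],
        table_name=table_name,
    )
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # Nodes without an embedding are embedded with the configured embed
    # model and then inserted into the table.
    return VectorStoreIndex(nodes, storage_context=storage_context)


def reconnect_index(table_name="reuters_test"):
    """Wrap the already-populated table in an index; nothing is re-embedded."""
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.timescalevector import TimescaleVectorStore

    vector_store = TimescaleVectorStore.from_params(
        service_url=os.environ["TIMESCALE_SERVICE_URL"],
        table_name=table_name,
    )
    return VectorStoreIndex.from_vector_store(vector_store=vector_store)
```

In a later session you would call only reconnect_index() and then index.as_query_engine(). The query text itself still has to be embedded, so the embedding API key is needed at query time too.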
    