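I am using the llama_index Python package with the llama-index-vector-stores-timescalevector extension to manage my vectors in Timescale. The problem is that I cannot store the index for later use, so I have to recreate it every time I run my code. That is very inefficient and not workable for my use case.
I followed this tutorial: the TimescaleVector example, but it does not mention how to store the index and load it again later.
Here is a snippet of my code setup. The CSV is available at this link.
pip install llama_index llama-index-vector-stores-postgres llama-index-embeddings-openai llama-index-vector-stores-timescalevector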
import llama_index
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.vector_stores.timescalevector import TimescaleVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
import pandas as pd
import os
import time
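from datetime import datetime, timedelta
# Assumed import: create_uuid2 below calls timescale_client.uuid_from_time
from timescale_vector import client as timescale_client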
# API keys and paths hidden for security
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key'
os.environ["TIMESCALE_SERVICE_URL"] = 'your_timescale_service_url'
# Load and process data
reuters = pd.read_csv('your_file_path')
reuters.columns = ["title", "date", "description"]
# Function to take in a date string in the past and return a uuid v1
def create_uuid2(date_string: str):
    if date_string is None:
        return None
    time_format = '%b %d %Y'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = timescale_client.uuid_from_time(datetime_obj)
    return str(uuid)

# Convert a date string such as 'Jan 05 2024' into a timestamptz-style string
def create_date2(input_string: str) -> str:
    if input_string is None:
        return None
    # Convert the string to a datetime object using strptime
    date_object = datetime.strptime(input_string, '%b %d %Y')
    # Define the time as midnight and the desired timezone offset
    time_of_day = "00:00:00"
    timezone_hours = 8
    timezone_minutes = 50
    # Create the formatted string
    timestamp_tz_str = f"{date_object.year}-{date_object.month:02}-{date_object.day:02} {time_of_day}+{timezone_hours:02}{timezone_minutes:02}"
    return timestamp_tz_str
# Create a Node object from a single row of data
def create_node2(row):
    record = row.to_dict()
    record_content = (
        record["date"]
        + " "
        + record["title"]
        + " "
        + record["description"]
    )
    # Can change to TextNode as needed
    node = TextNode(
        id_=create_uuid2(str(record["date"])),
        text=record_content,
        metadata={
            "title": record["title"],
            "date": create_date2(str(record["date"])),
        },
    )
    return node
# Create nodes and embeddings
nodes = [create_node2(row) for _, row in reuters.iterrows()]
embedding_model = OpenAIEmbedding()
# Add nodes to Timescale Vector Store
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
_ = ts_vector_store.add(nodes[:100])
# Tried with this function. It runs but I don't know where the index is saved
ts_vector_store.create_index("aaa")
# Also, attempt to store the index (currently not working as expected)
storage_context = StorageContext.from_defaults(persist_dir="your_persist_dir")
index.storage_context.persist(persist_dir="your_persist_dir") #not clear how to retrieve the index variable
from llama_index.core import load_index_from_storage
# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context)
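This is the error I get when calling load_index_from_storage:
KeyError                                  Traceback (most recent call last)
4 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/storage/storage_context.py in vector_store(self)
    262
    263
--> 264         return self.vector_stores[DEFAULT_VECTOR_STORE]
    265
    266
KeyError: 'default'
Does anyone have experience with the llama-index-vector-stores-timescalevector package? How do I correctly store and reload the index so that I don't have to recreate it every time? Any guidance on the right approach, or pointers to relevant documentation, would be greatly appreciated.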
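I expected to be able to store the index and reload it later without recreating it from scratch.
Query the existing index.
In short, VectorStoreIndex is the missing piece.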
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
index = VectorStoreIndex.from_vector_store(vector_store=ts_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("My question here")
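To spell out why this works: with an external vector store, the embedded nodes live in the Timescale table itself, so there is nothing to persist to a local directory. As far as I understand it, persist_dir and load_index_from_storage target the default on-disk stores, and ts_vector_store.create_index builds a search index on the database table rather than a LlamaIndex index object. A minimal sketch of the "ingest once, reopen later" split might look like the following. The function names open_store, ingest and reopen are hypothetical helpers of mine, not part of either package, and the sketch assumes TIMESCALE_SERVICE_URL and OPENAI_API_KEY are set and that the default embedding model is the one you want.

```python
import os


def open_store(table_name: str = "reuters_test"):
    """Connect to the Timescale table that holds the embedded nodes."""
    # Imported inside the function so this file can be imported
    # without a live database or the optional packages loaded.
    from llama_index.vector_stores.timescalevector import TimescaleVectorStore

    return TimescaleVectorStore.from_params(
        service_url=os.environ["TIMESCALE_SERVICE_URL"],
        table_name=table_name,
    )


def ingest(nodes, table_name: str = "reuters_test"):
    """Run once: embed the nodes and write them to the Timescale table."""
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(
        vector_store=open_store(table_name)
    )
    # Building the index embeds each node and stores it in the table.
    return VectorStoreIndex(nodes, storage_context=storage_context)


def reopen(table_name: str = "reuters_test"):
    """Run on every later start: wrap the existing table, no re-embedding."""
    from llama_index.core import VectorStoreIndex

    return VectorStoreIndex.from_vector_store(vector_store=open_store(table_name))
```

On the first run you would call ingest(nodes); on later runs, reopen().as_query_engine() should be enough to query the rows already stored in reuters_test.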