I need to extract code–description pairs from images of tables that have columns but no row separators, like this:

I tried Gemini 1.5 Flash: I gave the chat the image together with a suitable prompt, and it extracted the code–description pairs successfully:

When I tried to build a Python program with the same functionality, the only documentation I found covered extracting the text from the image and then passing that text to an LLM so it could work out how to pair the codes (under the "CÓDIGO" column) with the descriptions (under "DESIGNACIÓN DE LA MERCANCÍA"). But once the text is stripped of its visual context, the model cannot work out the pairing from the text alone:
import io
import os

from google.cloud import aiplatform
from google.cloud import language_v1
from google.cloud import vision
from google.cloud.vision_v1 import ImageAnnotatorClient

service_account_path = "my_key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path

# Authenticate using the service account key
aiplatform.init(project="my_project", location="us-central1",
                credentials=service_account_path)

def process_image(image_path):
    client = ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
    # The Vision API expects an Image object wrapping the raw bytes
    image = vision.Image(content=content)
    response = client.document_text_detection(image=image)
    text = response.full_text_annotation.text
    prompt = f"""
    The following text is extracted from a table in Spanish.
    The table has columns: "CÓDIGO", "DESIGNACIÓN DE LA MERCANCÍA".
    The "DESIGNACIÓN DE LA MERCANCÍA" column often starts with a hyphen "-".
    Extract the data from the text and present it as a list of dictionaries,
    where each dictionary has the following structure:
    {{"CÓDIGO": "code_value", "DESIGNACIÓN DE LA MERCANCÍA": "description"}}
    Text:
    {text}
    """
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=prompt, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    # This is the broken part: the Natural Language API only analyzes text
    # (sentiment, entities, etc.) — it does not generate a response to the
    # prompt, so there is no generated text to return here.
    response = client.analyze_sentiment(document=document)
    return response.document_sentiment.text.split("\n")[1:-1]
Is there a way for this, or some other GenAI, to take the image and process a prompt to extract the data, the way the chat version of Gemini 1.5 Flash does?
It turns out I was not using the latest library for interacting with Gemini. I should be using "google-generativeai" instead of the "google-cloud-*" packages.

Since the newer models are multimodal, you can upload images and use them directly as input to a model request, like this:
import os

import PIL.Image
import google.generativeai as genai

with open('GEMINI_API_KEY.txt', 'r') as file:
    os.environ['GEMINI_API_KEY'] = file.read().strip()
genai.configure(api_key=os.environ['GEMINI_API_KEY'])

image_path = '/path/images'
image_path_1 = f"{image_path}/01.png"
image_path_2 = f"{image_path}/02.png"

sample_file_1 = PIL.Image.open(image_path_1)
sample_file_2 = PIL.Image.open(image_path_2)

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

prompt = "..."

# Pass the prompt and the images
response = model.generate_content([prompt, sample_file_1, sample_file_2])
print(response.text)
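To get from `response.text` back to the list of dictionaries described in the original prompt, one option is to ask the model to answer as a JSON array and parse the reply. A minimal sketch of that parsing step, assuming the prompt requested JSON output (the sample reply below is hypothetical — real tariff rows would come from your images — and the fence-stripping guards against Gemini wrapping its answer in Markdown code fences, which it sometimes does):

```python
import json

# Hypothetical model reply, assuming the prompt asked for a JSON array of
# {"CÓDIGO": ..., "DESIGNACIÓN DE LA MERCANCÍA": ...} objects.
reply = '''
[
  {"CÓDIGO": "0101.21", "DESIGNACIÓN DE LA MERCANCÍA": "- Reproductores de raza pura"},
  {"CÓDIGO": "0101.29", "DESIGNACIÓN DE LA MERCANCÍA": "- Los demás"}
]
'''

def parse_pairs(text):
    # Strip the optional ```json ... ``` fences around the reply, then
    # parse the remaining JSON into a list of dictionaries.
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    return json.loads(cleaned)

pairs = parse_pairs(reply)
print(pairs[0]["CÓDIGO"])  # 0101.21
```

With this in place, the call in the answer becomes `pairs = parse_pairs(response.text)`. Asking for JSON explicitly in the prompt makes the output far easier to consume than free-form text.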