How can I get Gemini 1.5 to extract tabular data from an image?

Problem description (votes: 0, answers: 1)

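I need to extract code-description pairs from a table in an image. The table has visible columns but no row separators, as shown below: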

[Image: table with four columns in Spanish]

I tried Gemini 1.5 Flash in the chat interface, giving it the image and a matching prompt, and it extracted the code-description pairs successfully:

[Image: actual prompt and answer from Gemini 1.5 Flash]

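When I tried to write a Python program that does the same thing, the only documentation I found covered extracting the text from the image and then passing that text to an LLM to work out how to pair the codes (under the "CÓDIGO" column) with the descriptions (under "DESIGNACIÓN DE LA MERCANCÍA"). But because the extracted text has lost its layout context, the model cannot work out the pairing from the text alone: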

import io
import os
from google.cloud import aiplatform
from google.cloud import language_v1
from google.cloud.vision_v1 import ImageAnnotatorClient

service_account_path = "my_key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path
# Authenticate using the service account key
aiplatform.init(project="my_project", location="us-central1", 
                credentials=service_account_path) 

def process_image(image_path):
    client = ImageAnnotatorClient()
    with io.open(image_path, 'rb') as image_file:
        content = image_file.read()
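    # Pass the bytes as an Image message; the OCR result is in full_text_annotation
    response = client.document_text_detection(image={"content": content})
    text = response.full_text_annotation.text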
    prompt = f"""
    The following text is extracted from a table in Spanish. 
    The table has columns: "CÓDIGO", "DESIGNACIÓN DE LA MERCANCÍA". 
    The "DESIGNACIÓN DE LA MERCANCÍA" column often starts with a hyphen "-". 

    Extract the data from the text and present it as a list of dictionaries, 
    where each dictionary has the following structure:
    {{"CÓDIGO": "code_value", "DESIGNACIÓN DE LA MERCANCÍA": "description"}}

    Text: 
    {text}
    """

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=prompt, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = client.analyze_sentiment(document=document)
    return response.document_sentiment.text.split("\n")[1:-1]

Is there a way to have this or another GenAI model take the image and process the prompt to extract the data, the way the chat version of Gemini 1.5 Flash does?

python google-gemini
1 Answer
0 votes

It turned out that I wasn't using the latest library for interacting with Gemini. I should have used "google-generativeai" instead of "google-cloud-*".

Because the newer models are multimodal, you can pass images directly as input to the model request, like this:

import os
import PIL.Image
import google.generativeai as genai

# Read the API key from a local file; strip() drops any trailing newline
with open('GEMINI_API_KEY.txt', 'r') as file:
    os.environ['GEMINI_API_KEY'] = file.read().strip()

genai.configure(api_key=os.environ['GEMINI_API_KEY'])

image_path = '/path/images'
image_path_1 = f"{image_path}/01.png" 
image_path_2 = f"{image_path}/02.png"

sample_file_1 = PIL.Image.open(image_path_1)
sample_file_2 = PIL.Image.open(image_path_2)

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

prompt = "..."
# Pass the prompt and the images
response = model.generate_content([prompt, sample_file_1, sample_file_2])

print(response.text)
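
To get the code-description pairs as Python data rather than free text, one option is to ask the model for JSON and parse the reply. The following is only a sketch under some assumptions: `PROMPT`, `parse_pairs`, and `extract_pairs` are hypothetical names that are not part of the original answer, `genai.configure(...)` is assumed to have been called already as shown above, and the `response_mime_type` setting (JSON mode) is assumed to be supported by the installed SDK version and the chosen model. The parser also tolerates a reply wrapped in a Markdown code fence, in case the model adds one.

```python
import json
import re

# Hypothetical prompt; adjust the wording to the actual table.
PROMPT = (
    'The image shows a table in Spanish with the columns "CÓDIGO" and '
    '"DESIGNACIÓN DE LA MERCANCÍA". Return only a JSON array of objects '
    'with the keys "CÓDIGO" and "DESIGNACIÓN DE LA MERCANCÍA", one object '
    "per table row that has a code."
)

# Three backticks, built at runtime so this listing contains no literal fence.
FENCE = "`" * 3


def parse_pairs(response_text):
    """Parse the model's reply into a list of row dictionaries."""
    # If the reply is wrapped in a Markdown code fence, keep only its body.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    rows = json.loads(payload)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of rows")
    return rows


def extract_pairs(image_path, model_name="gemini-1.5-flash"):
    """Send one image plus PROMPT to Gemini and return the parsed rows."""
    # Imported lazily so parse_pairs can be used without the SDK installed.
    import PIL.Image
    import google.generativeai as genai

    model = genai.GenerativeModel(model_name=model_name)
    response = model.generate_content(
        [PROMPT, PIL.Image.open(image_path)],
        # Assumption: JSON mode is available for this model and SDK version.
        generation_config={"response_mime_type": "application/json"},
    )
    return parse_pairs(response.text)
```

Calling `extract_pairs(image_path_1)` would then return a list of dictionaries keyed by the two column names, provided the model follows the prompt. If it does not return valid JSON, `json.loads` raises an error instead of returning partial data.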