How can I get consistent results from llama-parse when parsing PDFs that contain tables?


I am parsing some PDF files with LlamaParse in Python, using the following code:

import os

import nest_asyncio
import pandas as pd
from llama_parse import LlamaParse

# allow nested event loops (needed when running inside a notebook)
nest_asyncio.apply()

key_input = "some_key_id"
# the variable name was misspelled as LLMA_CLOUD_API_KEY in the original snippet
os.environ["LLAMA_CLOUD_API_KEY"] = key_input

# running llama parsing
doc_parsed = LlamaParse(result_type="markdown", api_key=key_input
                        ).load_data(r"Path\myfile.pdf")
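
When I run the same code now, parsing the same document gives a different result. The difference is in the | and row separators that delimit the table text.

Is there a way to get the old results back in LlamaParse, or to pin some parameter so that it uses the same model or works the same way and returns the same result every time? I want to build analysis on top of the output using the same code logic.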

LlamaParse result from last month:

print(doc_parsed[5].text[:1000])

# Information

|Name|: Mr. XXX|
|---|---|
|Age/Sex|: XX YRS/M|
|Lab Id.|: 0124080X|
|Refered By|: Self|
|Sample Collection On|: 03/Aug/2024 08:30AM|
|Collected By|: XXX|
|Sample Lab Rec. On|: 03/Aug/2024 11:50 AM|
|Collection Mode|: HOME COLLECTION|
|Reporting On|: 03/Aug/2024 02:48 PM|
|BarCode|: XXX|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|

LlamaParse result now, on the same PDF:

print(doc_parsed[5].text[:1000])

# Report Name: Mr. XXX Age/Sex: XXX YRS/M Lab Id: 0124080X Referred By: Self Sample Collection On: 03/Aug/2024 08:30 AM Collected By: XXX Sample Lab Rec. On: 03/Aug/2024 11:50 AM Collection Mode: HOME COLLECTION Reporting On: 03/Aug/2024 02:48 PM BarCode: XXX # Test Results Test Name Result Biological Ref. Int. Unit

Desired result:

# Above part doesn't matter but Test Results should be separated by |
# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|

Has the underlying model changed, and is that what causes the difference? Can I pin the model to get consistent results?
    

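python pdf pdf-parsing llama-parse

1 Answer

Adding a parsing instruction and asking it to create table data separated by | helped to some extent in producing the desired result, but I am not sure whether results obtained with instructions will stay consistent over time.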
Other answers are welcome, and I am open to better ones.

# parsing instruction
parsingInstruction2 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the table data into four columns each seperated by |."""

# parse function
doc_parsed_13Sep2 = LlamaParse(
    result_type="markdown",
    api_key=key_input,
    parsing_instruction=parsingInstruction2,
).load_data(r"Path\myfile.pdf")

Output:

# Report table { width: 100%; border-collapse: collapse; } th, td { border: 1px solid black; padding: 8px; text-align: left; } th { background-color: #f2f2f2; } Name: Mr. XXX Age/Sex: XXX YRS/M Lab Id: 0124080X Referred By: Self Sample Collection On: 03/Aug/2024 08:30AM Collected By: XXX Sample Lab Rec. On: 03/Aug/2024 11:50 AM Collection Mode: HOME COLLECTION Reporting On: 03/Aug/2024 03:24 PM BarCode: XXX

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dl|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
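
Since it is unclear whether instruction-guided output stays stable between runs, one possible workaround is to store the first acceptable result on disk and reuse it, so later runs of the same analysis see identical text. The following is only a sketch under assumptions: cached_parse, parse_fn, cache_dir and settings are hypothetical names introduced here, not part of llama-parse, and the cache key is simply a SHA-256 hash of the PDF bytes plus a settings string.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(pdf_path, parse_fn, cache_dir="parse_cache", settings=""):
    """Parse pdf_path with parse_fn once, then reuse the stored pages.

    parse_fn is any callable that takes a path and returns a list of
    page strings. settings is a free-form string (for example the
    parsing instruction) so that changing it produces a new cache entry.
    """
    pdf_bytes = Path(pdf_path).read_bytes()
    key = hashlib.sha256(pdf_bytes + settings.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"

    # Reuse the stored result if this PDF and settings were parsed before.
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    pages = parse_fn(pdf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(pages, ensure_ascii=False), encoding="utf-8")
    return pages


# Possible use with the calls from this post (assumption: each returned
# document exposes its markdown through .text, as in the question):
#
# pages = cached_parse(
#     r"Path\myfile.pdf",
#     lambda p: [d.text for d in LlamaParse(
#         result_type="markdown", api_key=key_input,
#         parsing_instruction=parsingInstruction2).load_data(p)],
#     settings=parsingInstruction2,
# )
```

This does not make the service itself deterministic. It only freezes a result that was already checked, which may be enough when the same PDFs are analysed repeatedly.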

Update: revised the instruction to ask for fields separated by |


parsingInstruction3 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the data with fields seperated by |."""

Output:

# TEST REPORT

|Name|Mr. XXX|
|---|---|
|Age/Sex|XXX YRS/M|
|Lab Id.|0124080X|
|Referred By|Self|
|Sample Collection On|03/Aug/2024 08:30 AM|
|Collected By|XXX|
|Sample Lab Rec. On|03/Aug/2024 11:50 AM|
|Collection Mode|HOME COLLECTION|
|Reporting On|03/Aug/2024 03:24 PM|
|BarCode|XXX|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dL|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
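
To build analysis on output like the above while still noticing when the format drifts, the table can be read into rows and rejected if the expected header is missing. This is a minimal sketch, assuming the returned text has one table row per line as in the last output. parse_pipe_table is a hypothetical helper written for this post, not a llama-parse function.

```python
def parse_pipe_table(markdown, expected_header):
    """Return rows (as dicts) of the first pipe table whose header matches.

    Raises ValueError if no line matches expected_header, which is a
    sign that the parser output no longer has the assumed layout.
    """
    rows = []
    in_table = False
    for raw in markdown.splitlines():
        line = raw.strip()
        is_row = len(line) >= 2 and line.startswith("|") and line.endswith("|")
        if not is_row:
            if in_table:
                break  # first non-table line ends the table
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if not in_table:
            # Skip other tables until the expected header shows up.
            in_table = cells == expected_header
            continue
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # the |---|---| separator row
        if len(cells) != len(expected_header):
            raise ValueError(f"unexpected column count in row: {line}")
        rows.append(dict(zip(expected_header, cells)))
    if not in_table:
        raise ValueError("expected table header not found")
    return rows


header = ["Test Name", "Result", "Biological Ref. Int.", "Unit"]
sample = """# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
"""
rows = parse_pipe_table(sample, header)
print(rows[0]["Result"])
```

With the un-piped output shown in the question, the same call raises ValueError instead of silently returning an empty result, so a format change fails loudly rather than corrupting the downstream analysis.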

