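I'm new to coding and am trying to use my custom extraction (template) model from a Python file to read more documents. I built the model in Document Intelligence Studio, applied my own labels, trained and tested it, and it seemed to work well. I then tried to use the model ID in a Python application to read PDFs from a local folder, but it does not analyze the documents with the model's labels. Instead it seems to use automatic labeling, which I don't want, because in that case I would just use one of the sample models. Is something wrong with my code, or is it something else? Here it is (I mostly used the code provided by Document Intelligence Studio; maybe I should change it?):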
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "<myendpoint>"
key = "<mykey>"
model_id = "<my-custom-model-id>"
formUrl = "<pdf-url>"

# Initialize the DocumentAnalysisClient with endpoint and key
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

# Make sure your document's type is included in the list of document types the custom model can analyze
poller = document_analysis_client.begin_analyze_document_from_url(model_id, formUrl)
result = poller.result()

# The labeled fields of a custom model are reported per document, under document.fields
for idx, document in enumerate(result.documents):
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
    for name, field in document.fields.items():
        field_value = field.value if field.value else field.content
        print("......found field '{}' of type '{}' with value '{}' and with confidence {}".format(name, field.value_type, field_value, field.confidence))

# iterate over tables, lines, and selection marks on each page
for page in result.pages:
    print("\nLines found on page {}".format(page.page_number))
    for line in page.lines:
        print("...Line '{}'".format(line.content.encode('utf-8')))
    for word in page.words:
        print(
            "...Word '{}' has a confidence of {}".format(
                word.content.encode('utf-8'), word.confidence
            )
        )
    for selection_mark in page.selection_marks:
        print(
            "...Selection mark is '{}' and has a confidence of {}".format(
                selection_mark.state, selection_mark.confidence
            )
        )

for i, table in enumerate(result.tables):
    print("\nTable {} can be found on page:".format(i + 1))
    for region in table.bounding_regions:
        # Print the page number of the region (the original printed the table index here)
        print("...{}".format(region.page_number))
    for cell in table.cells:
        print(
            "...Cell[{}][{}] has content '{}'".format(
                cell.row_index, cell.column_index, cell.content.encode('utf-8')
            )
        )
print("-----------------------------------")
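I tried changing the model type in Document Intelligence Studio, but that didn't help, so I think it must have something to do with my code... Maybe I have to specify the labels? But I don't know how...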
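A custom model trained with specific labels should be able to handle the type of data it was trained on. Check that the labels in Document Intelligence Studio match the data structure of the PDF being analyzed.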
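One way to run that check from Python is to compare the field names the service returns against the labels you expect. The sketch below is only an illustration: it assumes the result object exposes `documents`, each with a `fields` mapping whose entries have `value`, `content`, and `confidence`, as in the code in this thread. The helper names `collect_labeled_fields` and `missing_labels`, and the label names `InvoiceNumber`, `Total`, and `DueDate`, are made up for the example, and a stand-in object replaces a real service result so the block runs offline.

```python
from types import SimpleNamespace


def collect_labeled_fields(result):
    """Return {field_name: (value_or_content, confidence)} across all documents."""
    found = {}
    for document in result.documents or []:
        for name, field in (document.fields or {}).items():
            # Fall back to the raw text when no typed value was extracted
            value = field.value if field.value is not None else field.content
            found[name] = (value, field.confidence)
    return found


def missing_labels(result, expected_labels):
    """Return the expected label names that the result did not contain, sorted."""
    return sorted(set(expected_labels) - set(collect_labeled_fields(result)))


# Stand-in for a real analysis result (hypothetical data, no service call)
fake_result = SimpleNamespace(
    documents=[
        SimpleNamespace(
            fields={
                "InvoiceNumber": SimpleNamespace(value="INV-001", content="INV-001", confidence=0.95),
                "Total": SimpleNamespace(value=None, content="42.00", confidence=0.80),
            }
        )
    ]
)

expected = ["InvoiceNumber", "Total", "DueDate"]  # hypothetical labels from the Studio project
print(collect_labeled_fields(fake_result))
print("Missing labels:", missing_labels(fake_result, expected))
```

With a real result from `poller.result()`, an empty dictionary here would suggest that the labeled fields are not coming back at all, while a few missing names would point at specific labels that do not match the document.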
I trained a custom model in Document Intelligence Studio to recognize these key-value pairs.
Code:
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Replace with your endpoint and key
endpoint = "https://doc inexxx.cognitiveservices.azure.com/"
key = "699a1e68fxxxxxxxxxxxx683a44d"
model_id = "testmodel2"

# Initialize the client
client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

# Function to analyze a document using the custom model
def analyze_document(file_path):
    with open(file_path, "rb") as f:
        poller = client.begin_analyze_document(model_id=model_id, document=f)
        result = poller.result()
    print("Full Analysis Result:")
    print(result)
    # Key-value pairs belong to the whole result, not to individual pages
    print("Key-Value Pairs:")
    for kv_pair in result.key_value_pairs or []:
        key_text = kv_pair.key.content if kv_pair.key else "None"
        value_text = kv_pair.value.content if kv_pair.value else "None"
        print(f" Key: {key_text}, Value: {value_text}")
    for page in result.pages:
        print(f"Page number: {page.page_number}")
        print("Lines:")
        for line in page.lines:
            print(f" {line.content}")
        print("Words:")
        for word in page.words:
            print(f" Word: {word.content}, Confidence: {word.confidence}")
        print("Selection Marks:")
        for selection_mark in page.selection_marks:
            print(f" Selection Mark: {selection_mark.state}, Confidence: {selection_mark.confidence}")
    # Tables are also reported on the result rather than on each page
    print("Tables:")
    for table in result.tables or []:
        for cell in table.cells:
            print(f" Cell[{cell.row_index}][{cell.column_index}]: {cell.content}")

# Path to your single PDF file
pdf_path = r"C:\Users\v-chikkams\Downloads\Get_Started_With_Smallpdf.pdf"

# Analyze the document
print(f"Analyzing document: {pdf_path}")
analyze_document(pdf_path)
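Since the original question is about reading several PDFs from a local folder, the single-file call above could be wrapped in a loop. This is a minimal sketch, not part of the SDK: `analyze_folder` is a hypothetical helper, and the analysis function is passed in as a parameter so the loop can be tried without calling the service.

```python
from pathlib import Path


def analyze_folder(folder, analyze, pattern="*.pdf"):
    """Call analyze(path) for each matching file in folder, in name order.

    Returns {file_name: result_or_exception} so one bad file does not stop the run.
    """
    outcomes = {}
    for path in sorted(Path(folder).glob(pattern)):
        try:
            outcomes[path.name] = analyze(str(path))
        except Exception as exc:  # record the failure and continue with the next file
            outcomes[path.name] = exc
    return outcomes


# Possible usage with the function defined above (the folder path is a placeholder):
# outcomes = analyze_folder(r"C:\path\to\pdfs", analyze_document)
```

Passing the function in, instead of calling the client inside the loop, keeps the folder handling separate from the service call and makes it easy to swap in a version of `analyze_document` that returns the extracted fields.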
Custom-trained model:
Output: