Extracting unstructured invoice data from a PDF

Question

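I am trying to extract data from a PDF that is highly unstructured. I used these packages (pdf2image, pytesseract, pillow, matplotlib) to get this output.

I am stuck trying to extract the data in the correct format: the PDF has no table borders, so the data is not laid out as a table.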

from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Convert PDF to a list of images
pages = convert_from_path(r'C:\Users\SREE GOPAL P\Desktop\python ocr\111442370.pdf', dpi=300)

# Define a function to apply OCR and get bounding boxes
def ocr_image(image):
    text_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return text_data

# Apply OCR to each page image
ocr_results = []
for page in pages:
    data = ocr_image(page)
    ocr_results.append(data)

# Function to visualize text layout on the image
def visualize_layout(image, data):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    
    for i in range(len(data['text'])):
        if float(data['conf'][i]) > 0:  # Skip non-word rows (conf of -1) and zero-confidence words; conf may be a float string
            x, y, w, h = (data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i])
            rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
            plt.gca().add_patch(rect)
            plt.text(x, y, data['text'][i], fontsize=7, color='blue')

    plt.axis('on')
    plt.show()

# Visualize each page
for page, data in zip(pages, ocr_results):
    visualize_layout(page, data)

Answer 1

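OCR on an unstructured layout tends to produce noise, such as misaligned or inaccurate text. For example, the words on a single line may be split across several adjacent boxes, and in that case your code can end up missing some of the text. Small shifts in box positions can also leave gaps or overlaps. You therefore need a function that handles this by merging adjacent bounding boxes into one.

The function looks like this: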

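def merge_bounding_boxes(data, threshold=10):
    """Merge adjacent or overlapping boxes given as (x, y, w, h, text) tuples."""
    if not data:
        return []

    merged_data = []
    current_box = data[0]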

    for next_box in data[1:]:
        x, y, w, h, text = current_box
        nx, ny, nw, nh, ntext = next_box

        # Check if the current and next boxes overlap or are adjacent
        if (nx <= x + w + threshold and ny <= y + h + threshold and x <= nx + nw + threshold and y <= ny + nh + threshold):
            new_x = min(x, nx)
            new_y = min(y, ny)
            new_w = max(x + w, nx + nw) - new_x
            new_h = max(y + h, ny + nh) - new_y
            new_text = text + " " + ntext
            current_box = (new_x, new_y, new_w, new_h, new_text)
        else:
            merged_data.append(current_box)
            current_box = next_box

    merged_data.append(current_box)
    return merged_data

Merging the boxes reduces the impact of misalignment and inaccuracy, which improves the extraction results.
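Note that merge_bounding_boxes expects a list of (x, y, w, h, text) tuples, while pytesseract.image_to_data with Output.DICT returns a dict of parallel lists. A minimal sketch of the conversion is below. The helper name words_to_boxes and its min_conf parameter are hypothetical, and it assumes Tesseract's row order is already close to reading order, so no re-sorting is done.

```python
def words_to_boxes(data, min_conf=0.0):
    """Turn an image_to_data dict into a list of (x, y, w, h, text) tuples.

    `data` is assumed to hold the parallel lists 'text', 'conf', 'left',
    'top', 'width' and 'height', as returned with Output.DICT.
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Skip empty rows (page/block/line entries) and low-confidence words.
        # conf can arrive as an int, a float, or a string, so go through float().
        if not word or float(data["conf"][i]) < min_conf:
            continue
        boxes.append((data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i], word))
    return boxes


# Hand-written sample standing in for real OCR output (made-up values).
sample = {
    "text":   ["", "Invoice", "No:", "12345", " "],
    "conf":   ["-1", "96", "91", "88.5", "-1"],
    "left":   [0, 50, 130, 170, 0],
    "top":    [0, 40, 41, 40, 0],
    "width":  [600, 75, 35, 60, 0],
    "height": [800, 20, 19, 20, 0],
}
boxes = words_to_boxes(sample)
```

The resulting list can then be passed to merge_bounding_boxes for each page, for example merge_bounding_boxes(words_to_boxes(ocr_results[0])). The threshold will likely need tuning for your DPI and font size.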
