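I am trying to extract data from a PDF that is highly unstructured. I used these packages (pdf2image, pytesseract, pillow, matplotlib) to get this output.
I am stuck trying to extract the data in the right format: the PDF has no table borders, so the data is not laid out in tabular form.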
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
# Convert PDF to a list of images
pages = convert_from_path(r'C:\Users\SREE GOPAL P\Desktop\python ocr\111442370.pdf', dpi=300)
# Define a function to apply OCR and get bounding boxes
def ocr_image(image):
    text_data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return text_data
# Apply OCR to each page image
ocr_results = []
for page in pages:
    data = ocr_image(page)
    ocr_results.append(data)
# Function to visualize text layout on the image
def visualize_layout(image, data):
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    for i in range(len(data['text'])):
        # float() also handles fractional confidences such as '96.5', which int() rejects
        if float(data['conf'][i]) > 0:  # Filter out low-confidence results
            x, y, w, h = (data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i])
            rect = patches.Rectangle((x, y), w, h, linewidth=1, edgecolor='r', facecolor='none')
            plt.gca().add_patch(rect)
            plt.text(x, y, data['text'][i], fontsize=7, color='blue')
    plt.axis('on')
    plt.show()
# Visualize each page
for page, data in zip(pages, ocr_results):
    visualize_layout(page, data)
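When running OCR, an unstructured layout can introduce noise, such as misaligned or inaccurately recognised text. For example, a line of words may be split into several boxes that sit next to each other, in which case your code misses some of the text. Shifts in box positions can also produce gaps or overlaps. You therefore need a function that handles this by merging adjacent bounding boxes into one.
The function is as follows: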
def merge_bounding_boxes(data, threshold=10):
    # data is a list of (x, y, w, h, text) tuples in reading order
    if not data:
        return []
    merged_data = []
    current_box = data[0]
    for next_box in data[1:]:
        x, y, w, h, text = current_box
        nx, ny, nw, nh, ntext = next_box
        # Check if the current and next boxes overlap or are adjacent
        if (nx <= x + w + threshold and ny <= y + h + threshold
                and x <= nx + nw + threshold and y <= ny + nh + threshold):
            new_x = min(x, nx)
            new_y = min(y, ny)
            new_w = max(x + w, nx + nw) - new_x
            new_h = max(y + h, ny + nh) - new_y
            new_text = text + " " + ntext
            current_box = (new_x, new_y, new_w, new_h, new_text)
        else:
            merged_data.append(current_box)
            current_box = next_box
    merged_data.append(current_box)
    return merged_data
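To see what the `threshold` in the `if` condition of `merge_bounding_boxes` actually allows, the test can be pulled out and tried on a few boxes. This is only a sketch: `boxes_are_adjacent` is a hypothetical helper name, and the coordinates are made-up values, not taken from your PDF.

```python
def boxes_are_adjacent(a, b, threshold=10):
    """Same test as the if-condition in merge_bounding_boxes, on (x, y, w, h) boxes."""
    x, y, w, h = a
    nx, ny, nw, nh = b
    return (nx <= x + w + threshold and ny <= y + h + threshold
            and x <= nx + nw + threshold and y <= ny + nh + threshold)


# Made-up boxes for illustration only
word_a = (40, 52, 80, 20)      # spans x 40..120, y 52..72
same_line = (126, 53, 35, 19)  # starts 6 px to the right of word_a
far_right = (300, 52, 60, 20)  # starts 180 px to the right of word_a
next_line = (40, 78, 90, 20)   # starts 6 px below word_a

print(boxes_are_adjacent(word_a, same_line))  # True
print(boxes_are_adjacent(word_a, far_right))  # False
print(boxes_are_adjacent(word_a, next_line))  # True
```

Note that the threshold applies on both axes, so a box on the following line is also merged when the vertical gap is below the threshold. If your lines are tightly spaced, a smaller threshold, or separate horizontal and vertical thresholds, may work better.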
Merging in this way minimizes the effect of misalignment and inaccuracy, which improves the extraction result.
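One gap to bridge: `merge_bounding_boxes` expects a list of `(x, y, w, h, text)` tuples, while `pytesseract.image_to_data(..., output_type=Output.DICT)` returns a dict of parallel lists. A minimal sketch of the conversion could look like this; `dict_to_boxes` and `min_conf` are hypothetical names, and the sample dict is hand-written to mimic the OCR output rather than real data.

```python
def dict_to_boxes(data, min_conf=0):
    """Convert a pytesseract Output.DICT result into (x, y, w, h, text) tuples."""
    boxes = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # Skip empty entries and low-confidence words;
        # float() accepts ints, floats and numeric strings such as "96.5"
        if not text or float(data["conf"][i]) <= min_conf:
            continue
        boxes.append((data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i], text))
    return boxes


# Hand-written sample in the same shape as the OCR dict (illustrative values)
sample = {
    "text": ["", "Invoice", "No.", "  "],
    "conf": ["-1", "96.5", 91, "-1"],
    "left": [0, 40, 130, 0],
    "top": [0, 52, 53, 0],
    "width": [600, 80, 35, 0],
    "height": [800, 20, 19, 0],
}
print(dict_to_boxes(sample))
# [(40, 52, 80, 20, 'Invoice'), (130, 53, 35, 19, 'No.')]
```

With your own data, the result of `dict_to_boxes(data)` for each page in `ocr_results` can then be passed to `merge_bounding_boxes`.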