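I am working on extracting text from documents. The text can be in plain paragraphs, or it can be split into segments and sections.
Tesseract on its own does a very good job of extracting the text, but as you can see in the image, the middle part (in red) is a table that does not follow a clearly defined structure. I don't care about the table structure itself. What I care about is keeping the extracted text in its context, so going cell by cell seems to be the solution.
I have not been able to get OpenCV to detect that table section. Once it is detected, I would crop each cell and OCR it separately from the other cells, so that the text stays in context.
This is what I did: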
import cv2
import pytesseract
import numpy as np
import matplotlib.pyplot as plt
# Load the image (in your case, load from PDF converted image)
# Define paths
RAW_DATA_PATH = '../../data/raw/'
PROCESSED_DATA_PATH = '../../data/processed/'
RESULTS_PATH = '../../data/results/'
image_path = f'{RAW_DATA_PATH}download0.png'
# Adjust this to match the path of your image
image = cv2.imread(image_path, cv2.IMREAD_COLOR)
# Preprocess the image (convert to grayscale, binarize it, etc.)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY_INV)
# Detect horizontal and vertical lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
# Detect horizontal lines
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
# Detect vertical lines
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)
# Combine both horizontal and vertical lines
table_structure = cv2.add(horizontal_lines, vertical_lines)
# Find contours of the table cells
contours, _ = cv2.findContours(table_structure, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Sort contours by position for proper ordering of table cells
contours = sorted(contours, key=lambda x: cv2.boundingRect(x)[1]) # Sort by Y position
# Define a function to extract text from each contour (table cell)
def extract_text_from_contour(image, contour):
    x, y, w, h = cv2.boundingRect(contour)
    cell_image = image[y:y+h, x:x+w]
    text = pytesseract.image_to_string(cell_image, lang='fra', config='--psm 6')
    return text, (x, y, w, h)
# Iterate through each contour and extract text
table_texts = []
for contour in contours:
    text, (x, y, w, h) = extract_text_from_contour(image, contour)
    table_texts.append(text)
    # Optionally, display the cell
    cv2.rectangle(image, (x, y), (x+w, y+h), (0, 255, 0), 2)
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()
# Display the extracted text for each table cell
for idx, text in enumerate(table_texts):
    print(f"Cell {idx + 1}:")
    print(text)
    print("------")
# Save the processed image with table contours for verification
cv2.imwrite(f'{PROCESSED_DATA_PATH}processed_table_image.png', image)
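Any suggestions?
EDIT:
Since many people said the image did not let them help much, here is a better-quality image. My goal is to automatically detect the red table (shown in the first image) across multiple documents. My main goal is to feed each cell to Tesseract separately so that I don't lose any context.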
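On the "don't lose context" part, the order in which cells are fed to Tesseract probably matters as well. Sorting contours by Y alone leaves cells in the same row in arbitrary left-to-right order whenever their top edges differ by a pixel or two. A minimal sketch of a row-aware ordering follows, assuming the boxes are (x, y, w, h) tuples as returned by cv2.boundingRect; the function name sort_boxes_reading_order and the row_tolerance parameter are hypothetical, and the tolerance would need tuning to the scan resolution.

```python
def sort_boxes_reading_order(boxes, row_tolerance=10):
    """Order (x, y, w, h) boxes top-to-bottom, then left-to-right.

    Boxes whose top edge lies within row_tolerance pixels of the first
    box in the current row are treated as belonging to that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, order cells by their left edge.
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


if __name__ == "__main__":
    cells = [(100, 12, 50, 20), (10, 10, 50, 20),
             (10, 50, 50, 20), (100, 48, 50, 20)]
    for box in sort_boxes_reading_order(cells):
        print(box)
```

Each box in the returned list could then be cropped and OCRed in turn, so that the output follows the table row by row.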