Trouble locating a table in an image with OpenCV

Question

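I'm working on extracting text from documents. The text can be laid out as paragraphs, or split into sections and zones.

Tesseract on its own does a great job of extracting the text, but as you can see from the image, the middle part (in red) is a table that doesn't follow a well-defined structure. I don't care about the table structure itself. What I care about is keeping the extracted text in its context, so going cell by cell seems to be the solution.

I can't get OpenCV to detect that table section. Once it is detected, I want to crop each cell and OCR it separately from the others, so that the text stays in context.

Here is what I did: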

import cv2
import pytesseract
import numpy as np
import matplotlib.pyplot as plt

# Load the image (in your case, load from PDF converted image)
# Define paths
RAW_DATA_PATH = '../../data/raw/'
PROCESSED_DATA_PATH = '../../data/processed/'
RESULTS_PATH = '../../data/results/'

image_path = f'{RAW_DATA_PATH}download0.png'

# Adjust this to match the path of your image
image = cv2.imread(image_path, cv2.IMREAD_COLOR)

# Preprocess the image (convert to grayscale, binarize it, etc.)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY_INV)

# Detect horizontal and vertical lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

# Detect horizontal lines
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
# Detect vertical lines
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# Combine both horizontal and vertical lines
table_structure = cv2.add(horizontal_lines, vertical_lines)

# Find contours of the table cells
contours, _ = cv2.findContours(table_structure, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Sort contours by position for proper ordering of table cells
contours = sorted(contours, key=lambda x: cv2.boundingRect(x)[1])  # Sort by Y position

# Define a function to extract text from each contour (table cell)
def extract_text_from_contour(image, contour):
    x, y, w, h = cv2.boundingRect(contour)
    cell_image = image[y:y+h, x:x+w]
    text = pytesseract.image_to_string(cell_image, lang='fra', config='--psm 6')
    return text, (x, y, w, h)

# Iterate through each contour and extract text
table_texts = []
annotated = image.copy()  # draw on a copy so the boxes do not leak into later OCR crops
for contour in contours:
    text, (x, y, w, h) = extract_text_from_contour(image, contour)
    table_texts.append(text)

    # Mark the detected cell on the annotated copy
    cv2.rectangle(annotated, (x, y), (x+w, y+h), (0, 255, 0), 2)

# Optionally, display all detected cells at once
plt.imshow(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
plt.show()

# Display the extracted text for each table cell
for idx, text in enumerate(table_texts):
    print(f"Cell {idx + 1}:")
    print(text)
    print("------")

# Save the annotated image with table contours for verification
cv2.imwrite(f'{PROCESSED_DATA_PATH}processed_table_image.png', annotated)
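
A side note on the ordering step: sorting by Y alone does not guarantee left-to-right order for cells that share a row. Below is a minimal sketch of a row-aware sort, assuming cells in the same row have top edges within a few pixels of each other. The helper name `sort_reading_order` and the `row_tolerance` default are hypothetical and not part of the code above.

```python
def sort_reading_order(boxes, row_tolerance=10):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right within a row.

    row_tolerance is an assumed pixel threshold: a box joins the current row
    when its top edge is within this distance of the row's first box.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # scan from top to bottom
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)   # same row as the previous box
        else:
            rows.append([box])     # start a new row
    # Within each row, order the boxes by their left edge
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]
```

The boxes would come from `cv2.boundingRect(contour)` for each contour, and the tolerance would likely need adjusting to the scan resolution.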

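Any suggestions?

Edit:

Since many people said the image did not let them help much, here is a better-quality one. My goal is to automatically detect the table marked in red (in the first image) across multiple documents. My main objective is to feed each cell to Tesseract individually so that I don't lose any context.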