How can I improve OCR extraction from low-contrast, blurry newspaper images using Python and Tesseract?

Question

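I am working on a Django application that extracts text from images of newspaper clippings. The images are often low-contrast and blurry, and they contain a variety of text blocks: headlines, dates, small text, bold text, and some blurred letters. The images are about 256x350 pixels at 96 dpi.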

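I have tried several preprocessing techniques to enhance the images before passing them to Tesseract OCR, but I am still not getting satisfactory results. Here is my current code: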

import os
from django.shortcuts import render
from django.core.files.storage import FileSystemStorage
import pytesseract
import cv2
from PIL import Image
import numpy as np
import re

# Set Tesseract command
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

def clean_text(text):
    # Example of cleaning common OCR errors
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s,.!?;:()"\']', '', text)  # Keep common punctuation
    return text.strip()

def home(request):
    return render(request, 'layout/index.html')

def preprocess_image(image_path):
    # Read the image
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError(f"Error loading image: {image_path}")
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Histogram equalization to enhance contrast
    gray = cv2.equalizeHist(gray)
    
    # Increase contrast further if necessary
    gray = cv2.convertScaleAbs(gray, alpha=2.5, beta=50)
    
    # Resize the image to increase readability (aggressively)
    height, width = gray.shape
    scale_factor = 2  # Increase the scale factor for more aggressive resizing
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)
    resized_image = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    
    # Apply Gaussian Blur to remove noise
    blurred_image = cv2.GaussianBlur(resized_image, (3, 3), 0)
    
    # Apply adaptive thresholding and combine with Otsu's method
    _, binary_otsu = cv2.threshold(blurred_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    binary_adaptive = cv2.adaptiveThreshold(blurred_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    
    # Combine the results
    binary = cv2.bitwise_and(binary_adaptive, binary_otsu)
    
    # Sharpen the image
    kernel_sharpening = np.array([[0, -1, 0], 
                                  [-1, 5, -1],
                                  [0, -1, 0]])
    binary = cv2.filter2D(binary, -1, kernel_sharpening)
    
    # Apply dilation and erosion to remove noise
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=1)
    binary = cv2.erode(binary, kernel, iterations=1)
    
    return binary

def upload(request):
    if request.method == 'POST' and request.FILES.get('document'):
        document = request.FILES['document']
        fs = FileSystemStorage()
        filename = fs.save(document.name, document)
        uploaded_file_url = fs.url(filename)

        # Preprocess the uploaded document
        try:
            preprocessed_image = preprocess_image(fs.path(filename))
        except ValueError as e:
            return render(request, 'layout/index.html', {'error': str(e)})

        # Convert the processed image to PIL format
        preprocessed_image_pil = Image.fromarray(preprocessed_image)

        # Extract text using Tesseract
        custom_config = r'--oem 3 --psm 12 --dpi 1500'
        text = pytesseract.image_to_string(preprocessed_image_pil, config=custom_config)

        # Clean the extracted text
        text = clean_text(text)

        context = {
            'uploaded_file_url': uploaded_file_url,
            'text': text,
        }
        return render(request, 'layout/index.html', context)
    return render(request, 'layout/index.html')

<!DOCTYPE html>
<html>
<head>
    <title>Document Layout Detection</title>
</head>
<body>
    <h1>Upload Document</h1>
    <form method="post" enctype="multipart/form-data" action="{% url 'upload' %}">
        {% csrf_token %}
        <input type="file" name="document">
        <button type="submit">Upload</button>
    </form>

    {% if uploaded_file_url %}
        <h2>Uploaded Document:</h2>
        <img src="{{ uploaded_file_url }}" alt="Document">
        <h2>Extracted Text:</h2>
        <pre>{{ text }}</pre>
    {% endif %}
</body>
</html>

from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name='home'),
    path('upload/', views.upload, name='upload'),
]

Answer 1

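It is a reasonable heuristic that most of the paper in a printed document is white (or, for old material, yellow with orange spots). As a result, if you have a colour scan available, the red channel alone is often a better, higher-contrast starting point. The green channel is always worth a look, but the blue channel typically only adds noise and foxing marks and lowers contrast.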
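As a rough sketch of the channel idea, the snippet below pulls a single channel out of a colour image and compares the channels using standard deviation as a crude contrast proxy. The helper names `pick_channel` and `channel_contrast` are made up for illustration, and the image is assumed to be a BGR array such as the one `cv2.imread(path, cv2.IMREAD_COLOR)` returns.

```python
import numpy as np

# Channel order in arrays returned by cv2.imread is blue, green, red.
CHANNEL_INDEX = {"blue": 0, "green": 1, "red": 2}

def pick_channel(image_bgr, channel="red"):
    """Return one colour channel of a BGR image as a 2-D array (hypothetical helper)."""
    return image_bgr[:, :, CHANNEL_INDEX[channel]].copy()

def channel_contrast(image_bgr):
    """Standard deviation per channel, used here only as a crude contrast proxy."""
    return {name: float(pick_channel(image_bgr, name).std())
            for name in CHANNEL_INDEX}
```

In the question's `preprocess_image`, `gray = pick_channel(image)` could stand in for the `cv2.cvtColor` call when the scan is in colour. Comparing the channels by eye on a few real clippings is still the safer way to choose.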

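To make the text more readable, "equalising the histogram" is not what you really want to do to a monochrome image. What you actually want is to smash all of the histogram peak at the top end near white to 255 and the smaller peak at the bottom end to 0, then linearly distribute the N remaining midrange values onto (255*k)/N. The exact choice of cutoffs isn't sensitive, but this step destroys halftone images in order to make the text contrast sharper.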
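One way to read that step is as a plain levels adjustment: everything at or below a black point goes to 0, everything at or above a white point goes to 255, and the values in between are spread linearly. The sketch below assumes that reading. The function names and the percentile defaults in `estimate_cutoffs` are my own guesses, not values from the answer, and would need tuning on real scans.

```python
import numpy as np

def stretch_levels(gray, black_point, white_point):
    """Clip to [black_point, white_point], then map that range linearly onto 0..255."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scale = 255.0 / (white_point - black_point)
    scaled = (gray.astype(np.float64) - black_point) * scale
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

def estimate_cutoffs(gray, low_pct=2.0, high_pct=60.0):
    """Guess cutoffs from percentiles (assumed defaults; paper is most of the page)."""
    low, high = np.percentile(gray, [low_pct, high_pct])
    return float(low), float(high)
```

Since the answer says the exact cutoffs are not sensitive, picking them once by looking at a histogram of a typical clipping may be enough, with `stretch_levels` replacing the `equalizeHist` and `convertScaleAbs` calls.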

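If that isn't enough and the text is still blurred, then an unsharp mask matched to the blur, determined empirically from the image, will generally work better than a crude "sharpen" kernel.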

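You should be able to avoid Gaussian blurring the image entirely (that step is only needed in your workflow as a side effect of equalising the histogram). If you do need any further noise preprocessing, despeckling salt-and-pepper noise, a median filter, or edge-preserving smoothing are all better choices.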

Gaussian blur can help make scanned halftone images presentable.

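Sometimes, on particularly messy pages that have been in the wars, one denoising method works much better than any of the others. The workflow above should work for most material, but some awkward cases need additional manual intervention to get the best results.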
