How can I improve OCR extraction from low-contrast, blurry newspaper images using Python and Tesseract?

Question

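I am working on a Django application that extracts text from images of newspaper clippings. The images are often low-contrast and blurry, and they contain a mix of text blocks: headlines, dates, small text, bold text, and some blurred letters. Each image is about 256x350 pixels at 96 dpi.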

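I have tried several preprocessing techniques to enhance the images before passing them to Tesseract OCR, but the results are still unsatisfactory. Here is my current code: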

import os
from django.shortcuts import render
from django.core.files.storage import FileSystemStorage
import pytesseract
import cv2
from PIL import Image
import numpy as np
import re

# Set Tesseract command
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

def clean_text(text):
    # Example of cleaning common OCR errors
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s,.!?;:()"\']', '', text)  # Keep common punctuation
    return text.strip()

def home(request):
    return render(request, 'layout/index.html')

def preprocess_image(image_path):
    # Read the image
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError(f"Error loading image: {image_path}")
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Histogram equalization to enhance contrast
    gray = cv2.equalizeHist(gray)
    
    # Increase contrast further if necessary
    gray = cv2.convertScaleAbs(gray, alpha=2.5, beta=50)
    
    # Resize the image to increase readability (aggressively)
    height, width = gray.shape
    scale_factor = 2  # Increase the scale factor for more aggressive resizing
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)
    resized_image = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    
    # Apply Gaussian Blur to remove noise
    blurred_image = cv2.GaussianBlur(resized_image, (3, 3), 0)
    
    # Apply adaptive thresholding and combine with Otsu's method
    _, binary_otsu = cv2.threshold(blurred_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    binary_adaptive = cv2.adaptiveThreshold(blurred_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    
    # Combine the results
    binary = cv2.bitwise_and(binary_adaptive, binary_otsu)
    
    # Sharpen the image
    kernel_sharpening = np.array([[0, -1, 0], 
                                  [-1, 5, -1],
                                  [0, -1, 0]])
    binary = cv2.filter2D(binary, -1, kernel_sharpening)
    
    # Apply dilation and erosion to remove noise
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=1)
    binary = cv2.erode(binary, kernel, iterations=1)
    
    return binary

def upload(request):
    if request.method == 'POST' and request.FILES.get('document'):
        document = request.FILES['document']
        fs = FileSystemStorage()
        filename = fs.save(document.name, document)
        uploaded_file_url = fs.url(filename)

        # Preprocess the uploaded document
        try:
            preprocessed_image = preprocess_image(fs.path(filename))
        except ValueError as e:
            return render(request, 'layout/index.html', {'error': str(e)})

        # Convert the processed image to PIL format
        preprocessed_image_pil = Image.fromarray(preprocessed_image)

        # Extract text using Tesseract
        custom_config = r'--oem 3 --psm 12 --dpi 1500'
        text = pytesseract.image_to_string(preprocessed_image_pil, config=custom_config)

        # Clean the extracted text
        text = clean_text(text)

        context = {
            'uploaded_file_url': uploaded_file_url,
            'text': text,
        }
        return render(request, 'layout/index.html', context)
    return render(request, 'layout/index.html')

Template (layout/index.html):

<!DOCTYPE html>
<html>
<head>
    <title>Document Layout Detection</title>
</head>
<body>
    <h1>Upload Document</h1>
    <form method="post" enctype="multipart/form-data" action="{% url 'upload' %}">
        {% csrf_token %}
        <input type="file" name="document">
        <button type="submit">Upload</button>
    </form>

    {% if uploaded_file_url %}
        <h2>Uploaded Document:</h2>
        <img src="{{ uploaded_file_url }}" alt="Document">
        <h2>Extracted Text:</h2>
        <pre>{{ text }}</pre>
    {% endif %}
</body>
</html>

URL configuration:

from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name='home'),
    path('upload/', views.upload, name='upload'),
]
python django numpy ocr tesseract
1 Answer

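It is a reasonable heuristic that most of the paper in a printed document is white (or, for older material, yellow with orange foxing). So if you have a colour scan available, the red channel alone is often a better, higher-contrast starting point. The green channel is always worth a look, but the blue channel often just adds noise and foxing marks and decreases contrast.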
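As a minimal sketch of that idea, assuming the image was loaded with cv2.imread (which returns a NumPy array in BGR channel order), the red channel can be taken by plain array indexing. The helper name red_channel is hypothetical and not part of the question's code.

```python
import numpy as np


def red_channel(image_bgr: np.ndarray) -> np.ndarray:
    """Return the red channel of a BGR colour image as a 2-D array.

    Assumes OpenCV's channel order (blue, green, red), so red is index 2.
    """
    if image_bgr.ndim != 3 or image_bgr.shape[2] < 3:
        raise ValueError("expected a colour image with at least 3 channels")
    # Copy so that later in-place edits do not modify the original image.
    return image_bgr[:, :, 2].copy()
```

In the question's preprocess_image, this would stand in for the cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) step. Index 1 gives the green channel for comparison.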

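To make text more readable, "equalise histogram" isn't really what you want to do to a monochrome image. What you actually want to do is crush all of the histogram peaks at the top end, near white, to 255 and the smaller peak at the bottom end to 0, and then distribute the N remaining intermediate values linearly to (255*k)/N. The exact choice of cutoffs isn't sensitive, but this step will destroy halftone images in favour of making text contrast sharper.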
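One way to sketch this levels adjustment with NumPy is shown below. The function name stretch_levels and the default cutoffs (60 and 180) are illustrative assumptions; in practice the cutoffs would be read off the image's histogram, just below the paper peak and just above the ink peak.

```python
import numpy as np


def stretch_levels(gray: np.ndarray, black_point: int = 60,
                   white_point: int = 180) -> np.ndarray:
    """Clip dark values to 0 and bright values to 255, then spread the
    intensities in between linearly over the full 0..255 range.

    black_point and white_point are hypothetical cutoffs, not tuned values.
    """
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scale = 255.0 / (white_point - black_point)
    # Work in float so that values below black_point go negative
    # instead of wrapping around in uint8 arithmetic.
    stretched = (gray.astype(np.float32) - black_point) * scale
    return np.clip(stretched, 0, 255).astype(np.uint8)
```

This would take the place of the cv2.equalizeHist and cv2.convertScaleAbs calls in the question's pipeline.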

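If that is not enough and the text is still blurred, an unsharp mask matched to the blur measured empirically from the image usually works better than a crude "sharpen" kernel.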

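You should be able to avoid Gaussian-blurring the image at all (that step is only needed in your workflow as a side effect of equalising the histogram). If any further noise preprocessing really is needed, salt-and-pepper noise removal, a median filter, or edge-preserving smoothing are all better bets.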

That said, a Gaussian blur can help make scanned halftone images acceptable.

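Sometimes, on a particularly messy page that has been through the wars, one denoising method will work better than any other. The workflow above should work for most material, but some awkward cases need additional manual intervention to get the best result.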
