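I'm developing a Django application that extracts text from images of newspaper clippings. The images are typically low-contrast and blurry, and they contain a variety of text blocks: headlines, dates, small text, bold text, and some blurred letters. Each image is about 256x350 pixels at 96 dpi.
I have tried a number of preprocessing techniques to enhance the images before feeding them to Tesseract OCR, but I still don't get satisfactory results. Here is my current code: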
import os
from django.shortcuts import render
from django.core.files.storage import FileSystemStorage
import pytesseract
import cv2
from PIL import Image
import numpy as np
import re
# Set Tesseract command
pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

def clean_text(text):
    # Example of cleaning common OCR errors
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'[^\w\s,.!?;:()"\']', '', text)  # Keep common punctuation
    return text.strip()

def home(request):
    return render(request, 'layout/index.html')

def preprocess_image(image_path):
    # Read the image
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError(f"Error loading image: {image_path}")
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Histogram equalization to enhance contrast
    gray = cv2.equalizeHist(gray)
    # Increase contrast further if necessary
    gray = cv2.convertScaleAbs(gray, alpha=2.5, beta=50)
    # Resize the image to increase readability (aggressively)
    height, width = gray.shape
    scale_factor = 2  # Increase the scale factor for more aggressive resizing
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)
    resized_image = cv2.resize(gray, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    # Apply Gaussian Blur to remove noise
    blurred_image = cv2.GaussianBlur(resized_image, (3, 3), 0)
    # Apply adaptive thresholding and combine with Otsu's method
    _, binary_otsu = cv2.threshold(blurred_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    binary_adaptive = cv2.adaptiveThreshold(blurred_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    # Combine the results
    binary = cv2.bitwise_and(binary_adaptive, binary_otsu)
    # Sharpen the image
    kernel_sharpening = np.array([[0, -1, 0],
                                  [-1, 5, -1],
                                  [0, -1, 0]])
    binary = cv2.filter2D(binary, -1, kernel_sharpening)
    # Apply dilation and erosion to remove noise
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.dilate(binary, kernel, iterations=1)
    binary = cv2.erode(binary, kernel, iterations=1)
    return binary

def upload(request):
    if request.method == 'POST' and request.FILES.get('document'):
        document = request.FILES['document']
        fs = FileSystemStorage()
        filename = fs.save(document.name, document)
        uploaded_file_url = fs.url(filename)
        # Preprocess the uploaded document
        try:
            preprocessed_image = preprocess_image(fs.path(filename))
        except ValueError as e:
            return render(request, 'layout/index.html', {'error': str(e)})
        # Convert the processed image to PIL format
        preprocessed_image_pil = Image.fromarray(preprocessed_image)
        # Extract text using Tesseract
        custom_config = r'--oem 3 --psm 12 --dpi 1500'
        text = pytesseract.image_to_string(preprocessed_image_pil, config=custom_config)
        # Clean the extracted text
        text = clean_text(text)
        context = {
            'uploaded_file_url': uploaded_file_url,
            'text': text,
        }
        return render(request, 'layout/index.html', context)
    return render(request, 'layout/index.html')
<!DOCTYPE html>
<html>
<head>
<title>Document Layout Detection</title>
</head>
<body>
<h1>Upload Document</h1>
<form method="post" enctype="multipart/form-data" action="{% url 'upload' %}">
{% csrf_token %}
<input type="file" name="document">
<button type="submit">Upload</button>
</form>
{% if uploaded_file_url %}
<h2>Uploaded Document:</h2>
<img src="{{ uploaded_file_url }}" alt="Document">
<h2>Extracted Text:</h2>
<pre>{{ text }}</pre>
{% endif %}
</body>
</html>
from django.urls import path
from . import views
urlpatterns = [
    path('', views.home, name='home'),
    path('upload/', views.upload, name='upload'),
]
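It is a reasonable heuristic that most paper in printed documents is white (or, for older material, yellow with orange foxing spots). So if you have colour scans available, the red channel on its own is often a better, higher-contrast starting point. The green channel is always worth a look, but the blue channel will typically only add noise and foxing marks and reduce contrast.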
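A minimal sketch of taking the red channel as the starting point, assuming the image was loaded with `cv2.imread(..., cv2.IMREAD_COLOR)` and is therefore in OpenCV's B, G, R channel order. The helper name `red_channel` is made up for illustration.

```python
import numpy as np


def red_channel(image_bgr):
    """Return the red plane of a BGR image as a 2-D array (hypothetical helper)."""
    if image_bgr.ndim != 3 or image_bgr.shape[2] < 3:
        raise ValueError("expected a colour image with at least 3 channels")
    # OpenCV stores colour images as B, G, R, so red is channel index 2.
    # ascontiguousarray makes a compact copy that later cv2 calls accept.
    return np.ascontiguousarray(image_bgr[:, :, 2])
```

In the question's `preprocess_image`, this would stand in for the `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)` step, for example `gray = red_channel(image)`. Swapping index 2 for 1 gives the green channel for comparison.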
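To make text more readable, "equalise histogram" is not really what you want to do to a monochrome image. What you actually want is to crush all of the histogram peaks at the top end, near white, to 255 and the smaller peak at the bottom end to 0, then spread the N remaining mid-range values linearly, mapping the k-th of them to (255*k)/N. The exact choice of cutoffs is not critical. This step will wreck half-tone images, but it makes the text contrast much crisper.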
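The level crushing described above can be sketched with a 256-entry lookup table. This is only an illustration: the function name `crush_levels` and the default cutoffs 60 and 190 are assumptions, and in practice the cutoffs would be read off the histogram of the actual scans.

```python
import numpy as np


def crush_levels(gray, low=60, high=190):
    """Map values <= low to 0, values >= high to 255, and stretch the rest linearly.

    `low` and `high` are illustrative cutoffs, not values from the original post.
    """
    if gray.dtype != np.uint8:
        raise TypeError("expected an 8-bit single-channel image")
    if not 0 <= low < high <= 255:
        raise ValueError("need 0 <= low < high <= 255")
    levels = np.arange(256, dtype=np.float64)
    # With N = high - low remaining levels, level low + k maps to 255 * k / N.
    lut = np.clip((levels - low) * 255.0 / (high - low), 0, 255)
    lut = np.rint(lut).astype(np.uint8)
    # Indexing the table with the image applies the mapping to every pixel.
    return lut[gray]
```

Used in place of `cv2.equalizeHist` and `cv2.convertScaleAbs`, this keeps the paper background at pure white and the ink at pure black without amplifying noise in the mid-tones.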
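If that isn't enough and the text is still blurred, unsharp masking matched to the empirically determined blur of the image usually works best, rather than a crude "sharpen" kernel.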
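You should be able to avoid Gaussian blurring the image entirely (that step is only needed in your workflow because of the side effects of equalise histogram). If you do need further noise preprocessing, salt-and-pepper noise removal, a median filter, or edge-preserving smoothing are all better choices.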
That said, Gaussian blur can help make scanned half-tone images acceptable.
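Sometimes, on particularly messy pages that have been through the wars, one denoising method works far better than any of the others. The workflow above should work for most material, but some awkward cases need extra manual intervention to get the best results.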