已解决-以横向格式检测图像并将其切成两幅肖像

Question

我已经扫描了PDF格式的文档。这些文档包含纵向为一页的页面和横向为两页的页面。

我将需要对它们执行OCR处理，但是我将需要使用bash脚本重新格式化它们。

我可以用pdfimages从PDF中提取图像，将它们与img2pdf放在一起，并用ocrmypdf进行OCR处理。

但是我很难使用ImageMagick的实用程序来检测它们的方向，并在必要时以纵向模式将其切成两幅图像。您应该知道，并非所有扫描都具有相同的尺寸，并且纵向和横向图像混合在同一PDF中。

从现在开始，我只有脚本的开头：

#!/bin/bash
for i in *.pdf;
do
  # Créer le fichier PDF avec OCR
  ocrmypdf --language fra --deskew --remove-background --clean-final --optimize 3 "$i" OCR/"$i"
done

以及一些测试，例如：pdfimages "MyFile.pdf" tmp/"MyFile"和img2pdf tmp/*.ppm | ocrmypdf --language fra --deskew --remove-background --clean-final --optimize 3 - OCR/"MyFile.pdf"

任何人都有一个想法如何执行这些测试，并且文档中只有纵向页面吗？

谢谢，祝您有美好的一天！

Answer 1

没关系，我终于设法编写了脚本。如果有人需要，我会在这里分享。

#!/bin/bash
for pdf in *.pdf;
do
  # Displays the PDF file to be processed
  echo "Processing of file \"$pdf\"…"

  # Temporary folder
  TmpRep="/tmp/conversion$(date +%Y%m%d)$(date +%H%M%S)"
  mkdir --parents "$TmpRep"

  # Extract the PDF to the temporary directory
  pdfimages "$pdf" "$TmpRep/${pdf%.pdf}"

  for img in "$TmpRep"/*.ppm;
  do
    # Retrieves the image sizes
    Width=$(identify -format "%w" "$img")
    Height=$(identify -format "%h" "$img")

    # Checks if the image is landscape
    if [ $Width -gt $Height ]
    then
      # Cutting into two portraits
      convert "$img" -crop 2x1@ "${img%.*}_%d.ppm"

      # Deletes the original
      rm "$img"
    fi
  done

  # Create the PDF file with OCR from the images
  img2pdf "$TmpRep"/*.ppm | ocrmypdf --language eng --deskew --remove-background --clean-final --tesseract-timeout 240 --optimize 3 - "${pdf%.pdf} - OCR.pdf"

  # Deletes the temporary folder
  rm -rf "$TmpRep"/
  echo "Processing of file \"$pdf\" done."
done

此脚本处理当前目录中的所有PDF文件。它将图像提取到一个临时目录中。将风景图像切成两半。重新创建在其上执行OCR处理的PDF文件。和一些清理。新的PDF以OldName-OCR.pdf。

结尾。

已解决-以横向格式检测图像并将其切成两幅肖像

问题描述投票：0回答：1

1个回答

最新问题

已解决-以横向格式检测图像并将其切成两幅肖像

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1