How do I write a shell script to find the number of pages in a PDF?

Question

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

Answer 1

Without any extra package:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1
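
To unpack the sed expression above, here is a minimal sketch of the same idea as a reusable function. The name max_count_from_stdin is hypothetical and not part of the original answer. It assumes the page tree is stored uncompressed; a PDF that keeps its objects in compressed object streams may expose no /Count entry to strings at all, in which case nothing is printed.

```shell
# Hypothetical helper: read text on stdin, print the largest /Count value found.
max_count_from_stdin() {
    # Keep only the digits that follow "/Count " (an optional minus sign is skipped),
    # then sort numerically in descending order and keep the first line.
    # The largest value is taken because the root of the page tree holds the total.
    sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1
}

# Usage (assumes the strings utility is installed):
#   strings < file.pdf | max_count_from_stdin
```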

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also use pdfinfo to sum the page counts of all PDFs found recursively, like this:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'

Answer 2

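The ImageMagick library provides a tool called "identify" which, combined with counting the lines of its output, gets you what you are after. ImageMagick can be installed easily on OS X via brew.

Here is a working bash script that captures the count in a shell variable and prints it back to the screen: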

#!/bin/bash
pdfFile=$1
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
# identify prints one line of info per page; stderr is sent to /dev/null
# count the lines of output
# trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"

And the output of running it:

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$

Answer 3

The pdftotext utility converts PDF files to text format and inserts a page break (a form-feed character, $'\f') between pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  ´-',  the  text  is
       sent to stdout.

Several combinations solve your problem; pick one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
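
One caveat with the grep variant: grep -c counts matching lines, not characters, so two form feeds that land on the same line are counted once. Whether that occurs depends on what pdftotext emits for a given file; the assumption here is that blank pages can yield back-to-back form feeds. A minimal sketch that counts the form-feed characters themselves (count_form_feeds is a hypothetical name, not from the original answer):

```shell
# Hypothetical helper: count the form-feed characters arriving on stdin.
count_form_feeds() {
    # tr -cd '\f' deletes every byte except form feeds; wc -c counts what is left.
    # The final tr strips the padding that some wc implementations add.
    tr -cd '\f' | wc -c | tr -d ' '
}

# Usage (assumes pdftotext is installed):
#   pdftotext file.pdf - | count_form_feeds
```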

Hope this helps!

Answer 4

Here is a version for use directly on the command line (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done

Answer 5

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on at least Ubuntu 18.04 and 20.04):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

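How does this work? Well, if you specify a first page (-f) that is larger than the number of pages in the PDF (I specify page 1000000, which is too large for all known PDFs), pdftoppm prints the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So I redirect that stderr message to stdout with 2>&1, then pipe it to grep to match the (142). part with the regular expression ([0-9]*)\.$ and finally pipe that to grep again with the regular expression [0-9]* to keep just the number, which is 142 in this case. That's it!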
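
The two grep stages can be tried without a PDF at hand by feeding them a sample message. This is a minimal sketch: extract_last_page is a hypothetical name, the sample text is the English wording pdftoppm is assumed to print, and the second pattern is written as [0-9][0-9]* (a small variation) so that it never matches an empty string.

```shell
# Hypothetical helper: pull the last-page number out of the pdftoppm error text on stdin.
extract_last_page() {
    # The first grep keeps only the trailing "(N)." of the message;
    # the second grep keeps only the digits inside it.
    grep -o '([0-9]*)\.$' | grep -o '[0-9][0-9]*'
}

# Sample message (assumed wording of the pdftoppm error for a 142-page PDF).
msg='Wrong page range given: the first page (1000000) can not be after the last page (142).'
printf '%s\n' "$msg" | extract_last_page
```

Only the parenthesised number directly followed by the final period is kept, so the requested first page (1000000) is ignored.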

Wrapper functions and speed tests

Here are a couple of wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}
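
One thing to watch in getNumPgsInPdf2: the password is folded into a single string and then expanded unquoted, so a password containing spaces would be split into several arguments. A minimal sketch of one way around that, reusing the positional parameters as an argument list. The name getNumPgsInPdf3 is hypothetical and this variant is not part of the original answer.

```shell
# Hypothetical variant of getNumPgsInPdf2 that keeps the password as a single word.
getNumPgsInPdf3() {
    _pdf="$1"
    _password="${2-}"

    # Rebuild "$@" so it holds only the optional password arguments.
    if [ -n "$_password" ]; then
        set -- -upw "$_password"
    else
        set --
    fi

    # Same trick as above: ask for an impossible first page and parse the error text.
    pdftoppm "$@" "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9][0-9]*'
}

# Usage:
#   num_pgs="$(getNumPgsInPdf3 "path/to/mypdf.pdf" "my pass word")"
```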

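Testing them with the time command in front shows that strings is very slow, taking ~0.200 sec on a 142-page PDF, whereas pdftoppm is very fast, taking ~0.020 sec or less on the same PDF. The pdfinfo technique in Ocaso's answer is also very fast, about the same as the pdftoppm technique.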

See also

These great answers by Ocaso Protal.
The functions above will be used in my pdf2searchablepdf project: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.

Answer 6

A mupdf/mutool solution:

mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2

Answer 7

Just dug out an old script (in ksh) that I found:

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages

Answer 8

If you are on macOS, you can query the PDF metadata like this:

mdls -name kMDItemNumberOfPages -raw file.pdf

As shown here: https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal

Answer 9

Another mutool solution that makes better use of its options:

mutool show file.pdf Root/Pages/Count

Answer 10

I made a small improvement to Marius Hofert's trick so that it sums the returned values.

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'

Answer 11

Building on Marius Hofert's answer, this command uses a bash for loop to show the page count alongside the file name, and it ignores the case of the file extension.

for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
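
The per-file listing and the summing loop can be combined. Below is a minimal sketch, assuming pdfinfo is installed; report_pages is a hypothetical name. It prints one "count, tab, file" line per PDF, skips files for which pdfinfo yields no numeric count, and ends with a total line.

```shell
# Hypothetical helper: print "<pages><TAB><file>" for each argument, then a total line.
report_pages() {
    total=0
    for f in "$@"; do
        # Same extraction as the pdfinfo answers above.
        n=$(pdfinfo "$f" 2>/dev/null | awk '/^Pages:/ {print $2}')
        case $n in
            # Empty or non-numeric result: report on stderr and move on.
            ''|*[!0-9]*) echo "skipped: $f" >&2; continue ;;
        esac
        printf '%s\t%s\n' "$n" "$f"
        total=$((total + n))
    done
    printf '%s\ttotal\n' "$total"
}

# Usage:
#   report_pages *.[pP][dD][fF]
```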

Answer 12

QPDF offers the simplest way I know of.

qpdf --show-npages input.pdf
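
Since which tool is installed differs from machine to machine, a script may want to fall back from one to another. A minimal sketch under that assumption, reusing the qpdf, pdfinfo and mutool commands from the answers above; pdf_page_count is a hypothetical name and the order of preference is arbitrary.

```shell
# Hypothetical helper: print the page count using whichever tool is available.
pdf_page_count() {
    if command -v qpdf >/dev/null 2>&1; then
        qpdf --show-npages "$1"
    elif command -v pdfinfo >/dev/null 2>&1; then
        pdfinfo "$1" | awk '/^Pages:/ {print $2}'
    elif command -v mutool >/dev/null 2>&1; then
        mutool info "$1" | grep '^Pages' | cut -d ' ' -f 2
    else
        echo "no supported PDF tool found" >&2
        return 1
    fi
}

# Usage:
#   pages="$(pdf_page_count input.pdf)"
```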
