Extracting PDF text with iText

Question

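We are doing research on information extraction, and we would like to use iText.

We are currently exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract the text of a PDF line by line with iText? I have read a related question on Stack Overflow, but it only reads the text rather than extracting it. Can anyone help me with this? Thank you.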

Answer 1

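As Theodore said, you can extract text from a PDF, and as Chris pointed out:

as long as it is actually text (not outlines or bitmaps)

The best thing to do is to buy Bruno Lowagie's book "iText in Action". In the second edition, chapter 15 covers text extraction.

You can also look at the examples on his website: http://itextpdf.com/examples/iia.php?id=279

You can parse a PDF to produce a plain text file. Here is a code example: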

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

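Mind the license

This example only works with the AGPL version of iText.

If you look at the other examples, they show how to leave out parts of the text or how to extract only part of a PDF.

Hope this helps.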
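Since the question asks about text per line, here is a minimal sketch of splitting the string returned by `getResultantText()` into individual lines. It assumes the extraction strategy separates lines with line breaks; the class name `PageTextLines` and the helper `splitLines` are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

public class PageTextLines {

    /** Splits extracted page text into non-blank lines with surrounding whitespace removed. */
    static List<String> splitLines(String pageText) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : pageText.split("\\R")) {
            if (!line.isBlank()) {
                lines.add(line.strip());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by strategy.getResultantText().
        String pageText = "Preface\r\n\r\n  This book is about iText.  \nSecond line";
        splitLines(pageText).forEach(System.out::println);
    }
}
```

Whether these lines match the visual lines of the page depends on how the PDF was produced, so treat the result as a starting point.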

Answer 2

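iText lets you do that, but it makes no guarantee about the granularity of the text chunks; that depends on the actual PDF renderer that was used to produce the document.

It is quite possible that every word, or even every letter, has its own text chunk. The chunks also do not need to be in lexical (reading) order; for reliable results you may have to reorder them based on their coordinates. You may also need to work out whether a space has to be inserted between adjacent chunks.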
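To make the reordering idea concrete, here is a rough sketch in plain Java (16 or later, because it uses a record). It does not call iText at all: `Chunk` is a hypothetical stand-in for whatever position data your extraction listener collects, and `lineTolerance` and `spaceGap` are illustrative thresholds you would have to tune for your documents.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ChunkReorderSketch {

    /** Hypothetical text chunk: text, left and right x-coordinates, baseline y. */
    record Chunk(String text, double left, double right, double baseline) {}

    /** Groups chunks into lines (top to bottom), orders each line left to right, and joins it. */
    static List<String> toLines(List<Chunk> chunks, double lineTolerance, double spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y-coordinates grow upward, so higher baselines come first.
        sorted.sort(Comparator.comparingDouble(Chunk::baseline).reversed());

        // Put a chunk on the current line if its baseline is within
        // lineTolerance of the first chunk of that line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).baseline() - c.baseline()) <= lineTolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble(Chunk::left));
            StringBuilder sb = new StringBuilder();
            Chunk prev = null;
            for (Chunk c : line) {
                // Insert a space only when the horizontal gap is wide enough.
                if (prev != null && c.left() - prev.right() > spaceGap) {
                    sb.append(' ');
                }
                sb.append(c.text());
                prev = c;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Chunks deliberately given out of order, with one word split in two.
        List<Chunk> chunks = List.of(
            new Chunk("world", 40, 65, 700),
            new Chunk("Second", 10, 45, 680),
            new Chunk("Hel", 10, 22, 700.2),
            new Chunk("lo", 22.1, 32, 700));
        toLines(chunks, 1.0, 2.0).forEach(System.out::println);
    }
}
```

Grouping by baseline before sorting by x keeps chunks with slightly different baselines on the same line. This sketch ignores rotated text and multi-column layouts.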

Answer 3

In newer versions of iText:

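import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

public static void main(String[] args) throws Exception {
    try (var document = new PdfDocument(new PdfReader("your.pdf"))) {
        // Page numbers are 1-based, so the loop must include the last page.
        for (int i = 1; i <= document.getNumberOfPages(); i++) {
            // Use a fresh strategy per page; a reused one keeps the text of earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            String text = PdfTextExtractor.getTextFromPage(document.getPage(i), strategy);
            System.out.println(text);
        }
    }
}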

Answer 4

One question. When I try to read my PDF, I get this output:

f...'11MPLES \n���������������������������\nILIClotüL \n�\n��������������������������� !\n\"#$���������������������������\n\"#$��%&��’�()(*�+\n,�-\"./-����0�1��2&���3�3�\n�\n456789:;<=>?@A6B:6C:8D;EFGE8D@\nHI,J�KL2(M����N3�0!3��O I����P�#��2��(�+��,PQR’QS\"IHP�TQUHV�H\"Q�WT-\"\n-�X�����\"Y��X$����� �������0 Q�)(������\"#$��%&���H��#�XZ*M(� ’#X�*X��#�+��.(�#+�2�I�M(�*�+��.(�\n�\n[56789:;<=>?@A6B=6\\]G;=>^:6_‘abc4[d[_[4_4__4\n,��e�������\

Is there any way to fix this?

I have two PDFs. For one of them, pdfVersion in the pdfReader shows 52 '4', and it reads nicely. The bad one has pdfVersion 54 '7' and returns output like the above.
