How can I search for text in a PDF and get the table area using Camelot?

Question

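I am using Camelot to extract tabular data from PDFs. Camelot works well, but one page contains several tables and I only need one of them. I would like to locate that table with a regular-expression search.

If I run the code below with the table area specified, it finds the table. (If I do not specify the parameter, it assumes the whole page is a single table.)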

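    import camelot

    # table_areas is "x1,y1,x2,y2": top-left corner, then bottom-right corner, in PDF coordinates
    table = camelot.read_pdf(file, flavor="stream", pages='5', table_areas=['20, 530, 550, 350'], row_tol=15)

    # Plot the text boxes Camelot detected on the page
    camelot.plot(table[0], kind='contour')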

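The blue boxes are text. I only care about the table of text in the red box.

My question: given that I know the text I am searching for, how can I search for it and get an approximate table area that I can then pass to Camelot? I already have working code that does the regex search (PyMuPDF).

Since Camelot returns the text, I have to assume there is a way to get the box coordinates, but I could not find one in the documentation, which is here: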

https://camelot-py.readthedocs.io/en/master/api.html#lower-level-classes

I am sure there is an OpenCV solution, but I would like to use Camelot first if possible. Any help is appreciated. Thanks.

Answer 1

If your table has a clear outer border, you can run the Lattice parser first and then pass the bbox it returns to Stream as the table area.


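    import camelot
    from pypdf import PdfReader  # assumption: the original does not show how `reader` was created

    file_path = "example.pdf"  # placeholder path
    reader = PdfReader(file_path)  # used only to count the pages

    for page_number in range(1, len(reader.pages) + 1):
        # First pass: Lattice finds tables from their ruling lines
        lattice_tables = camelot.read_pdf(file_path, pages=str(page_number), flavor='lattice', line_scale=40)
        for lattice_table in lattice_tables:
            # _bbox is a private attribute: (x1, y1, x2, y2) in PDF coordinates
            x1, y1, x2, y2 = lattice_table._bbox

            # table_areas expects left, top, right, bottom, so x1 <= x2 and y1 >= y2
            if x1 > x2:
                x1, x2 = x2, x1
            if y1 < y2:
                y1, y2 = y2, y1

            table_area = f"{int(x1)},{int(y1)},{int(x2)},{int(y2)}"
            try:
                # Second pass: Stream, restricted to the area Lattice detected
                stream_tables = camelot.read_pdf(file_path, pages=str(page_number), flavor='stream', table_areas=[table_area])
            except (ZeroDivisionError, ValueError, IndexError):
                # Fall back to the Lattice result if Stream fails on this area
                stream_tables = [lattice_table]
            for t, stream_table in enumerate(stream_tables):
                pass  # Do what you need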
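
If the table has no border for Lattice to detect, the same `table_areas` string can be built from the rectangle of a text-search hit instead. The sketch below is only an illustration under several assumptions: the hit is given as `(x0, y0, x1, y1)` with the origin at the top-left of the page (which is how PyMuPDF reports rectangles), the page is not rotated, and the table spans the page width and extends a guessed distance below the matched text. The function names and the `depth` and `margin` parameters are hypothetical and not part of Camelot or PyMuPDF.

```python
def to_table_area(x_a, y_a, x_b, y_b):
    """Format two opposite corners, given in PDF coordinates (origin at the
    bottom-left), as a Camelot table_areas string: left,top,right,bottom."""
    left, right = sorted((x_a, x_b))
    bottom, top = sorted((y_a, y_b))
    return f"{int(left)},{int(top)},{int(right)},{int(bottom)}"


def search_hit_to_table_area(hit, page_width, page_height, depth=180, margin=10):
    """Turn a text-search hit into an approximate table area.

    hit         -- (x0, y0, x1, y1) with the origin at the top-left of the page
    depth       -- guessed height of the table below the matched text (hypothetical)
    margin      -- padding around the area, in points (hypothetical)
    """
    x0, y0, x1, y1 = hit
    # Flip the y axis: PDF coordinates grow upward from the bottom of the page.
    top = min(page_height - y0 + margin, page_height)
    bottom = max(page_height - y1 - depth, 0)
    # Assume the table spans the page width, minus the margin on each side.
    return to_table_area(margin, top, page_width - margin, bottom)


if __name__ == "__main__":
    # A hit near the middle of a US Letter page (612 x 792 points).
    print(search_hit_to_table_area((72, 262, 200, 274), 612, 792))  # 10,540,602,338
```

With PyMuPDF, the hit could come from something like `tuple(page.search_for("some text")[0])`, with `page.rect.width` and `page.rect.height` as the page size. The resulting string is passed as `table_areas=[area]` to `camelot.read_pdf` with `flavor='stream'`, and `depth` will probably need tuning per document.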
