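I am trying to build a simple PDF scraper with pandas and pdfquery. I want to use the XML coordinates to pull the data I need from every page of a PDF, put it into a DataFrame, and then save the DataFrame as a CSV file. The last part is where I am stuck: I can get the data from a single PDF/page, but I can't get it to work across multiple pages. I am a relative beginner in Python, so any help is much appreciated.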
import pdfquery
import pandas as pd

pdf = pdfquery.PDFQuery(r'path')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)

def pdfscrape(pdf):
    num_1 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("378.0, 759.06, 456.0, 769.06")').text()
    num_2 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("30.0, 431.06, 360.0, 441.06")').text()
    page = pd.DataFrame({ 'num1': num_1,'num2': num_2, },index=[0])
    print(page)

pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
master = pd.DataFrame()
for p in range(pagecount):
    pdf.load(p)
    page = pdfscrape(pdf)
    master = master(pd.concat([page], ignore_index=True))
master.to_csv("output.csv", index=False)
The result I expect is a CSV file containing the required data points from every page of the PDF. Instead, I get:
Traceback (most recent call last):
master = master(pd.concat([page], ignore_index=True))
line 380, in concat
op = _Concatenator(
line 443, in __init__
objs, keys = self._clean_keys_and_objs(objs, keys)
line 539, in _clean_keys_and_objs
raise ValueError("All objects passed were None")
ValueError: All objects passed were None
What you can do is load the pages you are interested in one at a time and run the query against each loaded page:
import pdfquery
import pandas as pd

def read_page(pdf):
    # Bounding box (x0, y0, x1, y1) of the text to extract from the loaded page
    query1 = (56.8, 771.397, 188.992, 783.397)
    text1 = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%s, %s, %s, %s")' % query1).text()
    print(f"From function call: {text1}\n")

pdf = pdfquery.PDFQuery('Doc_for_PDF.pdf')
pdf.load()  # load all pages for the dataframe
pdf.tree.write('pdfXML.xml', pretty_print=True)
df = pd.read_xml('pdfXML.xml', xpath='.//LTTextLineHorizontal')
#print(df.to_string())
print(df.head())
print()

# load page by page here
for i in range(pdf.doc.catalog['Pages'].resolve()['Count']):
    pdf.load(i)     # load() returns None, so load the page first ...
    read_page(pdf)  # ... then query the pdf object itself
Output:
        y0       y1  ...  word_margin                        LTTextBoxHorizontal
0  771.397  783.397  ...          0.1  This Text should be scrappt on first page
1  771.397  783.397  ...          0.1    This Text should be scrappt second page
2  729.997  741.997  ...          0.1                                   This not

[3 rows x 9 columns]
From function call: This Text should be scrappt on first page
From function call: This Text should be scrappt second page
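
As for the traceback in the question: pdfscrape prints the DataFrame but never returns it, so page is None and pd.concat receives a list containing only None. Calling master(...) would fail as well, because a DataFrame is not callable. Below is a minimal sketch of how the question's loop could be restructured, not a definitive implementation. The bounding boxes are copied from the question; the names BOXES, scrape_page, scrape_pdf and main are made up for illustration, and the sketch assumes that pdf.load(p) leaves only page p in the query tree, as in the page-by-page loop above.

```python
import pandas as pd

# Bounding boxes copied from the question; adjust them to your own PDF.
BOXES = {
    "num1": "378.0, 759.06, 456.0, 769.06",
    "num2": "30.0, 431.06, 360.0, 441.06",
}


def scrape_page(pdf):
    """Return a one-row DataFrame for the page currently loaded in `pdf`."""
    row = {
        name: pdf.pq(f'LTTextBoxHorizontal:overlaps_bbox("{bbox}")').text()
        for name, bbox in BOXES.items()
    }
    # Return the frame instead of only printing it.
    return pd.DataFrame([row])


def scrape_pdf(pdf, page_count):
    """Load each page in turn and stack the per-page rows into one DataFrame."""
    frames = []
    for p in range(page_count):
        pdf.load(p)                      # load a single page
        frames.append(scrape_page(pdf))  # collect one row per page
    # Concatenate once at the end rather than inside the loop.
    return pd.concat(frames, ignore_index=True)


def main(path):
    """Hypothetical entry point: scrape `path` and write output.csv."""
    import pdfquery  # imported here so the helpers above work without it

    pdf = pdfquery.PDFQuery(path)
    pdf.load()
    page_count = pdf.doc.catalog["Pages"].resolve()["Count"]
    scrape_pdf(pdf, page_count).to_csv("output.csv", index=False)
```

Calling main(r'path') with the real file path should then write one row per page to output.csv. Collecting the per-page frames in a list and concatenating once avoids rebuilding master on every iteration.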