# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# printing number of pages in pdf file
print(len(pdfReader.pages))
# creating a page object
pageObj = pdfReader.pages[89]
# extracting text from page
print(pageObj.extract_text())
# closing the pdf file object
pdfFileObj.close()
I can extract text from a single page, but not from multiple pages.
You just need to iterate over all the pages of the PDF.
Look at this line:
pageObj = pdfReader.pages[89]
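What this actually does is fetch page 90 of the document, because pages are zero-indexed (counting starts at 0, so page 1 is index 0, page 2 is index 1, and so on).
Instead, loop over all the pages like this: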
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
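If you only want some of the pages rather than all of them, the same zero-based indexing applies. Here is a minimal sketch; the helper name `extract_page_range` and its 1-based, inclusive arguments are my own choices for illustration, not part of PyPDF2. It only assumes that `pages` supports `len()` and indexing and that each page has `extract_text()`, which is how `pdfReader.pages` is used in your code.

```python
def extract_page_range(pages, first_page, last_page):
    """Return the text of pages first_page..last_page (1-based, inclusive).

    `pages` is any sequence of objects with an extract_text() method,
    for example pdfReader.pages.
    """
    if first_page < 1 or last_page > len(pages) or first_page > last_page:
        raise ValueError("page range is outside the document")

    texts = []
    # Page N of the document lives at index N - 1.
    for index in range(first_page - 1, last_page):
        texts.append(pages[index].extract_text())
    return texts
```

For example, extract_page_range(pdfReader.pages, 90, 95) would return a list of six strings, starting with the same page that pdfReader.pages[89] gives you.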
To save the text to a file:
Open the file before the loop and append each page's text to the end of the file.
# backslashes are doubled so Python does not treat them as escape sequences
out_file = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.txt', 'a')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    out_file.write(page_text)
out_file.close()
Your whole code could look like this:
import PyPDF2
# backslashes are doubled so Python does not treat them as escape sequences
pdfFileObj = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.pdf', 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# 'a' appends to the output file instead of overwriting it
out_file = open('C:\\sem1\\691-project\\Dataset\\Maths\\A Spiral Workbook for Discrete Mathematics.txt', 'a')
for pageObj in pdfReader.pages:
    page_text = pageObj.extract_text()
    print(page_text)
    out_file.write(page_text)
out_file.close()
pdfFileObj.close()
Note that this code always appends to the end of the file
'C:\sem1\691-project\Dataset\Maths\A Spiral Workbook for Discrete Mathematics.txt'
so if you run it more than once, be sure to empty the file first.
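If you would rather not empty the file by hand, one option is to open it in 'w' mode, which truncates it each time it is opened. The sketch below is only an illustration: the function name `save_pages_text`, the newline written after each page, and the UTF-8 encoding are my assumptions, not something your code or PyPDF2 requires. It works on any iterable of objects with `extract_text()`, such as `pdfReader.pages`.

```python
def save_pages_text(pages, out_path):
    """Write the text of every page to out_path, replacing old contents."""
    # 'w' truncates the file on open, so reruns do not pile up duplicates.
    # The with block closes the file even if extract_text() raises.
    with open(out_path, "w", encoding="utf-8") as out_file:
        for page in pages:
            out_file.write(page.extract_text())
            # Assumption: a newline after each page keeps page boundaries visible.
            out_file.write("\n")
```

You would call it as save_pages_text(pdfReader.pages, path_to_txt) while the PDF file is still open. Running it twice leaves the same contents as running it once.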