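I'm trying to write a program that walks through all folders and subfolders, finds every OpenOffice document, opens it, and counts the words in the file. The idea is to sum the counts afterwards and print the total number of words found in a given folder.
I'm using the odfpy library to work with .odt files, but all the examples and documentation I can find are more concerned with adding content to a document, getting the styles of a given element, replacing something, and so on. I can't find any documentation or examples on how to simply get the text out of a document.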
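Edit: Thank you, karatekraft, your answer is exactly what I needed. Your code seems to get the total number of characters rather than the number of words, but fixing that is at least within my abilities!
The new def count_words_in_file(file_list) looks like this! (It currently only checks the document set in the file_path variable, but it's too late tonight to fix that.)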
from odf import text, teletype
from odf.opendocument import load

def count_words_in_file(file_list):
    # This function will open all found .odt files, count the words, and then sum the total
    # ADJUST SO IT DOES THE SEARCH FOR ALL FILES
    file_path = "test.odt"
    # Read document
    document_text = load(file_path)
    # Get all paragraphs in document
    all_paragraphs = document_text.getElementsByType(text.P)
    final_word_count = 0
    # For each paragraph, extract text and count number of words.
    for paragraph in all_paragraphs:
        # Named paragraph_text so it does not shadow the imported odf.text module
        paragraph_text = teletype.extractText(paragraph)
        # split() with no argument splits on any whitespace and drops empty strings
        words = paragraph_text.split()
        print(words)
        final_word_count = final_word_count + len(words)
    print(f"Final word count: {final_word_count}")
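To run the count over every path in file_list instead of the hard-coded file_path, something along these lines might work. This is only a sketch: the names count_words, count_words_in_odt and total_words are made up for illustration, and the odfpy imports are kept inside count_words_in_odt so the other helpers can be tried without the library installed.

```python
def count_words(paragraph_texts):
    """Sum whitespace-separated words over an iterable of strings."""
    return sum(len(t.split()) for t in paragraph_texts)


def count_words_in_odt(file_path):
    """Return the word count of one .odt file (requires odfpy)."""
    # Imported here so the pure helpers work without odfpy installed.
    from odf import text, teletype
    from odf.opendocument import load

    document = load(file_path)
    paragraphs = document.getElementsByType(text.P)
    return count_words(teletype.extractText(p) for p in paragraphs)


def total_words(file_list, counter=count_words_in_odt):
    """Sum the word counts of every .odt path in file_list.

    `counter` is a hypothetical hook: any function mapping a path to an int.
    """
    return sum(
        counter(path)
        for path in file_list
        if path.lower().endswith(".odt")
    )
```

With this shape, main() could call total_words(file_list) and print the result once, rather than printing inside the per-file function.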
# This program will count the number of words and .odt docs
# in a folder and all its sub-folders. For ease of use it will check the folders above its
# current directory.
# Import the needed libraries
import os
from odf.opendocument import OpenDocumentText
from odf import text
# Make sure the odfpy module is installed before running this; it was a pain to set up.

def main():
    # This variable is the current location of the script, obtained with os.path
    current_dir = os.path.dirname(os.path.abspath(__file__))
    # This variable points at the directory above the current one.
    above_dir = os.path.join(current_dir, os.pardir)
    # Call the function to scan for .odt files
    file_list = scan_for_files(above_dir)
    # Call the function to open and count the .odt files
    count_words_in_file(file_list)

def scan_for_files(above_dir):
    # This list will store the path to all files found.
    file_list = []
    # This for-loop will go through all the folders that can be found
    for folder, subfolders, files in os.walk(above_dir):
        for file in files:
            complete_path = os.path.join(folder, file)
            file_list.append(complete_path)
    return file_list

def count_words_in_file(file_list):
    # This function will open all found .odt files, count the words, and then sum the total
    for file in file_list:
        if file.endswith(".odt"):
            # Note: OpenDocumentText() creates a new, empty document; it does not read `file`
            textdoc = OpenDocumentText()
            for paragraph in textdoc.body.childNodes:
                print(paragraph)

main()
You can try this. It finds all paragraphs, extracts the text from each one, and gets the total word count.
from odf import text, teletype
from odf.opendocument import load

file_path = "my_file.odt"
# Read document
document_text = load(file_path)
# Get all paragraphs in document
all_paragraphs = document_text.getElementsByType(text.P)
final_word_count = 0
# For each paragraph, extract the text and add its length to the total.
for paragraph in all_paragraphs:
    # Named paragraph_text so it does not shadow the imported odf.text module
    paragraph_text = teletype.extractText(paragraph)
    # Note: len() of a string counts characters; use len(paragraph_text.split()) to count words
    final_word_count = final_word_count + len(paragraph_text)
print(f"Final word count: {final_word_count}")
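For anyone comparing numbers: the gap between a character count and a word count comes down to what you take the length of. A small illustration in plain Python (the sample string is made up, and no odfpy is needed for this part):

```python
sample = "The quick  brown fox"  # note the double space

chars = len(sample)                   # length of the string itself
naive_words = len(sample.split(" "))  # splitting on a single space keeps an empty string
words = len(sample.split())           # splitting on any whitespace drops empty strings

print(chars)        # 20
print(naive_words)  # 5
print(words)        # 4
```

So applying len() to the split() result of each paragraph's extracted text, rather than to the text itself, should give a word count.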
Building on karatekraft's answer, you can also query the document metadata.
from odf import meta
from odf.opendocument import load
file_path = "my_file.odt"
document_text = load(file_path)
word_count = int(document_text.meta.getElementsByType(meta.DocumentStatistic)[0].getAttribute("wordcount"))
print(f"Word count: {word_count}")
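One caveat with the metadata route: the statistic is whatever the editor wrote when the file was last saved, and a file produced by another tool may not carry it at all, in which case indexing [0] raises an IndexError. A more defensive sketch is below. The helper names parse_word_count and metadata_word_count are hypothetical, and I'm assuming a missing or malformed attribute should simply yield None.

```python
def parse_word_count(raw):
    """Convert a raw attribute value to an int, or None if unusable."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None


def metadata_word_count(file_path):
    """Return the word count stored in the file's metadata, or None."""
    # Imported lazily; this function requires odfpy.
    from odf import meta
    from odf.opendocument import load

    document = load(file_path)
    stats = document.meta.getElementsByType(meta.DocumentStatistic)
    if not stats:
        return None  # no statistics element saved in this file
    return parse_word_count(stats[0].getAttribute("wordcount"))
```

When this returns None, falling back to counting the paragraph text as in the other answer is one option.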