With some help, I managed to correct the code from the link and extend it to:
from bs4 import BeautifulSoup
import re
import os

BeforeFinalJob = []
TheFinalResultThatUsed = []
ListOfFiles = []
# Walk the current directory tree and process every .html file
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            ListOfFiles += [thefile]
            with open(thefile, 'r') as f:
                contents = f.read()
            soup = BeautifulSoup(contents, 'lxml')
            Initialtext = soup.get_text()
            MediumText = Initialtext.lower().split()
            # Keep only tokens without digits or punctuation
            TokensToClean = [t for t in MediumText
                             if re.match(r'[^\W\d]*$', t)]
            removementWords = ['here', 'than']
            for somewords in range(len(TokensToClean)):
                if TokensToClean[somewords] not in removementWords:
                    # lists have no .add(); use .append()
                    BeforeFinalJob.append(TokensToClean[somewords])
            # Drop duplicates while keeping first-seen order
            TheFinalResultThatUsed = list(dict.fromkeys(BeforeFinalJob))
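However, the list of files and the words end up in separate lists. How can I get the result as list[a][b], where list[a] is the file name and list[a][b] are the words of file list[a]? Also, list[a] should not contain any duplicates.
For example, list[a] == the file and list[a][b] == some words found in that file.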
UPD: a ZIP file with some of the HTML files:
https://www94.zippyshare.com/v/vJgy2sk1/file.html
The next code builds the following structure:
'filename.html': {
    'path': 'D:\\folder\\filename.html',
    'words': [
        'one',
        'two'
    ]
}
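A minimal sketch of how such a per-file structure could be built, assuming one dictionary entry per file name. The names `TextExtractor`, `words_in_html`, `index_html_files` and `STOP_WORDS` are hypothetical and do not come from the question. To keep the sketch free of third-party dependencies it uses the standard-library `html.parser` in place of BeautifulSoup; `soup.get_text()` from the question could be substituted for the text-extraction step.

```python
import os
import re
from html.parser import HTMLParser

# Same words as removementWords in the question
STOP_WORDS = {'here', 'than'}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (stand-in for soup.get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_in_html(html):
    """Return the unique, lower-cased words of an HTML string in first-seen order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = ' '.join(parser.chunks).lower().split()
    # Keep tokens without digits or punctuation (the question's regex),
    # and skip the stop words
    kept = [t for t in tokens
            if re.match(r'[^\W\d]*$', t) and t not in STOP_WORDS]
    # dict.fromkeys drops duplicates while preserving order
    return list(dict.fromkeys(kept))


def index_html_files(root='.'):
    """Map each .html file name under root to its path and unique words."""
    result = {}
    for dirname, _dirs, files in os.walk(root):
        for filename in files:
            if filename.endswith('.html'):
                path = os.path.join(dirname, filename)
                with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                    result[filename] = {
                        'path': path,
                        'words': words_in_html(f.read()),
                    }
    return result
```

With this, `index_html_files('.')['filename.html']['words']` would give the de-duplicated words of that file. Because the keys are bare file names, two files with the same name in different folders would overwrite each other; keying by `path` instead avoids that.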