Creating a multidimensional list when parsing HTML in Python

Question

With some help, I managed to correct the code from the link and extend it to:

from bs4 import BeautifulSoup
import re
import os
from os.path import join
from click.termui import pause

BeforeFinalJob = []
TheFinalResultThatUsed = []
ListOfFiles = []

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            ListOfFiles += [thefile]
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                TokensToClean = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']


                # BeforeFinalJob is a list, so use append() (lists have no add())
                for token in TokensToClean:
                    if token not in removementWords:
                        BeforeFinalJob.append(token)
TheFinalResultThatUsed = list( dict.fromkeys(BeforeFinalJob) )

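However, the list of files and the words end up in separate lists. How can I get the result as list[a][b], where list[a] is the file name and list[a][b] are the words of file list[a]? Also, there should be no duplicates within list[a].

For example, list[a] == the file and list[a][b] == some words in that file.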

UPD: a ZIP file with some of the HTML files:

https://www94.zippyshare.com/v/vJgy2sk1/file.html


Answer 1

The code builds a structure like this, keyed by file name:

'filename.html': {
    'path': 'D:\\folder\\filename.html',
    'words': [
        'one',
        'two'
    ]
}
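The answer's own code is not preserved here, so the following is only a minimal sketch of how such a structure could be built, not the original implementation. It assumes UTF-8 files, reuses the token filter and the 'here'/'than' removal list from the question, and, to stay free of third-party dependencies, swaps BeautifulSoup for the standard library's html.parser; the names TextExtractor, unique_words and build_index are hypothetical.

```python
import os
import re
from html.parser import HTMLParser

STOP_WORDS = {'here', 'than'}  # the words the question removes


class TextExtractor(HTMLParser):
    """Collects text nodes; a stdlib stand-in for soup.get_text()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def unique_words(html):
    """Lowercase alphabetic words of an HTML string, no duplicates,
    kept in order of first appearance."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = ''.join(parser.chunks).lower().split()
    words = [t for t in tokens
             if re.match(r'[^\W\d]*$', t) and t not in STOP_WORDS]
    # dict.fromkeys() drops duplicates while preserving order
    return list(dict.fromkeys(words))


def build_index(root='.'):
    """Map each .html file name under root to its path and word list."""
    index = {}
    for dirname, dirs, files in os.walk(root):
        for filename in files:
            if filename.endswith('.html'):
                path = os.path.join(dirname, filename)
                # Assumption: the files are UTF-8 encoded
                with open(path, 'r', encoding='utf-8') as f:
                    index[filename] = {'path': path,
                                       'words': unique_words(f.read())}
    return index
```

Calling build_index('.') and then reading index['filename.html']['words'] gives the words of one file. Because the dictionary is keyed by bare file name, two files with the same name in different folders would overwrite each other; keying by path instead avoids that. If the nested-list form from the question is preferred, [[name, entry['words']] for name, entry in index.items()] converts the result.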
