Creating a multidimensional list while parsing HTML in Python

Problem description

With some help, I managed to correct the code from the link and extend it to:

from bs4 import BeautifulSoup
import re
import os
from os.path import join
from click.termui import pause

BeforeFinalJob = []
TheFinalResultThatUsed = []
ListOfFiles = []

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            ListOfFiles += [thefile]
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                TokensToClean = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']


                for somewords in range(len(TokensToClean)):
                    if TokensToClean[somewords] not in removementWords:
                        BeforeFinalJob.append(TokensToClean[somewords])
TheFinalResultThatUsed = list( dict.fromkeys(BeforeFinalJob) )

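However, the list of files and the list of words end up in separate lists. How can I get the result as list[a][b], where list[a] is the file name and list[a][b] is a word from file list[a]? Also, there should be no duplicates within list[a].

For example, list[a] == a file and list[a][b] == some word found in that file.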

UPD: a ZIP file with some of the HTML files:

https://www94.zippyshare.com/v/vJgy2sk1/file.html


python parsing multidimensional-array beautifulsoup
1 Answer

The result can be built as the following structure, with one entry per file:

'filename.html': {
    'path': 'D:\\folder\\filename.html',
    'words': [
        'one',
        'two'
    ]
}
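
A minimal sketch of how such a structure could be built, reusing the tokenizing and filtering steps from the question. The names `STOP_WORDS`, `html_to_text`, `unique_words` and `build_index` are hypothetical helpers introduced for illustration, and keying the dictionary by bare file name is an assumption taken from the structure above.

```python
import os
import re

# Same words the question filters out via removementWords.
STOP_WORDS = {'here', 'than'}


def html_to_text(markup):
    """Extract visible text with BeautifulSoup, as the question does."""
    # Imported lazily so the other helpers work without bs4 installed.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, 'lxml').get_text()


def unique_words(text, stop_words=STOP_WORDS):
    """Return lowercase alphabetic tokens, without stop words or duplicates."""
    tokens = [t for t in text.lower().split()
              if re.match(r'[^\W\d]*$', t)]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(t for t in tokens if t not in stop_words))


def build_index(root='.', to_text=html_to_text):
    """Map each .html file name under root to its path and unique words."""
    index = {}
    for dirname, dirs, files in os.walk(root):
        for filename in files:
            if filename.endswith('.html'):
                path = os.path.abspath(os.path.join(dirname, filename))
                with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                    text = to_text(f.read())
                index[filename] = {'path': path,
                                   'words': unique_words(text)}
    return index
```

With this layout, `build_index('.')['filename.html']['words'][0]` would be the first word of that file. Two files with the same name in different folders would overwrite each other here; keying by `path` instead of `filename` avoids that.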