BeautifulSoup: screen-scraping a list with improperly nested <ul>s

Question

I'm very new to BeautifulSoup, and for the past three days I've been trying to get the list of churches from http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html.

The data does not seem to be properly nested; it is only tagged for presentation purposes. Supposedly, the hierarchy is

Parishes
    District
    (data)
        Vicariate
        (data)
            Church
            (data)

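What I see, however, is that each church starts with a bullet and each entry is separated by two line breaks. The field names I'm after are italicized and separated from the actual data by a ":". Each unit entry (district | vicariate | parish) may have one or more data fields.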

So far I can tease out some of the data, but I can't get the name of the entity to show up.

soup=BeautifulSoup(page)
for e in soup.table.tr.findAll('i'):
    print e.string, e.nextSibling

Ultimately, I'd like to transform the data into columns:

district, vicariate, parish, address, phone, titular, parish priest, <field8>, <field9>, <field99>

A good push in the right direction would be appreciated.

Answer 1

Unfortunately, this is going to be a bit complicated, because in this format some of the data you need is not enclosed in clear markup.

Data model

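Also, your understanding of the nesting is not quite right. The actual Catholic Church structure (not this document's structure) is more like: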

District (also called deanery or vicariate. In this case they all seem to be Vicariates Forane.)
    Cathedral, Parish, Oratory

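Note that there is no requirement that a parish belong to a district/deanery, although they usually do. I think the document is saying that everything listed after a district belongs to that district, but you can't be sure.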

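There is also an entry in there that is not a church but a community (San Lorenzo Filipino-Chinese Community). Such a community has no distinct identity or governance in the church (i.e., it is not a building); rather, it is a non-territorial group of people that a priest is assigned to care for.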

Parsing

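I think you should take a progressive approach:

1. Find all `li` elements; each one is an "item".
2. The name of the item is its first text node.
3. Find all `i` elements: these are the keys (attribute names, column headers, and so on).
4. All text up to the next `i` (separated by `br`) is the value for that key.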
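
The steps above can be sketched without any third-party library. The following is a minimal illustration in Python 3 using the standard library's `html.parser` (a swapped-in technique, not MinimalSoup): because it reacts to tags as events instead of building a tree, it does not care whether the `li` elements end up nested. The class name `ItemParser`, the helper `parse_items`, and the sample markup are all invented for this sketch and are not taken from the real page.

```python
from html.parser import HTMLParser


class ItemParser(HTMLParser):
    """Collect one record per <li>: a name plus a list of (key, value) pairs."""

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per <li> seen so far
        self.in_key = False  # True while we are inside an <i> element
        self.key = None      # text of the most recent <i> in the current item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # Step 1: every <li> starts a new item, closed or not.
            self.items.append({"name": None, "fields": []})
            self.key = None
        elif tag == "i" and self.items:
            # Step 3: an <i> starts a new key.
            self.in_key = True
            self.key = ""

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_key = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.items:
            return
        item = self.items[-1]
        if self.in_key:
            self.key += text
        elif self.key is None:
            # Step 2: the first text node after <li> is the item's name.
            if item["name"] is None:
                item["name"] = text
        else:
            # Step 4: text up to the next <i> is a value of the current key.
            item["fields"].append((self.key.rstrip(":"), text))


def parse_items(html):
    """Hypothetical helper: run ItemParser over a string of HTML."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented markup that imitates the layout described in the question.
SAMPLE = """
<table><tr><td>
<li>Example District
<li>Example Parish<br>
<i>Address:</i> 1 Example Street<br>Example City<br>
<i>Phone:</i> 000-0000
</td></tr></table>
"""

if __name__ == "__main__":
    for item in parse_items(SAMPLE):
        print(item["name"], item["fields"])
```

A key that is followed by several `br`-separated lines produces several pairs with the same key, so the caller decides later whether to join them or keep them apart.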

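One special problem with this page is that its HTML is pathologically bad, and you need to use `MinimalSoup` to parse it correctly. In particular, `BeautifulSoup` thinks the `li` elements are nested, because there is no `ol` or `ul` anywhere in the document!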

This code will give you a list of lists of tuples. Each tuple is a `('key','value')` pair for an item.

Once you have this data structure, you can normalize, transform, nest, etc. however you like, and leave the HTML behind.

# Python 2 with BeautifulSoup 3 (MinimalSoup is not available in BeautifulSoup 4)
from BeautifulSoup import MinimalSoup
import urllib

fp = urllib.urlopen("http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html")
html = fp.read()
fp.close()

soup = MinimalSoup(html)

root = soup.table.tr.td

items = []
currentdistrict = None
# this loops through each "item"
for li in root.findAll(lambda tag: tag.name=='li' and len(tag.attrs)==0):
    attributes = []
    parishordistrict = li.next.strip()
    # look for string "district" to determine if district; otherwise it's something else under the district
    if parishordistrict.endswith(' District'):
        currentdistrict = parishordistrict
        attributes.append(('_isDistrict',True))
    else:
        attributes.append(('_isDistrict',False))

    attributes.append(('_name',parishordistrict))
    attributes.append(('_district',currentdistrict))

    # now loop through all attributes of this thing
    attributekeys = li.findAll('i')

    for i in attributekeys:
        key = i.string # normalize as needed. Will be 'Address:', 'Parochial Vicar:', etc
        # now continue among the siblings until we reach an <i> again.
        # these are "values" of this key
        # if you want a nested key:[values] structure, you can use a dict,
        # but beware of multiple <i> with the same name in your logic
        node = i.nextSibling  # named "node" to avoid shadowing the builtin next()
        while node is not None and getattr(node, 'name', None) != 'i':
            if not hasattr(node, 'name') and getattr(node, 'string', None):
                value = node.string.strip()
                if value:
                    attributes.append((key, value))
            node = node.nextSibling
    items.append(attributes)

from pprint import pprint
pprint(items)
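
As one possible next step toward the column layout asked for in the question, the list of `('key','value')` tuples can be flattened into one row per parish. The sketch below is written for Python 3 and uses only the standard library. The function names `to_rows` and `to_csv`, the `"; "` separator for repeated keys, and the sample data are assumptions made for illustration; the sample only imitates the shape of `items` and is not data from the real page.

```python
import csv
import io

# Invented sample in the same shape as `items`; not data from the real page.
sample_items = [
    [("_isDistrict", True), ("_name", "Example District"),
     ("_district", "Example District")],
    [("_isDistrict", False), ("_name", "Example Parish"),
     ("_district", "Example District"),
     ("Address:", "1 Example Street"), ("Address:", "Example City"),
     ("Phone:", "000-0000")],
]


def to_rows(items):
    """Build one dict per non-district item, joining values of repeated keys."""
    rows = []
    for attributes in items:
        row = {}
        for key, value in attributes:
            if key is None:
                continue  # an <i> whose text could not be read as a single string
            if key.startswith("_"):
                row[key] = value  # bookkeeping fields are kept unchanged
                continue
            name = key.rstrip(":").strip().lower()
            if name in row:
                row[name] = row[name] + "; " + value
            else:
                row[name] = value
        if not row.get("_isDistrict"):
            rows.append(row)
    return rows


def to_csv(rows, leading=("_district", "_name")):
    """Write rows as CSV text: leading columns first, then the rest sorted."""
    others = sorted({k for row in rows for k in row
                     if k not in leading and not k.startswith("_")})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(leading) + others,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are left empty
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(to_rows(sample_items)))
```

Joining repeated keys into a single cell is only one choice; keeping a list per key would preserve multi-line addresses exactly, at the cost of a less tabular result.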
