Grouping web scraping results


While trying to learn web scraping with Python, I'm fetching the lunch menu from this page: http://bramatno8.kvartersmenyn.se/

The page is structured like this:

<div class="menu">
<strong>Monday<br></strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday<br></strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday<br></strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday<br></strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday<br></strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>

So what I have so far is:

import requests
from bs4 import BeautifulSoup

url = 'http://lunchmenu.com'

fetchlunch = requests.get(url)

soup = BeautifulSoup(fetchlunch.text, 'html.parser')

menu = soup.findAll(class_='menu')[0]

for br in menu.find_all('br'):
    br.replace_with('\n')

print(menu.get_text())

This prints the entire menu for the week in one block.

What I want is to get the menu for a single day, i.e. if it's Tuesday, only show Tuesday's menu. So I suppose I need to put the results into an array and then pull out the current day's menu?
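
For the "which day is it" part, a minimal sketch using the standard datetime module, assuming the default C locale so that %A yields English names matching the headings above:

import datetime

# Today's English weekday name, e.g. "Tuesday", to match against the <strong> headings.
today = datetime.date.today().strftime("%A")
print(today)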

python web-scraping beautifulsoup
1 Answer

One approach is to find the <strong> tag whose content matches the day, then walk forward through the foods with .next_siblings until you hit another <strong> or run out of siblings. I used the lxml parser, but this works with html.parser too.

Here it is on your sample DOM (I adjusted the foods to make it clear that it works):

import bs4
import requests

day = "Tuesday"
dom = """
<div class="menu">
<strong>Monday</strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday</strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday</strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday</strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday</strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>
"""

soup = bs4.BeautifulSoup(dom, "lxml")
menu = soup.find(class_ = "menu")
foods = []

for elem in menu.find("strong", text=day).next_siblings:
    if elem.name == "strong": 
        break

    if isinstance(elem, bs4.element.NavigableString) and elem.strip() != "":
        foods.append(elem.strip())

print(foods)

Output:

['Food 3', 'Food 4']
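
If you want the same lookup for different days, one way is to wrap the loop in a small helper; foods_for is just an illustrative name, reusing the menu object and imports from above:

def foods_for(menu, day):
    # Collect the text nodes between the heading for `day` and the next <strong> heading.
    heading = menu.find("strong", text=day)
    if heading is None:
        return []
    foods = []
    for elem in heading.next_siblings:
        if elem.name == "strong":
            break
        if isinstance(elem, bs4.element.NavigableString) and elem.strip() != "":
            foods.append(elem.strip())
    return foods

print(foods_for(menu, "Wednesday"))  # ['Food 5', 'Food 6']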

Here it is on the first live site, https://www.kvartersmenyn.se/rest/15494. Note the extended character encoding and the lambda, so that the match works when there's extra content in the <b> tag:

# -*- coding: latin1 -*-

import bs4
import requests

day = "Måndag"
url = "https://www.kvartersmenyn.se/rest/15494"

soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_ = "meny")
foods = []

for elem in menu.find("b", text = lambda x: day in x).next_siblings:
    if elem.name == "b": 
        break

    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)

print(day)

for food in foods:
    print(food)

Output:

Måndag
A: Gaeng phed**
röd curry i cocosmjölk med sötbasilika, wokade blandade grönsaker
B: Ghai phad med mauang** (biff) wok i chilipaste med cashewnötter, grönsaker
C: Phad bamme (fläsk) wokade äggnudlar i ostronsås, grönsaker
D: Satay gay currymarinerade kycklingfiléspett med jordnötssås
E: Gai chup pheng tood*
Friterad kyckling med söt chilisås och ris
F: Phad bambou* (biff) wok i ostronsås med bambu, lök, champinjoner

Finally, here it is on your second live site, http://bramatno8.kvartersmenyn.se/. All of these sites have different and inconsistent structures, so it's not obvious there's a silver bullet that handles them all. I suspect these menus are hand-coded by people who may not understand document structure, so handling arbitrary updates to the pages will take some work.

Here goes:

# -*- coding: latin1 -*-

import bs4
import requests

day = "Måndag"
url = "http://bramatno8.kvartersmenyn.se/"

soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_ = "meny")
foods = []

for elem in menu.find(text = day).parent.next_siblings:
    if elem.name == "strong": 
        break

    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)

print(day)

for food in foods:
    print(food)

Output:

Måndag
Viltskav med rårörda lingon (eko), vaxbönor och potatispuré
Sesambakad blomkål med sojamarinerade böngroddar, salladslök, rädisa och sojabönor samt ris
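
In practice you would probably derive day from today's date instead of hard-coding it. A minimal sketch, assuming these sites always spell the weekday headings as listed here:

import datetime

# Map Python's weekday index (Monday=0 .. Sunday=6) to the Swedish headings used on these menus.
SWEDISH_DAYS = ["Måndag", "Tisdag", "Onsdag", "Torsdag", "Fredag", "Lördag", "Söndag"]
day = SWEDISH_DAYS[datetime.date.today().weekday()]
print(day)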