When web scraping in Python with BeautifulSoup and a regex (re.compile) in findAll, I can't loop over elements correctly by CSS class

Problem description

I am new to BeautifulSoup and Python. On this WordPress site there are 4 articles on the home page, but my code only gives me 3 of them, each with an image attached. Is there a simpler way to do this?

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")
articles = bsObj.findAll("article", {"class": "post"})
print(len(articles))

for article in articles:
    # Search within the current article, not the whole document,
    # and escape the dot so ".jpg" is matched literally
    image = article.findAll("img", {"src": re.compile(r"/wp-content/uploads/.*\.jpg")})
    print(image)
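To see why the scope of the search matters, here is a minimal sketch against made-up markup (the HTML below is hypothetical, standing in for the live page): calling `findAll` on the whole `bsObj` inside the loop returns every matching image on each pass, while calling it on the current `article` returns only that article's own image.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical markup standing in for the real page.
html = """
<article class="post"><img src="/wp-content/uploads/a.jpg"></article>
<article class="post"><img src="/wp-content/uploads/b.jpg"></article>
"""
bsObj = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"/wp-content/uploads/.*\.jpg")

for article in bsObj.findAll("article", {"class": "post"}):
    # Searching the whole document finds both images every iteration...
    assert len(bsObj.findAll("img", {"src": pattern})) == 2
    # ...while searching within the current article finds just its own.
    assert len(article.findAll("img", {"src": pattern})) == 1
```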
Tags: regex, python-3.x, beautifulsoup, css-selectors
1 Answer (1 vote)

Now that you have figured out the article-count issue, there really isn't a much simpler solution. There are a couple of other versions you might want to check out, though.

A simplified version of your code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
# "html" is not a valid parser name; use the built-in "html.parser"
bsObj = BeautifulSoup(html, "html.parser")
articles = bsObj.findAll("article", {"class": "post"})

for article in articles:
    print(article.find("img").get("src"))

And there is this version, which uses a list comprehension:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")
images = [article.find("img").get("src") for article in bsObj.findAll("article", {"class": "post"})]

print(images)
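Since the question is tagged css-selectors, a `select()`-based sketch is another option: one selector handles both the article filtering and the image lookup (requires bs4 >= 4.7 for attribute selectors; the markup below is made up to stand in for the live page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live page.
html = """
<article class="post"><img src="/wp-content/uploads/a.jpg"></article>
<article class="post"><img src="/wp-content/uploads/b.jpg"></article>
<article class="page"><img src="/logo.png"></article>
"""
bsObj = BeautifulSoup(html, "html.parser")

# "article.post img[src$='.jpg']" matches only .jpg images inside post articles.
images = [img.get("src") for img in bsObj.select("article.post img[src$='.jpg']")]
print(images)  # ['/wp-content/uploads/a.jpg', '/wp-content/uploads/b.jpg']
```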

There is also an lxml approach. It isn't as pretty, but with XPath you can easily locate elements even when they sit in odd places:

from urllib.request import Request, urlopen
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
lxmlHtml = etree.HTMLParser()

htmlPage = etree.parse(html, lxmlHtml)

images = htmlPage.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")

for image in images:
    print(image.attrib["src"])
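The same XPath expression can be checked offline against a small string, which avoids the network call (the markup below is hypothetical, shaped to match what the expression targets):

```python
from lxml import etree

# Hypothetical markup: one plain post, one post that is also a page.
html = """
<html><body>
<article class="post"><img src="/wp-content/uploads/a.jpg"/></article>
<article class="post page"><img src="/skip.jpg"/></article>
</body></html>
"""
root = etree.fromstring(html, etree.HTMLParser())

# Keep articles whose class contains "post" but not "page".
images = root.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")
print([img.attrib["src"] for img in images])  # ['/wp-content/uploads/a.jpg']
```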