When web scraping in Python with BeautifulSoup and a regex (re.compile) in findAll, I can't loop over elements correctly by CSS class

Problem description

I am new to BeautifulSoup and Python. On this WordPress site there are 4 articles on the home page, but my code only gives me 3 of them, each with an image attached. Is there a simpler way to do this?

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")
articles = bsObj.findAll("article", {"class": "post"})
print(len(articles))

for article in articles:
    # Search within the current article, not the whole document,
    # and escape the dot so ".jpg" is matched literally
    image = article.findAll("img", {"src": re.compile(r"/wp-content/uploads/.*\.jpg")})
    print(image)
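To see why the scope of the search matters, here is a minimal sketch against made-up markup (the HTML below is hypothetical, standing in for the live page): calling `findAll` on the whole `bsObj` inside the loop returns every matching image on each pass, while calling it on the current `article` returns only that article's own image.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical markup standing in for the real page.
html = """
<article class="post"><img src="/wp-content/uploads/a.jpg"></article>
<article class="post"><img src="/wp-content/uploads/b.jpg"></article>
"""
bsObj = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"/wp-content/uploads/.*\.jpg")

for article in bsObj.findAll("article", {"class": "post"}):
    # Searching the whole document finds both images every iteration...
    assert len(bsObj.findAll("img", {"src": pattern})) == 2
    # ...while searching within the current article finds just its own.
    assert len(article.findAll("img", {"src": pattern})) == 1
```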
Tags: regex, python-3.x, beautifulsoup, css-selectors
1 Answer (1 vote)

Now that you have figured out the article-count issue, there really isn't a much simpler solution. There are a couple of other versions you might want to check out, though.

A simplified version of your code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
# "html" is not a valid parser name; use the built-in "html.parser"
bsObj = BeautifulSoup(html, "html.parser")
articles = bsObj.findAll("article", {"class": "post"})

for article in articles:
    print(article.find("img").get("src"))

And there is this version, which uses a list comprehension:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")
images = [article.find("img").get("src") for article in bsObj.findAll("article", {"class": "post"})]

print(images)
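Since the question is tagged css-selectors, a `select()`-based sketch is another option: one selector handles both the article filtering and the image lookup (requires bs4 >= 4.7 for attribute selectors; the markup below is made up to stand in for the live page).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live page.
html = """
<article class="post"><img src="/wp-content/uploads/a.jpg"></article>
<article class="post"><img src="/wp-content/uploads/b.jpg"></article>
<article class="page"><img src="/logo.png"></article>
"""
bsObj = BeautifulSoup(html, "html.parser")

# "article.post img[src$='.jpg']" matches only .jpg images inside post articles.
images = [img.get("src") for img in bsObj.select("article.post img[src$='.jpg']")]
print(images)  # ['/wp-content/uploads/a.jpg', '/wp-content/uploads/b.jpg']
```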

There is also an lxml approach. It isn't as pretty, but with XPath you can easily locate elements even when they sit in odd places:

from urllib.request import Request, urlopen
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = Request(url="http://ionnews.mu", headers=headers)
html = urlopen(req)
lxmlHtml = etree.HTMLParser()

htmlPage = etree.parse(html, lxmlHtml)

images = htmlPage.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")

for image in images:
    print(image.attrib["src"])
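The same XPath expression can be checked offline against a small string, which avoids the network call (the markup below is hypothetical, shaped to match what the expression targets):

```python
from lxml import etree

# Hypothetical markup: one plain post, one post that is also a page.
html = """
<html><body>
<article class="post"><img src="/wp-content/uploads/a.jpg"/></article>
<article class="post page"><img src="/skip.jpg"/></article>
</body></html>
"""
root = etree.fromstring(html, etree.HTMLParser())

# Keep articles whose class contains "post" but not "page".
images = root.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")
print([img.attrib["src"] for img in images])  # ['/wp-content/uploads/a.jpg']
```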