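I'm new to BeautifulSoup and Python. On this WordPress site the home page shows 4 articles, but my code only returns 3 of them, and therefore only 3 images. Is there a simpler way to do this?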
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, features="html5lib")
articles = bsObj.findAll("article", {"class": "post"})
print(len(articles))
for article in articles:
    image = bsObj.findAll("img", {"src": re.compile("/wp-content/uploads/.*.jpg")})
    print(image)
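Now that you have sorted out the article count problem, there isn't really a simpler solution. There are a few alternative versions, though, if you want to look at them.
A simplified version of your code: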
import urllib.request  # binds the name "urllib" so urllib.request.Request resolves
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
# "html" lets BeautifulSoup pick the best HTML parser that is installed
bsObj = BeautifulSoup(html, "html")
articles = bsObj.findAll("article", {"class": "post"})
for article in articles:
    # Search inside each article rather than the whole document
    print(article.find("img").get("src"))
There is also this version, which uses a list comprehension:
import urllib.request  # binds the name "urllib" so urllib.request.Request resolves
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
bsObj = BeautifulSoup(html, "html")
# One image URL per article, collected in a single expression
images = [article.find("img").get("src") for article in bsObj.findAll("article", {"class": "post"})]
print(images)
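Both versions above call article.find("img").get("src") directly, and find() returns None when an article has no image, which would raise an AttributeError. Here is a minimal sketch of a guarded variant. It runs against a made-up HTML string instead of the live site, and the names SAMPLE_HTML and extract_image_urls are only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real home page.
SAMPLE_HTML = """
<article class="post"><img src="/wp-content/uploads/a.jpg"></article>
<article class="post"><p>No image here</p></article>
<article class="post sticky"><img src="/wp-content/uploads/b.jpg"></article>
<article class="page"><img src="/wp-content/uploads/c.jpg"></article>
"""


def extract_image_urls(html):
    """Return the src of the first <img> inside each <article class="post">."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for article in soup.find_all("article", {"class": "post"}):
        img = article.find("img")  # look inside this article only
        if img is not None and img.get("src"):  # skip articles without an image
            urls.append(img["src"])
    return urls


print(extract_image_urls(SAMPLE_HTML))
```

Searching from article rather than from bsObj is what keeps each result tied to its own post.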
There is also a way to do it with lxml. It isn't as nice, but with xpath you can easily locate elements even when they sit in odd places:
import urllib.request  # binds the name "urllib" so urllib.request.Request resolves
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url="http://ionnews.mu", headers=headers)
html = urllib.request.urlopen(req)
lxmlHtml = etree.HTMLParser()
htmlPage = etree.parse(html, lxmlHtml)
# Every <img> under an <article> whose class contains "post" but not "page"
images = htmlPage.xpath("//article[contains(@class, 'post') and not(contains(@class, 'page'))]//img")
for image in images:
    print(image.attrib["src"])
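One caveat with contains(@class, 'post') is that it is a plain substring test, so it would also match a class such as "postbox". If that ever becomes a problem, a common XPath 1.0 idiom is to match the class as a whole token. The sketch below shows it on a made-up HTML string. SAMPLE_HTML, TOKEN_XPATH and image_sources are illustrative names, not part of the code above:

```python
from lxml import etree

# Hypothetical sample markup; "postbox" should NOT count as a post.
SAMPLE_HTML = """
<html><body>
<article class="post"><img src="/a.jpg"></article>
<article class="postbox"><img src="/b.jpg"></article>
<article class="featured post"><img src="/c.jpg"></article>
</body></html>
"""

# Pad the class attribute with spaces so " post " only matches a whole token.
TOKEN_XPATH = (
    "//article[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
    "//img/@src"
)


def image_sources(html):
    """Return the src of every <img> under an <article> with class token "post"."""
    root = etree.HTML(html)  # parse a string with lxml's HTML parser
    return [str(src) for src in root.xpath(TOKEN_XPATH)]


print(image_sources(SAMPLE_HTML))
```

Selecting //img/@src returns the attribute values directly, so the final loop over image.attrib is not needed in this variant.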