Basic Python web scraping with BeautifulSoup

Question

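I'm very new to coding and recently started looking into web scraping. I've been following this tutorial and reading the BS4 documentation, but I can't figure out why my code doesn't work.

I'm trying to use a web scraper to extract this post's headline, but it looks like it can't find any tag matching ('div', class_='header').

My code: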

import requests
from bs4 import BeautifulSoup

SOURCE = requests.get('http://coreyms.com/').text
SOUP = BeautifulSoup('SOURCE', 'lxml')

HEADER = SOUP.find('div', class_='header')
HEADLINE = HEADER.h2.a.href

print(HEADLINE)

Error message:

Traceback (most recent call last):
   File "WSCoreySchafer.py", line 10, in <module>
    HEADLINE = ARTICLE.h2.a.href
AttributeError: 'NoneType' object has no attribute 'h2'

Answer 1

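This line:

SOUP = BeautifulSoup('SOURCE', 'lxml')

tries to create a soup object from the literal string 'SOURCE', not from the value stored in the variable SOURCE. The parsed document therefore contains no tags at all, so find() returns None and the next line raises the AttributeError.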
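To see the difference without touching the network, here is a minimal sketch. The inline HTML snippet and the example.com URL are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real coreyms.com markup).
SOURCE = (
    '<header><h2 class="entry-title">'
    '<a href="http://example.com/post">Post</a>'
    '</h2></header>'
)

# Quoted: the parser receives the six-character text "SOURCE", which has no tags.
wrong_soup = BeautifulSoup('SOURCE', 'html.parser')
# Unquoted: the parser receives the HTML stored in the variable.
right_soup = BeautifulSoup(SOURCE, 'html.parser')

wrong_result = wrong_soup.find('h2', class_='entry-title')
right_result = right_soup.find('h2', class_='entry-title')

print(wrong_result)              # None, so wrong_result.a would raise AttributeError
print(right_result.a['href'])    # http://example.com/post
```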

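You're also looking for the wrong element in the HTML. You don't want a <div> with class="header"; what you're actually after is a <header> element (there are several on this page). I'd suggest looking for the <h2> element with class="entry-title" instead, which you can do like this: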

import requests
from bs4 import BeautifulSoup

SOURCE = requests.get('http://coreyms.com/').text
SOUP = BeautifulSoup(SOURCE, 'lxml')

HEADER = SOUP.find('h2', class_='entry-title')
headline_href = HEADER.a['href']
print(headline_href)

This prints:

http://coreyms.com/development/best-sublime-text-features-and-shortcuts
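Note that the corrected code reads the link with ['href'] rather than .href. On a BeautifulSoup tag, dot access looks for a child tag with that name, so .href quietly returns None instead of the attribute. The sketch below shows this, along with one way to guard against find() returning None. The helper name first_headline_link and the sample HTML are hypothetical, and 'html.parser' is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def first_headline_link(html):
    """Return the href of the first <h2 class="entry-title"> link, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('h2', class_='entry-title')
    # find() returns None when nothing matches, so check before going deeper.
    if heading is None or heading.a is None:
        return None
    # .get() returns None instead of raising KeyError if the attribute is missing.
    return heading.a.get('href')


# Made-up sample markup for illustration.
sample = '<h2 class="entry-title"><a href="http://example.com/post">Post</a></h2>'
link_tag = BeautifulSoup(sample, 'html.parser').a

print(link_tag.href)       # None: dot access searches for a child <href> tag
print(link_tag['href'])    # http://example.com/post
print(first_headline_link(sample))                 # http://example.com/post
print(first_headline_link('<p>no headline</p>'))   # None
```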
