Error when web scraping with Python BeautifulSoup: extracting content from a GitHub profile

Question  Votes: 0  Answers: 2

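This is Python code that uses the BeautifulSoup library to scrape content from a GitHub repositories page. I'm getting this error:

'NoneType' object has no attribute 'text'

in this simple code. The error occurs on two lines, which are marked with comments in the code.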

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for':'your-repos-filter'}) 

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First error position
    repo['desc'] = row.find('div').p.text
    # Second error position
    repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    repos.append(repo) 

filename = 'extract.csv'
with open(filename, 'w') as f: 
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

OUTPUT

Traceback (most recent call last):
  File "webscrapping.py", line 16, in <module>
    repo['desc'] = row.find('div').p.text
AttributeError: 'NoneType' object has no attribute 'text'

python web-scraping beautifulsoup
2 Answers
0 votes

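This happens because finding elements with BeautifulSoup behaves like a dict.get() call. When you call find, it looks the element up in the element tree. If it can't find one, it returns None instead of raising an exception. None doesn't have the attributes an element would have, such as text or attrs. So when you access .text without a try/except and without checking the type first, you are gambling that the element will always be there.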
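
A minimal sketch of that behavior, using a made-up HTML snippet rather than the real GitHub page (the markup and variable names are illustrative assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no <p> description.
html = """
<ul>
  <li><div><h3><a>repo-one</a></h3><p>First description</p></div></li>
  <li><div><h3><a>repo-two</a></h3></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("li")

# find() (and attribute-style access such as .p) returns None on a miss,
# much like dict.get() does for a missing key.
first_p = rows[0].find("div").p
second_p = rows[1].find("div").p

print(first_p.text)  # the <p> exists, so .text works
print(second_p)      # None

try:
    second_p.text
except AttributeError as exc:
    # Same failure as in the question: None has no .text attribute.
    error_message = str(exc)
    print(error_message)
```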

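I would probably first store the elements that are giving you trouble in a temporary variable so that you can type check them. Either that, or use try/except.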

Type Checking

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text


    p = row.find('div').p
    if p is not None:
        repo['desc'] = p.text
    else:
        repo['desc'] = None

    meta = row.find('div', attrs = {'class':'f6 text-gray mt-2'})
    # Guard the outer div too: chaining .find() on None raises AttributeError
    lang = meta.find('span', attrs = {'class':'mr-3'}) if meta is not None else None

    if lang is not None:
        repo['lang'] = lang.text
    else:
        repo['lang'] = None
    repos.append(repo)

try/except

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First error position
    try:
        repo['desc'] = row.find('div').p.text
    except AttributeError:
        # Accessing .text on None raises AttributeError, not TypeError
        repo['desc'] = None
    # Second error position
    try:
        repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    except AttributeError:
        repo['lang'] = None
    repos.append(repo)

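Personally I would lean toward try/except, because it is more concise and catching exceptions is good practice that makes the program more robust.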
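
If the same guard is needed for several fields, either approach can be folded into a small helper. The sketch below is one possible way to do it; `text_or_none` is a hypothetical name, not part of BeautifulSoup, and the HTML is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical row with a name but no <p> description.
html = """
<li itemprop="owns">
  <div><h3><a>python_notes</a></h3></div>
</li>
"""
row = BeautifulSoup(html, "html.parser").find("li", attrs={"itemprop": "owns"})

div = row.find("div")
repo = {
    "name": text_or_none(div.find("h3").a),
    "desc": text_or_none(div.p),  # div.p is None here, so this stays None
}
print(repo)
```

This keeps the None check in one place, so the loop body only lists which elements to look up.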


0 votes

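Your find calls are imprecise and chained, so when you look for a p tag inside a <div> that has no such child, you get None. You then go on to access the .text attribute on None, which crashes your program with an AttributeError.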

Try the following set of .find calls, which target the itemprop attributes you are after and use a try-except block to null-coalesce any missing fields:

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for': 'your-repos-filter'}) 

for row in table.find_all('li', {'itemprop': 'owns'}): 
    repo = {
        'name': row.find('a', {'itemprop' : 'name codeRepository'}),
        'desc': row.find('p', {'itemprop' : 'description'}),
        'lang': row.find('span', {'itemprop' : 'programmingLanguage'})
    }

    for k, v in repo.items():
        try: 
            repo[k] = v.text.strip()
        except AttributeError: pass

    repos.append(repo)

filename = 'extract.csv'
with open(filename, 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Debug output (in addition to the written CSV):

[   {   'desc': 'This a Django-Python Powered a simple functionality based '
                'Bot application',
        'lang': 'Python',
        'name': 'Sandesh'},
    {'desc': None, 'lang': 'Jupyter Notebook', 'name': 'python_notes'},
    {   'desc': 'Installing DSpace using docker',
        'lang': 'Java',
        'name': 'DSpace-Docker-Installation-1'},
    {   'desc': 'This Repo Contains the DSpace Installation Steps',
        'lang': None,
        'name': 'DSpace-Installation'},
    {   'desc': '(Official) The DSpace digital asset management system that '
                'powers your Institutional Repository',
        'lang': 'Java',
        'name': 'DSpace'},
    {   'desc': 'This Repo contain the DSpace installation steps with '
                'docker.',
        'lang': None,
        'name': 'DSpace-Docker-Installation'},
    {   'desc': 'This Repository contain the Intermediate system for the '
                'Collaboration and DSpace System',
        'lang': 'Python',
        'name': 'Community-OER-Repository'},
    {   'desc': 'A class website to share the knowledge and expanding the '
                'productivity through digital communication.',
        'lang': 'PHP',
        'name': 'class-website'},
    {   'desc': 'This is a POC for the Voting System. It is a precise '
                'design and implementation of Voting System based on the '
                'features of Blockchain which has the potential to '
                'substitute the traditional e-ballet/EVM system for voting '
                'purpose.',
        'lang': 'Python',
        'name': 'Blockchain-Based-Ballot-System'},
    {   'desc': 'It is a short describtion of Modern Django',
        'lang': 'Python',
        'name': 'modern-django'},
    {   'desc': 'It is just for the sample work.',
        'lang': 'HTML',
        'name': 'Task'},
    {   'desc': 'This Repo contain the sorting algorithms in C,predefiend '
                'function of C, C++ and Java',
        'lang': 'C',
        'name': 'Sorting_Algos_Predefined_functions'},
    {   'desc': 'It is a arduino program, for monitor the temperature and '
                'humidity from sensor DHT11.',
        'lang': 'C++',
        'name': 'DHT_11_Arduino'},
    {   'desc': "This is a registration from,which collect data from user's "
                'desktop and put into database after validation.',
        'lang': 'PHP',
        'name': 'Registration_Form'},
    {   'desc': 'It is a dynamic multi-part data driven search engine in '
                'PHP & MySQL from absolutely scratch for the website.',
        'lang': 'PHP',
        'name': 'search_engine'},
    {   'desc': 'It is just for learning github.',
        'lang': None,
        'name': 'Hello_world'}]
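
As for the CSV itself, `csv.DictWriter` writes `None` values as empty strings, so repositories with a missing description or language end up with blank cells rather than the text `None`. A small sketch of this, writing to an in-memory buffer instead of `extract.csv` and using two rows taken from the output above:

```python
import csv
import io

repos = [
    {'name': 'python_notes', 'desc': None, 'lang': 'Jupyter Notebook'},
    {'name': 'Hello_world', 'desc': 'It is just for learning github.', 'lang': None},
]

buffer = io.StringIO()  # stands in for the file opened in the script
w = csv.DictWriter(buffer, ['name', 'desc', 'lang'])
w.writeheader()
w.writerows(repos)

# None values appear as empty cells between (or after) the commas.
csv_text = buffer.getvalue()
print(csv_text)
```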