Scrapy Spider fails to find 2 div elements among hundreds that are scraped correctly

Question

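I am using Scrapy in a Jupyter notebook to scrape the Yellow Pages website, and I have run into a strange error.

My code scrapes the Yellow Pages list view that appears when a user searches for "auto" in various zip codes. It correctly scrapes almost all of the business listings (about 500), but there are 2 that I cannot get my spider to scrape.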

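The 2 businesses are Roger's Services and Northeastern Bus Rebuilders. (See the screenshots below for how they appear on the website.) I inspected the site's HTML, and the structure of the divs containing the information I want does not appear to differ in any obvious way from the other divs, which scrape without trouble. The page I am having trouble with can be found [here](https://www.yellowpages.com/search?search_terms=auto&geo_location_terms=02136). Note that all the other businesses on this page and on subsequent pages are scraped correctly. Only these 2, on this page specifically, do not seem to be "seen" by the spider.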

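I tried to scrape the information by selecting, by ID, the div that contains the divs and elements I want, but to no avail.

I have tried both CSS selectors and XPath. See the code below for my attempts. Using either CSS or XPath on the div.result lid-XXXX elements, I can successfully get the containing div of the business result directly above Roger's Services, but when I plug in the ID for Roger's Services I get nothing.

Any suggestions as to why this is happening? I am really at a loss, because the HTML looks identical from one business listing to the next, and only these two seem to cause problems.

Please let me know what can be done to fix this! Thank you!

Here is my parse code, with the print statements I added while trying to debug: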

def parse(self, response):

    ### DEBUGGING SECTION BEGIN ###

    div_pointer = response.css('div#lid-5671823')
    # div_pointer = response.xpath('//div[contains(@id, "5671823")]')
    div_content = div_pointer.get()
    if div_content is not None:
        print('The div containing Roger Service')
        print(div_content)
        print('Trying to get the stuff inside:')
        div_inside = div_pointer.css('div.info-section.info-primary')
        print(div_inside)

    ### DEBUGGING SECTION END ###

    containing_divs = response.css('div.info-section.info-primary')  # works fine for everything else

    for containing_div in containing_divs:
        business_name = containing_div.css('a.business-name span::text').get()
        if business_name is None:  # note: paid listings are coded slightly differently in the HTML, so handle them here
            business_name = containing_div.css('a.business-name::text').get()
            href = containing_div.css('a.business-name::attr(href)').get()
            end_of_href = href[href.rfind('-') + 1:]  # returns ypid?lid=...
            ypid = end_of_href[:end_of_href.rfind('?lid')]  # substring to return only the ypid
        else:
            href = containing_div.css('a.business-name::attr(href)').get()
            ypid = href[href.rfind('-') + 1:]  # the ypid is the end of the link (href), so take everything after the last '-'

        categories = containing_div.css('div.categories a::text').getall()

        if ypid.rfind('?lid') != -1:  # clean the ypid (some non-paid hrefs also end with ?lid=...)
            end_of_href = href[href.rfind('-') + 1:]  # returns ypid?lid=...
            ypid = end_of_href[:end_of_href.rfind('?lid')]  # substring to return only the ypid

        yield {
            'business_name': business_name,
            'ypid': ypid,
            'categories': categories,
        }

Answer 1

I couldn't find the problem in your code, but you can do this more easily with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.yellowpages.com/search?search_terms=auto&geo_location_terms=02136'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

data = [{
    'business_name': r.select_one('a.business-name').text, 
    'ypid': r.select_one('a.business-name').get('href').rsplit('-', 1)[-1].rsplit('?lid=', 1)[0],
    'categories': [a.text for a in r.select('div.categories > a')] 
} for r in soup.select('div.result')]

print(data)
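
The `rsplit` chain used for `ypid` above can be checked in isolation. The following is a minimal sketch, not part of the original answer: the helper name `extract_ypid` and both sample hrefs are made up for illustration and are not real Yellow Pages links.

```python
def extract_ypid(href: str) -> str:
    """Return the text after the last '-' in href, minus any '?lid=' suffix."""
    tail = href.rsplit('-', 1)[-1]      # everything after the last hyphen
    return tail.rsplit('?lid=', 1)[0]   # drop a trailing '?lid=...' if present


# Hypothetical hrefs, invented for this example only.
plain_href = '/boston-ma/mip/example-auto-repair-123456'
paid_href = '/boston-ma/mip/example-auto-body-654321?lid=1000000001'

print(extract_ypid(plain_href))
print(extract_ypid(paid_href))
```

One caveat with this approach: it splits on the last hyphen first, so it assumes nothing after the ypid (such as the `lid` value) contains a `-`.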

The data is also available in script[type="application/ld+json"], but that does not include paid listings or categories:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.yellowpages.com/search?search_terms=auto&geo_location_terms=02136'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
script = soup.find('script', type='application/ld+json', string=lambda txt: txt is not None and 'LocalBusiness' in txt)
data = json.loads(script.text)
print(data)
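
The answer does not show the shape of the decoded JSON-LD. As a hedged sketch, assuming the payload may be either a single object or a list of objects whose `@type` is a plain string, the entries of interest could be filtered like this. The helper name `local_businesses` and the sample payload are hypothetical.

```python
import json


def local_businesses(raw_json: str) -> list:
    """Parse a JSON-LD string and keep only the LocalBusiness entries."""
    payload = json.loads(raw_json)
    # JSON-LD permits a single object or an array of objects; normalize to a list.
    items = payload if isinstance(payload, list) else [payload]
    return [
        item for item in items
        if isinstance(item, dict) and item.get('@type') == 'LocalBusiness'
    ]


# Hypothetical payload, invented for this example only.
sample = json.dumps([
    {'@type': 'LocalBusiness', 'name': 'Example Auto Repair'},
    {'@type': 'BreadcrumbList', 'itemListElement': []},
])

print(local_businesses(sample))
```

Normalizing to a list up front means later code that iterates over the entries does not need to care which shape the page used.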

You can add the categories like this:

categories = {r.select_one('a.business-name').text: [a.text for a in r.select('div.categories > a')] for r in soup.select('div.result')}
for d in data:
    d['categories'] = categories.get(d['name'], [])
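
To make the name-based merge above concrete, here is a small self-contained sketch on made-up data. The function name `add_categories` and the sample businesses are assumptions for illustration; real records would come from the JSON-LD and the `div.result` elements.

```python
def add_categories(businesses, categories_by_name):
    """Attach a 'categories' list to each business dict, matched on its 'name'."""
    for business in businesses:
        # Businesses with no matching name get an empty list.
        business['categories'] = categories_by_name.get(business.get('name'), [])
    return businesses


# Hypothetical records, invented for this example only.
businesses = [{'name': 'Example Auto Repair'}, {'name': 'Example Towing'}]
categories_by_name = {'Example Auto Repair': ['Auto Repair & Service', 'Brake Repair']}

merged = add_categories(businesses, categories_by_name)
print(merged)
```

Because the lookup key is the business name, two listings that share a name would end up with the same categories, so this only works cleanly when names are unique on the page.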
