I'm new to Python and web scraping, and I need help understanding the following error, which appears for every link I try to fetch from the start URL: ['https://www.eskom.co.za/category/news/']
2024-01-31 17:02:42 [py.warnings] WARNING: C:\Users\27671\PycharmProjects\Web crawling\venv\Lib\site-packages\scrapy\spidermiddlewares\offsite.py:74: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry
https://www.eskom.co.za/2023/02/ in allowed_domains.
warnings.warn(message, URLWarning)
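For reference, the warning means that Scrapy's allowed_domains setting expects bare domain names, while the list being passed here contains full URLs. A minimal sketch of the difference (the values are only illustrative):

#allowed_domains entries must be bare domains, not URLs
allowed_domains = ['eskom.co.za']                          #accepted
#allowed_domains = ['https://www.eskom.co.za/2023/02/']    #ignored, triggers the URLWarning above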
What I want to do is crawl into the 2022 media statements and scrape the description of each statement for a small project.
Below is the original code of the crawler:
#import libraries
from bs4 import BeautifulSoup as bs
import requests
import re
#to crawl extracted hyperlinks
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#requesting with get from url, assign to output which is response
response = requests.get('https://www.eskom.co.za/category/news/')
#assign results to variable from response
soup = bs(response.text, 'html.parser')
#find all a tags for hyperlinks
archives = soup.find_all('a')
#for loop href for all links
hrefs = []
for link in archives:
    hrefs.append(link.get('href'))

#assign class for crawler
class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = hrefs
    start_url = ['https://www.eskom.co.za/2022/07/']
    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    #define parse method for results, yield for scraping from linked page (media statements)
    def parse_item(self, response):
        yield {
            'description': response.css('entry-content-wrap h2::text').get()
        }
from scrapy.crawler import CrawlerProcess

class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = hrefs
    start_urls = ['https://www.eskom.co.za/category/news/']
    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'description': response.css('entry-content-wrap h2::text').get()
        }
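CrawlerProcess is imported above but never started; for completeness, a minimal sketch of how the spider is usually run from a script (the output file name is only an assumption):

#run the spider in-process; 'descriptions.json' is just an example output file
process = CrawlerProcess(settings={'FEEDS': {'descriptions.json': {'format': 'json'}}})
process.crawl(Crawler)
process.start()   #blocks until the crawl finishes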
Use urlparse to extract the domain from the URL and add it to the allowed domains list:
import scrapy
from urllib.parse import urlparse
from scrapy.spiders import CrawlSpider

class Crawler(CrawlSpider):
    name = "link_crawler"

    def add_dmn(self):
        url = "your url"
        #keep only the domain part (netloc) of the url
        dmn = urlparse(url).netloc
        self.allowed_domains = [dmn]
        yield scrapy.Request(url=url, callback=self.parse)
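Applied to the code in the question, a sketch along those lines (it assumes the hrefs list scraped earlier; the leading dot in the CSS selector assumes entry-content-wrap is a class) reduces every scraped href to its domain before assigning allowed_domains:

from urllib.parse import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#keep only the domain (netloc) of each href; relative links have an empty netloc and are skipped
domains = {urlparse(h).netloc for h in hrefs if h and urlparse(h).netloc}

class Crawler(CrawlSpider):
    name = 'link_crawler'
    allowed_domains = list(domains)   #e.g. ['www.eskom.co.za'] rather than full URLs
    start_urls = ['https://www.eskom.co.za/category/news/']
    rules = (
        Rule(LinkExtractor(allow='2022'), callback='parse_item'),
    )

    def parse_item(self, response):
        #if entry-content-wrap is a CSS class, the selector needs the leading dot
        yield {'description': response.css('.entry-content-wrap h2::text').get()}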