How do I scrape/crawl within a specific domain in Python 3?

Question

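I want to crawl a specific domain and collect all of its URLs and text content.

I've seen one way to retrieve the links (retrieve links from web page using python and BeautifulSoup).

I also tried the following code to stay on specific domains, but it doesn't seem to work completely.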

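from urllib.parse import urlparse

domains = ["newyorktimes.com"]  # plus any other allowed domains

def links_if_allowed(url):  # wrapper name is illustrative; the original was a bare fragment
    p = urlparse(url)
    print(p, p.hostname)
    if p.hostname not in domains:
        return []
    # do something with p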

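My main problem is making sure the crawler stays on the specified domains, but I'm not sure how to do that when URLs can have different paths/fragments. I already know how to scrape the URLs from a given page. I'm open to using BeautifulSoup, lxml, scrapy, etc.

This question may be a bit too broad, but I've searched for crawling within a specific domain and couldn't find anything very relevant :/

Any help/resources would be greatly appreciated!

Thanks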

Answer 1

Try this.

from simplified_scrapy.spider import Spider, SimplifiedDoc
class MySpider(Spider):
  name = 'newyorktimes.com'
  allowed_domains = ['newyorktimes.com','nytimes.com']
  # concurrencyPer1s=1
  start_urls = 'https://www.newyorktimes.com'
  refresh_urls = True # For debug. If refresh_urls = True, start_urls will be crawled again.

  def extract(self, url, html, models, modelNames):
    doc = SimplifiedDoc(html)
    lstA = doc.listA(url=url['url'])
    return {"Urls": lstA, "Data": None} # Return data to framework

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(MySpider()) # Start crawling

There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/spider_examples
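
If you would rather see the domain check itself without a framework, here is a minimal sketch using only the standard library. One likely reason an exact test such as `p.hostname in domains` misses pages is that `hostname` includes subdomains like `www.`, so the sketch also accepts subdomains. Paths, query strings and fragments do not affect `hostname`; fragments are dropped here only so the same page is not queued twice. The names `ALLOWED_DOMAINS`, `is_allowed`, `LinkCollector` and `same_domain_links` are my own and not part of any library, and the domain list is just the one from the spider above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Assumption: the domains the crawler may visit (copied from the spider above).
ALLOWED_DOMAINS = {"newyorktimes.com", "nytimes.com"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, etc.
    host = (parts.hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html, allowed=ALLOWED_DOMAINS):
    """Resolve relative links, drop fragments, and keep only allowed hosts."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if is_allowed(absolute, allowed) and absolute not in links:
            links.append(absolute)
    return links


if __name__ == "__main__":
    sample = (
        '<a href="/section/world#top">World</a> '
        '<a href="https://example.com/">Elsewhere</a>'
    )
    print(same_domain_links("https://www.nytimes.com/", sample))
```

In a real crawler you would feed the returned links into a queue and keep a set of visited URLs; fetching, politeness delays and robots.txt handling are left out of this sketch.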
