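Many Facebook fan pages now use the following format: https://www.facebook.com/TiltedKiltEsplanade, where "TiltedKiltEsplanade" is an example of the name claimed by the page owner. However, the RSS feed for the same page can be found at https://www.facebook.com/feeds/page.php?id=414117051979234&format=rss20, where 414117051979234 is an ID that can be determined by visiting https://graph.facebook.com/TiltedKiltEsplanade and looking for the last numeric ID listed on the page (there are two similar-looking IDs at the top of the page, but they can be ignored).
I have a long list of Facebook fan pages in the format above, and I would like to quickly get the numeric IDs corresponding to these pages so that I can add them all to an RSS reader. What is the easiest way to scrape these pages? I am familiar with Scrapy, but I am not sure whether it can be used, because the Graph version of the page is not marked up in a way that allows easy scraping (as far as I can tell).
Thanks.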
The output of a Graph request is a JSON object. That is easier to handle than HTML content.
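To illustrate the point about JSON, here is a minimal sketch that pulls the ID out of a Graph response body and builds the feed URL from it. The sample payload is made up for illustration (only the `id` value comes from the question; the `name` field is an assumption), and `feed_url_from_graph` is a hypothetical helper name. It assumes Python 3.

```python
import json

FEED_URL = 'https://www.facebook.com/feeds/page.php?id={id}&format=rss20'


def feed_url_from_graph(body):
    """Return the RSS feed URL for the page described by a Graph JSON body."""
    # Accept both bytes (as an HTTP client would return) and text.
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    data = json.loads(body)
    return FEED_URL.format(id=data['id'])


# Illustrative payload, not a real Graph response.
sample = '{"id": "414117051979234", "name": "Example Page"}'
print(feed_url_from_graph(sample))
```

No HTML parsing or selectors are involved: one `json.loads` call and a dictionary lookup are enough.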
Here is a simple spider that implements what you are looking for:
# file: myspider.py
import json

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = (
        # Add here more urls. Alternatively, make the start urls dynamic
        # reading them from a file, db or an external url.
        'https://www.facebook.com/TiltedKiltEsplanade',
    )

    graph_url = 'https://graph.facebook.com/{name}'
    feed_url = 'https://www.facebook.com/feeds/page.php?id={id}&format=rss20'

    def start_requests(self):
        for url in self.start_urls:
            # This assumes there is no trailing slash
            name = url.rpartition('/')[2]
            yield Request(self.graph_url.format(name=name), self.parse_graph)

    def parse_graph(self, response):
        data = json.loads(response.body)
        return Request(self.feed_url.format(id=data['id']), self.parse_feed)

    def parse_feed(self, response):
        # You can use the xml spider, xml selector or the feedparser module
        # to extract information from the feed.
        self.log('Got feed: %s' % response.body[:100])
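Since the question involves a long list of pages, the start URLs could be read from a file, as the comment in `start_urls` suggests. The sketch below is one possible way to do that and does not depend on Scrapy. `page_name` and `load_page_urls` are hypothetical helpers, the one-URL-per-line format with `#` comments is an assumption, and the `?ref=ts` query string is only there to exercise the parsing. It assumes Python 3.

```python
from urllib.parse import urlparse


def page_name(url):
    """Return the last path segment of a fan page URL."""
    # Using the parsed path drops any query string; rstrip drops a trailing slash.
    path = urlparse(url).path
    return path.rstrip('/').rpartition('/')[2]


def load_page_urls(lines):
    """Yield non-empty, non-comment lines from an iterable of text lines."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith('#'):
            yield line


# An in-memory list stands in for the lines of a file; with a real file,
# an open file object could be passed instead.
lines = [
    'https://www.facebook.com/TiltedKiltEsplanade\n',
    '\n',
    '# a comment line\n',
    'https://www.facebook.com/TiltedKiltEsplanade/?ref=ts\n',
]
names = [page_name(url) for url in load_page_urls(lines)]
print(names)
```

Unlike the plain `rpartition` call in the spider, `page_name` tolerates a trailing slash, so the input list needs less cleaning beforehand.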
Running the spider produces the following output:
$ scrapy runspider myspider.py
2014-01-11 02:19:48-0400 [scrapy] INFO: Scrapy 0.21.0-97-g21a8a94 started (bot: scrapybot)
2014-01-11 02:19:48-0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-01-11 02:19:48-0400 [scrapy] DEBUG: Overridden settings: {}
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled item pipelines:
2014-01-11 02:19:49-0400 [myspider] INFO: Spider opened
2014-01-11 02:19:49-0400 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-11 02:19:49-0400 [myspider] DEBUG: Crawled (200) <GET https://graph.facebook.com/TiltedKiltEsplanade> (referer: None)
2014-01-11 02:19:50-0400 [myspider] DEBUG: Crawled (200) <GET https://www.facebook.com/feeds/page.php?id=414117051979234&format=rss20> (referer: https://graph.facebook.com/TiltedKiltEsplanade)
2014-01-11 02:19:50-0400 [myspider] DEBUG: Got feed: <?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
xmlns:media="http://search.yahoo.com
2014-01-11 02:19:50-0400 [myspider] INFO: Closing spider (finished)
2014-01-11 02:19:50-0400 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 578,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6669,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 11, 6, 19, 50, 849162),
'log_count/DEBUG': 9,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 1, 11, 6, 19, 49, 221361)}
2014-01-11 02:19:50-0400 [myspider] INFO: Spider closed (finished)
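The comment in `parse_feed` names a few ways to extract information from the feed. As one more option not mentioned there, the standard library's `xml.etree.ElementTree` can read RSS 2.0 items. This is only a sketch: the sample feed is invented for illustration (the log shows just the first 100 characters of the real one), and `feed_entries` is a hypothetical helper. It assumes Python 3.

```python
import xml.etree.ElementTree as ET


def feed_entries(xml_bytes):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for item in root.iter('item'):
        # findtext returns the default when the child element is missing.
        title = item.findtext('title', default='')
        link = item.findtext('link', default='')
        entries.append((title, link))
    return entries


# Invented sample feed, not actual Facebook output.
sample_feed = b"""<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Example page</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

for title, link in feed_entries(sample_feed):
    print(title, link)
```

Inside the spider, something like `feed_entries(response.body)` could replace the `self.log` call, assuming the feed uses plain, un-namespaced `item`, `title` and `link` elements.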