I'm trying to do some NLP on subreddit pages. I have a big chunk of code that collects data from two web pages. It keeps scraping data until it reaches range(40). That's fine, except I know the subreddits I picked have far more posts than my code lets me scrape.
Can anyone figure out what's going on here?
posts_test = []
url = 'https://www.reddit.com/r/TheOnion/.json?after='
for i in range(40):
    res = requests.get(url, headers={'User-agent': 'Maithili'})
    the_onion = res.json()
    for i in range(25):
        post_t = []
        post_t.append(the_onion['data']['children'][i]['data']['title'])
        post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = the_onion['data']['after']
    url = 'https://www.reddit.com/r/TheOnion/.json?after=' + after
    time.sleep(3)
# Not the onion
url = 'https://www.reddit.com/r/nottheonion/.json?after='
for i in range(40):
    res2 = requests.get(url, headers=headers2)
    not_onion_json = res2.json()
    for i in range(25):
        post_t = []
        post_t.append(not_onion_json['data']['children'][i]['data']['title'])
        post_t.append(not_onion_json['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = not_onion_json['data']['after']
    url = "https://www.reddit.com/r/nottheonion/.json?after=" + after
    time.sleep(3)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-57-6c1cfdd42421> in <module>
      7     for i in range(25):
      8         post_t = []
----> 9         post_t.append(the_onion['data']['children'][i]['data']['title'])
     10         post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
     11         posts_test.append(post_t)

IndexError: list index out of range
The reason you stop at 40 is that you told Python to stop at 40:

for i in range(40):

The good news is that you're already collecting the cursor for the next page here:

after = not_onion_json['data']['after']

Assuming after == None once you reach the last page, I'd suggest using a while loop instead, something like:

while after != None:

That will keep going until you run out of pages.
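A minimal sketch of that idea, pulled together into one function. Here `fetch_page` is a hypothetical helper (not in your code) standing in for the `requests.get(...).json()` call, so the pagination logic can be shown on its own; note it also iterates over however many posts a page actually returns instead of assuming 25, which is the likely source of your IndexError on short pages:

```python
import time

def scrape_subreddit(fetch_page, max_pages=40, delay=3):
    """Collect [title, subreddit] pairs, following the 'after' cursor.

    fetch_page(after) is a stand-in that returns the parsed JSON for one
    listing page; in the real scraper it would wrap something like
    requests.get('https://www.reddit.com/r/TheOnion/.json',
                 params={'after': after or ''},
                 headers={'User-agent': 'Maithili'}).json()
    """
    posts = []
    after = None            # no cursor yet: fetch the first page
    pages = 0
    while True:
        data = fetch_page(after)['data']
        # Loop over however many posts the page actually returned,
        # instead of assuming 25 -- short pages raise IndexError otherwise.
        for child in data['children']:
            posts.append([child['data']['title'],
                          child['data']['subreddit']])
        after = data['after']           # None once the last page is reached
        pages += 1
        if after is None or pages >= max_pages:
            break
        time.sleep(delay)               # be polite between requests
    return posts
```

The `max_pages` cap keeps the old safety limit of 40 requests, but the `after is None` check is what actually ends the walk, so you get every page the subreddit will serve.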