I can scrape two reddit pages up to a certain point, then I get an error and I don't understand why


I'm trying to do some NLP on subreddit pages. I have a chunk of code that collects a lot of data from two web pages. It scrapes data until it reaches range(40). That works fine, except I know the subreddits I picked have more posts than my code lets me scrape.

Can anyone figure out what's going on here?

import time

import requests

posts_test = []
url = 'https://www.reddit.com/r/TheOnion/.json?after='
for i in range(40):
    res = requests.get(url, headers={'User-agent': 'Maithili'})
    the_onion = res.json()
    for i in range(25):
        post_t = []
        post_t.append(the_onion['data']['children'][i]['data']['title'])
        post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = the_onion['data']['after']
    url = 'https://www.reddit.com/r/TheOnion/.json?after=' + after

    time.sleep(3)

# Not the onion
url = 'https://www.reddit.com/r/nottheonion/.json?after='

for i in range(40):
    res2 = requests.get(url, headers={'User-agent': 'Maithili'})
    not_onion_json = res2.json()
    for i in range(25):
        post_t = []
        post_t.append(not_onion_json['data']['children'][i]['data']['title'])
        post_t.append(not_onion_json['data']['children'][i]['data']['subreddit'])
        posts_test.append(post_t)
    after = not_onion_json['data']['after']
    url = "https://www.reddit.com/r/nottheonion/.json?after=" + after

    time.sleep(3)


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-57-6c1cfdd42421> in <module>
      7     for i in range(25):
      8         post_t = []
----> 9         post_t.append(the_onion['data']['children'][i]['data']['title'])
     10         post_t.append(the_onion['data']['children'][i]['data']['subreddit'])
     11         posts_test.append(post_t)

IndexError: list index out of range
python web-scraping nlp
1 Answer

The reason you stop at 40 is that you told Python to stop at 40:

for i in range(40):

The good news is that you are already collecting the pointer to the next page here:

after = not_onion_json['data']['after']

Assuming that `after == null` once you reach the end of the pages, I'd suggest a while loop. Something like:

while after is not None:

That will keep going until you run out of pages.
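A minimal sketch of that while-loop pagination, with a hypothetical `fake_pages` dict standing in for `res.json()` so it runs without hitting reddit.com (the `get_posts` helper and the page data are illustrative, not part of the question's code):

```python
# Two fake API responses keyed by the 'after' cursor, mimicking the shape of
# the_onion['data']['children'][i]['data'] in the question's code.
fake_pages = {
    '': {'data': {'children': [{'data': {'title': 'a', 'subreddit': 'TheOnion'}}],
                  'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'b', 'subreddit': 'TheOnion'}}],
                      'after': None}},
}

def get_posts(fetch):
    """Collect [title, subreddit] pairs until 'after' comes back as None."""
    posts, after = [], ''
    while after is not None:
        page = fetch(after)
        # Iterate over the children actually returned, not a fixed range(25);
        # this also avoids the IndexError when a page has fewer than 25 posts.
        for child in page['data']['children']:
            posts.append([child['data']['title'], child['data']['subreddit']])
        after = page['data']['after']
    return posts

posts = get_posts(lambda after: fake_pages[after])
```

In real use, `fetch` would wrap `requests.get(url + after, headers=...).json()` plus the `time.sleep(3)` delay.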
