Python网页抓取 - 几秒后访问HTML?

问题描述 投票:1回答:1

我正在使用Python访问此站点并刮取HTML:http://forum.toribash.com/tori_spy.php

如您所见,如果您访问该网页,则内容会在几秒钟内发生变化。这是一个页面,显示论坛上的最新帖子,我正在制作一个能够显示最新帖子的Discord机器人。

现在,它会在任何动画/更改发生之前显示该列表中的第一个帖子。

我想知道是否有办法让我跳过动画或让程序在访问之后等待几秒钟才能获取所有HTML。

当前代码:

    if message.content.startswith("-post"):
        await client.send_message(message.channel, ":arrows_counterclockwise: **Accessing forums...**")
        await client.send_typing(message.channel)
        time.sleep(5)
        #access site
        session_requests = requests.session()
        url = "http://forum.toribash.com/tori_spy.php"
        result = session_requests.get(url,headers = dict(referer = url))
        #access html
        tree = html.fromstring(result.content)

        list_stuff=[]
        for atag in tree.xpath("//strong/a"): #search for <strong><a>
            list_stuff.append(atag.text_content())
        await client.send_message(message.channel, ":white_check_mark: Last post was in the thread **"+list_stuff[0]+"**")

非常感谢!

python html python-3.x python-requests
1个回答
0
投票

页面使用ajax / xhr加载新帖子。它使用这样的URL

forum.toribash.com/vaispy.php?do=xml&last=9297850&r=0....

last是最后一条消息的id,你可以在一些HTML标签中找到highestid = 9297850;中的<script>r似乎并不重要 - 至少代码适用于没有r的我。

获得highestid后,您可以使用它来获取最新消息的XML

XML中,您可以将其ID视为<postid>,以便您可以在下一个请求中使用它。

import requests
from lxml import html

s = requests.session()

result = s.get("http://forum.toribash.com/tori_spy.php")
tree = html.fromstring(result.content)

for script in tree.xpath("//script"):
    if script.text and 'highestid' in script.text:
        highestid = script.text.split('\n')[3]
        highestid = highestid[13:-1]
        print('highestid:', highestid)

        result = s.get('http://forum.toribash.com/vaispy.php?do=xml&last='+highestid, headers=dict(referer=url))
        #print(result.text)
        data = html.fromstring(result.content)

        for item in data.xpath('.//event'):
            print('--- event ---')
            print('id:', item.xpath('.//id')[0].text)
            print('postid:', item.xpath('.//postid')[0].text)
            print(item.xpath('.//preview')[0].text)

目前的结果(你的可能会有所不同)

highestid: 9297873
--- event ---
id: 9297883
postid: 9297883
me vende esse full valkyrie por 18k
--- event ---
id: 9297881
postid: 9297881
Congratz Goat! Welcome to the team! :)
--- event ---
id: 9297879
postid: 9297879
Try to reset your email password, then attempt to do what I suggested.
--- event ---
id: 9297877
postid: 9297877
Hello Nope. Most of these bugs are known to currently cause issues and they are being worked on. People pinging and rejoining are bots that are being dealt with (it's just an extensive process to...
--- event ---
id: 9297874
postid: 9297874
Bon courage :)
© www.soinside.com 2019 - 2024. All rights reserved.