Extracting URLs from a Google News RSS feed

I have code that extracts the publisher URL from a Google News RSS link, but it no longer works. It looks like Google has changed how the URLs are encoded.

Example URL:

“https://news.google.com/rss/articles/CBMiuwFBVV95cUxPNXpRbXdHR3NzWHRlbk40d1I5OFVIajRHUHBXTUFGc3BMd1gxSEU1ZDlocGFWQXZhWEJYakROLUxQcTBZMElHN1VTdlN2eTV2LWFidnhiIOHZEVEYwLVhEalpx RFRXeGhXQlZoNEc4d1AzTWR3YULUZVAybjZWa2c4MU9kLWU2aEtmNlVnRy1OR3ZLcGd1M0NqVjFxeFRaOE9fWExpa1ZxSFpySnRkallHN3dFMm5nU1BIY18w0gHAAUFVX3lxTE5fZDdFSTQw ZGVzb3A1eUdIbzNIa0F0RmZlYUFmR3lPVnRZU09QU2hnelFpNngxVXI1aGlydWE1dzROcTRXSmw3a1dFZ0c1MDNROUU3enYzSFBPaEdpaHZUR0t1V2lpLWt5UEVEY01TbXRvM243U2p4ZTA3M lBtaU9XYmNqMU11QWdUVkFQRmk5RXJRb2Jwa3p0NUptUlpZanpNZVQxaVk2NkJVdU9kRmxlakVPOEx6ZFlIcG9oaWJMaA?oc=5"

The code that used to work:


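import base64
import functools
import re

# Common prefix of every encoded Google News article link.
_ENCODED_URL_PREFIX = "https://news.google.com/rss/articles/"
# Captures the base64 article ID: everything after the prefix, up to the query string.
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
# Layout this code expects after decoding: a \x08\x13" header, then the publisher URL followed by \xd2\x01.
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


    # Method of a class whose definition is not shown here.
    @functools.lru_cache(2048)
    def _decode_google_news_url(self, url: str) -> str:
        # Pull the base64 article ID out of the link.
        match = _ENCODED_URL_RE.match(url)
        encoded_text = match.groupdict()["encoded_url"]  # type: ignore
        encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
        decoded_text = base64.urlsafe_b64decode(encoded_text)

        # Look for the publisher URL inside the decoded bytes.
        match = _DECODED_URL_RE.match(decoded_text)
        primary_url = match.groupdict()["primary_url"]  # type: ignore
        primary_url = primary_url.decode()
        return primary_url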
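
To make the failure easier to see, here is a minimal sketch of the same decoding step written as a standalone function that returns None instead of raising when the payload does not match. The function name decode_legacy_google_news_url, the example.com payload and the truncated ID are assumptions made for illustration: the payload is hand-built to match the regex above, and the ID is only the first 16 characters of the example URL in the question.

```python
import base64
import re
from typing import Optional

PREFIX = "https://news.google.com/rss/articles/"
ID_RE = re.compile(rf"^{re.escape(PREFIX)}(?P<encoded_url>[^?]+)")
LEGACY_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


def decode_legacy_google_news_url(url: str) -> Optional[str]:
    """Return the embedded publisher URL, or None if the link is not in the old format."""
    id_match = ID_RE.match(url)
    if id_match is None:
        return None
    encoded = id_match.group("encoded_url")
    encoded += "=" * (-len(encoded) % 4)  # restore the stripped base64 padding
    try:
        payload = base64.urlsafe_b64decode(encoded)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    url_match = LEGACY_RE.match(payload)
    if url_match is None:
        return None
    return url_match.group("primary_url").decode()


# Hypothetical old-style payload, built by hand so that it matches LEGACY_RE.
legacy_payload = b'\x08\x13"\x15https://example.com/a\xd2\x01\x00'
legacy_link = PREFIX + base64.urlsafe_b64encode(legacy_payload).decode().rstrip("=") + "?oc=5"
print(decode_legacy_google_news_url(legacy_link))  # https://example.com/a

# First 16 characters of the article ID from the example URL in the question.
new_style_link = PREFIX + "CBMiuwFBVV95cUxP"
print(base64.urlsafe_b64decode("CBMiuwFBVV95cUxP"))  # b'\x08\x13"\xbb\x01AU_yqLO'
print(decode_legacy_google_news_url(new_style_link))  # None
```

The decoded bytes of the new-style ID still begin with the \x08\x13" header, but what follows is a token starting with AU_yqL rather than text starting with http, so the regex has nothing to capture. In the original method that means match is None and the groupdict() call raises AttributeError.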
python-3.x
1 Answer

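In case this is useful to anyone: the "naive" approach of making a GET request and inspecting the redirects (for example, with Python's requests library, calling r = requests.get(url) and then checking r.history) fails.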
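
For reference, the naive check described above can be sketched roughly as follows. The helper names redirect_chain and left_google_news are made up for this example, and fake_get is a hypothetical offline stand-in for requests.get that models a response with no redirects. It is not a recorded response from Google, and the answer does not say exactly how the real request fails.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit


def redirect_chain(url, get=None):
    """Return every URL visited while following redirects, ending with the final one."""
    if get is None:
        import requests  # third-party; imported lazily so the helper can be stubbed
        get = requests.get
    response = get(url, timeout=10)
    # response.history holds the intermediate redirect responses, oldest first.
    return [hop.url for hop in response.history] + [response.url]


def left_google_news(chain):
    """True only if the last URL in the chain is on a host other than news.google.com."""
    return urlsplit(chain[-1]).hostname != "news.google.com"


def fake_get(url, timeout=10):
    # Hypothetical stub: no redirects, final URL is the one that was requested.
    return SimpleNamespace(history=[], url=url)


chain = redirect_chain(
    "https://news.google.com/rss/articles/CBMiuwFBVV95cUxP?oc=5", get=fake_get
)
print(chain, left_google_news(chain))
```

Calling redirect_chain(url) without the get argument uses the real requests.get. If left_google_news returns False for the result, the redirect history never reached the publisher's site, which is the situation this answer reports.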
