Extracting URLs from a sitemap with Python

Question  Votes: 0  Answers: 2

I need to extract the links from the sitemap at https://wunder.com.tr/sitemap.xml

I wrote the following code:

import requests
from bs4 import BeautifulSoup

wunder = requests.get("https://wunder.com.tr/sitemap.xml")
parcala = BeautifulSoup(wunder.content,"lxml")

links = parcala.find_all("html-tag")
print(links)

But it does not extract any links.

python python-3.x beautifulsoup request
2 Answers
1
Votes
import requests
from bs4 import BeautifulSoup

wunder = requests.get("https://wunder.com.tr/sitemap.xml")
# Use the "xml" parser (needs lxml installed) instead of the HTML parser "lxml"
parcala = BeautifulSoup(wunder.content, "xml")

# Each URL in a sitemap is the text of a <loc> element
urls_from_xml = [loc.get_text(strip=True) for loc in parcala.find_all("loc")]

print(urls_from_xml)
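
For comparison, the same idea can probably be tried without BeautifulSoup by using the standard library's xml.etree.ElementTree. This is only a sketch and not the answerer's method: the function name extract_locs and the inline SAMPLE_SITEMAP are made up for illustration, and the sample stands in for the bytes you would get from requests.get(...).content. The namespace URI is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document standing in for a downloaded sitemap.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def extract_locs(xml_bytes: bytes) -> list[str]:
    """Return the text of every <loc> element in the sitemap namespace."""
    root = ET.fromstring(xml_bytes)
    # ".//sm:loc" finds <loc> at any depth, resolved through SITEMAP_NS.
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(extract_locs(SAMPLE_SITEMAP))
```

Keeping the parsing in a function that takes bytes means it can be checked against a small sample before pointing it at a real site.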

0
Votes

Another solution, using the lxml module:

import lxml.etree
import requests

def get_urls_from_sitemap(sitemap_xml_url: str) -> list[str]:
    response = requests.get(sitemap_xml_url)
    response.raise_for_status()  # stop early on HTTP errors
    sitemap_content = lxml.etree.fromstring(response.content)
    # "{*}loc" matches <loc> in any namespace and skips comment nodes,
    # whose .tag is not a string
    return [element.text.strip() for element in sitemap_content.iter("{*}loc") if element.text]

urls = get_urls_from_sitemap("https://example.com/sitemap.xml")
print(urls)
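
One thing worth checking with either answer: under the sitemaps.org protocol a sitemap.xml may be a sitemap index, in which case its loc entries point to further sitemap files rather than to pages. Whether the site in the question does this is not stated, so the following is only a sketch under that assumption. The names local_name, classify_sitemap and SAMPLE_INDEX are hypothetical, and the standard library parser is used so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap index used only to exercise the function below.
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_sitemap(xml_bytes: bytes) -> tuple[str, list[str]]:
    """Return the root element's local name and all <loc> values."""
    root = ET.fromstring(xml_bytes)
    kind = local_name(root.tag)  # "urlset" or "sitemapindex"
    locs = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text
    ]
    return kind, locs


kind, locs = classify_sitemap(SAMPLE_INDEX)
print(kind, locs)
```

If kind comes back as "sitemapindex", each entry in locs would need to be downloaded and parsed in turn to reach the actual page URLs.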