I need to extract the links from the sitemap at https://wunder.com.tr/sitemap.xml
I wrote some code:
import requests
from bs4 import BeautifulSoup
wunder = requests.get("https://wunder.com.tr/sitemap.xml")
parcala = BeautifulSoup(wunder.content,"lxml")
links = parcala.find_all("html-tag")
print(links)
But it does not extract the links.
import requests
from bs4 import BeautifulSoup
wunder = requests.get("https://wunder.com.tr/sitemap.xml")
parcala = BeautifulSoup(wunder.content, "xml")
urls_from_xml = []
loc_tags = parcala.find_all('loc')
for loc in loc_tags:
    urls_from_xml.append(loc.get_text())
print(urls_from_xml)
Another solution, using the lxml module:
import lxml.etree
import requests
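def get_urls_from_sitemap(sitemap_xml_url: str) -> list[str]:
    response = requests.get(sitemap_xml_url)
    # Parse the raw bytes so lxml can honor the encoding declared in the XML.
    sitemap_content = lxml.etree.fromstring(response.content)
    # Tags are namespace-qualified ("{...}loc"), hence endswith("loc").
    # Comments have a non-string tag, so they are skipped.
    urls = [
        str(element.text)
        for element in sitemap_content.iter()
        if isinstance(element.tag, str) and element.tag.endswith("loc")
    ]
    return urls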
urls = get_urls_from_sitemap("https://example.com/sitemap.xml")
print(urls)
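If installing lxml is not an option, the same idea can be sketched with only the standard library. This is a minimal sketch, not a drop-in replacement: it assumes the sitemap uses the namespace from the sitemaps.org protocol, and `extract_locs` and `SAMPLE_SITEMAP` are hypothetical names, with the inline sample standing in for `response.content` so the snippet runs without a network call.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be the one in use).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample standing in for a downloaded sitemap body.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about </loc></url>
</urlset>"""


def extract_locs(sitemap_bytes: bytes) -> list:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_bytes)
    # ".//sm:loc" matches <loc> at any depth, resolved through the namespace map.
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


sample_urls = extract_locs(SAMPLE_SITEMAP)
print(sample_urls)
```

With a real site, pass `requests.get(sitemap_url).content` to `extract_locs` instead of the sample. Matching on the explicit namespace is stricter than `endswith("loc")`: it will not pick up unrelated elements whose names merely end in "loc", but it also returns nothing if the sitemap declares a different namespace.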