Python: problem extracting href

Question (votes: 1, answers: 1)

I am trying to get every href from a web page. The problem is that I cannot extract the full value of hrefs like this one:

<a href="#!DetalleNorma/203906/20190322" title="" data-bind="html: organismo, attr: {href: $root.crearHrefDetalleNorma(idTamite,fechaPublicacion)} ">SECRETARÍA GENERAL</a>

All I can extract is: #!

from bs4 import BeautifulSoup
import urllib.request as urllib2

html_page = urllib2.urlopen('https://www.boletinoficial.gob.ar/')
soup = BeautifulSoup(html_page, 'html.parser')  # name the parser explicitly
for link in soup.find_all('a'):
    print(link.get('href'))  # print() is a function in Python 3

Here is a second attempt at parsing the page. It does not work either:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.boletinoficial.gob.ar/')
soup = BeautifulSoup(r.content, "html.parser")

for td in soup.find_all("div", class_="itemsection"):
    for a in td.find_all("a", href=True):
        print(a.text)  # prints the link text; a["href"] holds the URL
python web-scraping
1 Answer
Votes: 1

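I had to use Selenium with a wait condition. The data-bind attribute in your example suggests that the href is filled in by JavaScript after the page loads, so parsing the raw HTML only ever sees the #! placeholder: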

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.boletinoficial.gob.ar/')
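wait = WebDriverWait(driver, 20)
# Wait up to 20 seconds for the matching elements to be present in the DOM
items = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".itemsection [href]")))
links = [item.get_attribute('href') for item in items]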
print(links)
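
If you then need the pieces of each link rather than the whole URL, the list can be post-processed with the standard library. This is only a sketch: `parse_detalle_norma` is a hypothetical helper name, and it assumes every link of interest has the same `#!DetalleNorma/<id>/<date>` shape as the one in the question.

```python
from urllib.parse import urlsplit


def parse_detalle_norma(url):
    """Return (norm_id, date) for a '#!DetalleNorma/<id>/<date>' link, else None.

    Hypothetical helper; the fragment layout is assumed from the single
    example href shown in the question.
    """
    # urlsplit puts everything after '#' into .fragment,
    # e.g. '!DetalleNorma/203906/20190322'
    fragment = urlsplit(url).fragment
    parts = fragment.split("/")
    if len(parts) == 3 and parts[0] == "!DetalleNorma":
        return parts[1], parts[2]
    return None


example = "https://www.boletinoficial.gob.ar/#!DetalleNorma/203906/20190322"
print(parse_detalle_norma(example))
```

Applied to the Selenium result, something like `[parse_detalle_norma(u) for u in links if u]` would give one tuple per matching link and `None` for the rest.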

To get the link and its text together as tuples:

items = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".itemsection [href]")))
data = [(item.get_attribute('href'), item.text) for item in items]
print(data)
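
As background on why the attempts in the question return only `#!`, here is a small offline sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs. The `static_html` string is an assumption about what the server sends before any JavaScript runs, and `HrefCollector` is a made-up name. A static parser reports the attribute exactly as written and never evaluates the `data-bind` expression.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, exactly as written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


# Assumed shape of the markup before JavaScript fills in the real href
static_html = (
    '<a href="#!" data-bind="html: organismo, attr: {href: '
    '$root.crearHrefDetalleNorma(idTamite,fechaPublicacion)}"></a>'
)

collector = HrefCollector()
collector.feed(static_html)
print(collector.hrefs)  # only the placeholder is visible to a static parser
```

A browser driven by Selenium runs the page's scripts, which is why the wait condition above can return the completed links.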