I am trying to scrape data from the following website: https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfe/consulta-completa
Step 1: Insert the access key on the consultation page above.
Step 2: After inserting the access key, I need to press the "Pesquisar" button.
In this example, I used the following access key:
52241061585865236600650040001896941930530252
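For reference, the access key ("chave de acesso") is a 44-digit numeric string, and it is the only invoice-specific value in the URLs that follow. A quick sanity check before driving the browser (a minimal sketch; the variable name is my own):

chave_acesso = "52241061585865236600650040001896941930530252"
# NF-e/NFC-e access keys are 44 numeric digits
assert len(chave_acesso) == 44 and chave_acesso.isdigit()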
After pressing "Pesquisar", it returns the following page:
https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfce/render/xml-consulta-completa?g-recaptcha-response=03AFcWeA7_oqqL4KubId8rW_TapI_NSJDOGBzrx_JB2XAtJitNaBl23zLKbjbj45m9eUZam3xp6R57BI47AI0lp_K3KS-CbtpPiTNAHqcxLV-Gnp2Vf778i3NeLMCKNoHpk7IitkwPHvHJjkg1sWRqdTZrHkhVHiMwFbTC4qFw6436ddwu9rRERxOiY532lIoijoHzDga85l7RvbHkyGUdWD7QVlTUNUU-2ztx21cQ_pDDQrxreDFEL8eCR0ijYAMrOtKEXMwqGSuHFTOSkZ83DCJ4S610YWujUukTXbOSdaAuGpeHljf4CsswFLWTKN8UoKTjlEia_I0cO17zgSnY9Z9rQDEZR1Xeq00CDmpbB73m95EOo0prSrL2RcsRnWkPytDIwJUIfsEAcEQ77vuacbNflj_yFpj2GSWVnGQnKXUrY4DsyRhNU6T6usZaYH5kTRb85qvrfm2FqOlgBfLDcvuwB_Q2JqRxyF6-oJlw64Sx2MZzUQC2gZjPtAIRwGCqOS80OkDkTmHZl9x3fM6tOr4fYM6BouHWrnjfyNz99O9bFcQv_bbdyREr1MVgJ6fujSZM6C7WoRJjwTv29kIuGc2l4nMkkilUU6rzK-apAYtgzSim_5T6N_zkvVQfOAo0mlKwjfVLVxCaWQYsGe5MfBe65ZmLVP_lIHnsJe_z0G9CMclmpaKTiynNEMtu_n8d6utw5ot6BHGp9OALHQq2_62hE_TTYMqVlrzugaPxMrTMKnGWd4W_kVPh-VqgqsKxdDW8xFXYtE8OM_WZNRg4m0ESnl4xW5NLZeZGu7onPt3jkw3vCt57YmdAgcHPpIhg0zPA7lNdBrY1zCeCM3edWoatnFng6irasc5R8fheSL2IS0lSUqCfN_cIuC6rYlPUGlU7pREqYe5ZTxHNkyI6GBvWM_pZSO4glw&chaveAcesso=52241061585865236600650040001896941930530252
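The long URL above is just the consultation endpoint with two query parameters: g-recaptcha-response, the token produced when the CAPTCHA is solved, and chaveAcesso, the 44-digit access key. That is presumably why the direct link fails later on, since the token is tied to the solved CAPTCHA and expires. A small standard-library sketch to illustrate the structure (the token is truncated here and only illustrative):

from urllib.parse import urlparse, parse_qs

result_url = ("https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfce/render/xml-consulta-completa"
              "?g-recaptcha-response=03AFcWeA7...&chaveAcesso=52241061585865236600650040001896941930530252")
params = parse_qs(urlparse(result_url).query)
print(list(params))              # ['g-recaptcha-response', 'chaveAcesso']
print(params["chaveAcesso"][0])  # the 44-digit access key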
Step 3: At this point, I need to click "Visualizar NFC-e Detalhada" to finally reach the page that contains the data I want to scrape.
The new path becomes:
https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfce/render/NFCe?chNFe=52241061585865236600650040001896941930530252
The last step is to click "Produtos e Serviços".
Error
If I try to access the page directly from the link https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfce/render/NFCe?chNFe=52241061585865236600650040001896941930530252, or if I try to scrape it with Python, the site blocks me and I cannot run any further searches, not even from a web browser.
I need help scraping the data from that specific page, following all the steps and getting past the site's CAPTCHA and bot blocking.
You can use the following data-scraping approach with selenium and beautifulsoup4:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# WebDriverWait timeout in seconds
timeout = 10
# initialize Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless") # activate headless mode
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
# initialize web driver
driver = webdriver.Chrome(options=chrome_options)
# access target page
driver.get("https://nfeweb.sefaz.go.gov.br/nfeweb/sites/nfe/consulta-completa")
# using WebDriverWait
try:
    # step 1: locate the access-key input field
    input_box = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, "chaveAcesso"))
    )
    # fill in the access key
    input_box.send_keys("52241061585865236600650040001896941930530252")

    # step 2: locate and click the "Pesquisar" button
    submit_button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, "btnPesquisar"))
    )
    submit_button.click()

    # step 3: locate and click the "Visualizar NFC-e Detalhada" button
    next_button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.CLASS_NAME, "btn-view-det"))
    )
    next_button.click()

    # final step: click the "Produtos e Serviços" tab (id "tab_3")
    target_tab = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, "tab_3"))
    )
    target_tab.click()

    # You can use your beautifulsoup4 script to scrape data here...
except Exception as e:
    print(f"Error: {e}")
finally:
    # close the web driver
    driver.quit()
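Where the comment above says to plug in your beautifulsoup4 script, one option is to hand driver.page_source to BeautifulSoup right after target_tab.click() (you may want an extra WebDriverWait for the tab content before parsing). The lookup below is only a sketch under the assumption that the products are rendered in an HTML table; I have not verified the tab's actual markup, so adjust the selectors to what the page really contains:

# parse the rendered page once the "Produtos e Serviços" tab is open
soup = BeautifulSoup(driver.page_source, "html.parser")

# assumption: product data sits in a <table>; inspect the page and replace
# this generic lookup with the real id/class of the products table
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            print(cells)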
The Selenium script above runs the Chrome web driver in headless mode. If you want to watch the process as it runs, you can change the line
driver = webdriver.Chrome(options=chrome_options)
to
driver = webdriver.Chrome()