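I'm trying to scrape data from this site: https://data.anbima.com.br/debentures/AALM11/agenda?page=1&size=100&
When I look at DevTools > Elements, the page has a TABLE tag with the data (dates, values, etc.) inside TR and TD tags. But when I try to parse the HTML with Selenium or bs4, the data is gone and instead I see a <div class="skeleton-container"> placeholder in each cell. What can I do to extract the information I need?
My code: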
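from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
deb = 'AALM11'
link_agenda = 'https://data.anbima.com.br/debentures/' + deb + '/agenda?page=1&size=100'
driver.get(link_agenda)
# The table is read immediately after get() returns
html_source = driver.find_element(By.TAG_NAME, 'table').get_attribute('outerHTML')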
Result:
<table id="" class="anbima-ui-table anbima-ui-table-responsive anbima-ui-table-mobile">
<thead>
<tr>
<th><span style="width: 80px;"><div class="skeleton-container" aria-hidden="true" style="width: 80px; height: 18px; margin-top: 0px;"></div></span></th>
<th><span style="width: 110px;"><div class="skeleton-container" aria-hidden="true" style="width: 100px; height: 18px; margin-top: 0px;"></div></span></th>
<th><span style="width: 110px;"><div class="skeleton-container" aria-hidden="true" style="width: 45px; height: 18px; margin-top: 0px;"></div></span></th>
<th><span style="width: 110px;"><div class="skeleton-container" aria-hidden="true" style="width: 90px; height: 18px; margin-top: 0px;"></div></span></th>
<th><span style="width: 110px;"><div class="skeleton-container" aria-hidden="true" style="width: 55px; height: 18px; margin-top: 0px;"></div></span></th>
<th><span style="width: 80px;"><div class="skeleton-container" aria-hidden="true" style="width: 45px; height: 18px; margin-top: 0px;"></div></span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 75px; height: 18px; margin-top: 0px;"></div></span></td>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 75px; height: 18px; margin-top: 0px;"></div></span></td>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 125px; height: 18px; margin-top: 0px;"></div></span></td>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 75px; height: 18px; margin-top: 0px;"></div></span></td>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 100px; height: 18px; margin-top: 0px;"></div></span></td>
<td><span><div class="skeleton-container" aria-hidden="true" style="width: 100px; height: 18px; margin-top: 0px;"></div></span></td>
</tr>
...
I expected to see this:
<table id="" class="anbima-ui-table anbima-ui-table-responsive agenda-ativo-page__table--liquidado-1 agenda-ativo-page__table--liquidado-2 agenda-ativo-page__table--liquidado-3 agenda-ativo-page__table--liquidado-4 agenda-ativo-page__table--liquidado-5 agenda-ativo-page__table--liquidado-6 agenda-ativo-page__table--liquidado-7 agenda-ativo-page__table--liquidado-8 agenda-ativo-page__table--liquidado-9 agenda-ativo-page__table--liquidado-10 ">
<thead>
<tr>
<th><span style="width: 80px;">Data do evento</span></th>
<th><span style="width: 110px;">Data de liquidação</span></th>
<th><span style="width: 110px;">Evento</span></th>
<th><span style="width: 110px;">Percentual / Taxa</span></th>
<th><span style="width: 110px;">Valor pago</span></th>
<th><span style="width: 80px;">Status</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><span id="agenda-data-evento-0" class="normal-text">13/01/2022</span></td>
<td><span id="agenda-data-liquidacao-0" class="normal-text">13/01/2022</span></td>
<td><span id="agenda-evento-0" class="normal-text">Pagamento de juros</span></td>
<td><span id="agenda-taxa-0" class="normal-text">4,3500 %</span></td>
<td><span id="agenda-valor-0" class="normal-text">R$ 53,434259</span></td>
<td><span id="agenda-status-0" class="anbima-ui-flag anbima-ui-flag--small anbima-ui-flag--small--green " style="max-width: 96px;"><label class="flag__children">Liquidado</label></span></td>
</tr>
...
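The problem is that the table data is loaded dynamically. Once the initial document has loaded, the browser signals Selenium that the page is ready, but the table contents are still being fetched in the background. Your code therefore runs against a partially loaded page and captures the placeholders instead of the data. To fix this, we need to wait for something that indicates the data has actually arrived. I chose to wait for the
<div class="skeleton-container" ...>
elements to disappear. Once they are gone, the table has finished loading and its data is available.
Working code: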
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.maximize_window()

deb = 'AALM11'
link_agenda = 'https://data.anbima.com.br/debentures/' + deb + '/agenda?page=1&size=100'
driver.get(link_agenda)

# Wait up to 10 seconds for the skeleton placeholder to be hidden or removed.
# Note: this condition checks the first element matching the locator.
wait = WebDriverWait(driver, 10)
wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.skeleton-container")))

# By now the table should contain the real data.
table = driver.find_element(By.CSS_SELECTOR, "table")
print(table.get_attribute('outerHTML'))

driver.quit()
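Once the table's outerHTML holds real data, the remaining step is turning it into rows. The following is only a rough sketch of one way to do that, using the standard library's html.parser so it has no extra dependencies (bs4 would work just as well). The names TableRows and parse_table are made up for this illustration, and the sample string is a shortened copy of the loaded table from the question, not live data.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return (header, records): header is a list, records a list of dicts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return [], []
    header, *body = parser.rows
    return header, [dict(zip(header, row)) for row in body]


# Sample input: a shortened copy of the loaded table shown in the question.
sample = """
<table><thead><tr>
<th><span>Data do evento</span></th><th><span>Evento</span></th><th><span>Status</span></th>
</tr></thead><tbody><tr>
<td><span id="agenda-data-evento-0" class="normal-text">13/01/2022</span></td>
<td><span id="agenda-evento-0" class="normal-text">Pagamento de juros</span></td>
<td><span id="agenda-status-0"><label class="flag__children">Liquidado</label></span></td>
</tr></tbody></table>
"""
header, records = parse_table(sample)
print(header)
print(records)
```

In the working code above, you would pass table.get_attribute('outerHTML') to parse_table instead of the sample string. If the cells come back empty, the skeleton placeholders were probably still in place when the HTML was captured.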