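The code below finds the table but returns only the header, even though the page shows many rows of data when I open the URL in a browser.
Also, if anyone knows how to interact with the date parameters (see image), I'd appreciate it :). I tried passing the parameters with the request, but it didn't work.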
import pandas as pd
df = pd.read_html('https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/')
df[0]
# returns:
DATA ABERTURA FECHAMENTO VARIAÇÃO MÍNIMO MÁXIMO VOLUME
# I tried with requests, but also got an empty DataFrame:
import requests
url = 'https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/'
params = dict(page=0, numberItems=99999, initialDate='01/01/2022', finalDate='31/12/2022',
              symbol='IFNC')
r = requests.post(url=url, data=params)
df = pd.read_html(r.text)
df[0]
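The table contents only become visible once the page has been loaded in a browser. You should use a web driver with Selenium. You can even find elements by id/XPath and interact with them (type text, click, ...):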
from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')
# Selenium 4 expects `options=`; the old `chrome_options=` keyword was deprecated and later removed
browser = webdriver.Chrome(options=chrome_options)
url = 'https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/'
browser.get(url)
# fill in the date range
browser.find_element(By.ID, 'dateMin').send_keys("02/03/2022")
browser.find_element(By.ID, 'dateMax').send_keys("02/03/2023")
# wait until button is clickable
button = WebDriverWait(browser, 10).until(
    expected_conditions.element_to_be_clickable((By.ID, 'see_all_quotes_history'))
)
button.click()
# wait until all table rows are visible
WebDriverWait(browser, 10).until(
    expected_conditions.visibility_of_all_elements_located((By.XPATH, "//tbody/tr/td[@class='sorting_1']"))
)
# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(browser.page_source), attrs={'id': 'quotes_history'})[0]
browser.quit()
Output:
DATA ABERTURA FECHAMENTO VARIAÇÃO MÍNIMO MÁXIMO VOLUME
0 01/03/2023 9.824 9.687 -140 9.564 9.832 n/d
1 01/03/2023 9.824 9.687 -140 9.564 9.832 n/d
2 28/02/2023 9.821 9.824 4 9.779 9.952 n/d
3 28/02/2023 9.821 9.824 4 9.779 9.952 n/d
4 27/02/2023 9.882 9.821 -62 9.792 9.927 n/d
.. ... ... ... ... ... ... ...
296 08/03/2022 10.045 10.037 -8 9.954 10.225 n/d
297 07/03/2022 10.390 10.045 -332 10.001 10.391 n/d
298 04/03/2022 10.624 10.390 -219 10.283 10.624 n/d
299 03/03/2022 10.515 10.623 103 10.475 10.748 n/d
300 02/03/2022 10.640 10.515 -118 10.447 10.703 n/d
[301 rows x 7 columns]
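One thing to watch in the output above: values such as -140 in VARIAÇÃO look like the site's Brazilian-formatted numbers (the API response further down shows '-1,40') with the comma dropped, because read_html treats "," as a thousands separator by default. A minimal sketch of one way around that, assuming the site uses "." for thousands and "," for decimals; the HTML snippet is a made-up stand-in for browser.page_source, not the real page:

```python
from io import StringIO

import pandas as pd

# Hypothetical, trimmed-down stand-in for the rendered table.
sample_html = """
<table id="quotes_history">
  <thead><tr><th>DATA</th><th>ABERTURA</th><th>VARIAÇÃO</th></tr></thead>
  <tbody>
    <tr><td>01/03/2023</td><td>9.824</td><td>-1,40</td></tr>
    <tr><td>19/12/2022</td><td>9.341</td><td>3,69</td></tr>
  </tbody>
</table>
"""

# Tell pandas which characters the page uses as separators.
df = pd.read_html(
    StringIO(sample_html),
    attrs={"id": "quotes_history"},
    decimal=",",
    thousands=".",
)[0]
print(df)
```

If that assumption about the number format holds, the same two keyword arguments can be added to the read_html call in the Selenium snippet.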
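Are you sure you used the right URL when you tried requests?
Following the same path and checking the Network tab in the browser's inspector led me to the API endpoint below, which is strikingly different from yours. Shouldn't it be exactly the URL shown for the history request in your screenshot?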
import requests
payload = {'page':0, 'numberItems':50, 'symbol':'IFNC'}
r = requests.post('https://www.infomoney.com.br/wp-json/infomoney/v1/quotes/history', json=payload)
r.status_code
Output:
200
r.json()
Output:
[[{'display': '01/03/2023', 'timestamp': '1677628800'},
'9.824',
'9.687',
'-1,40',
'9.564',
'9.832',
'n/d'],
...
[{'display': '19/12/2022', 'timestamp': '1671408000'},
'9.341',
'9.686',
'3,69',
'9.341',
'9.737',
'n/d']]
From there you should be able to create your DataFrame, for example with pandas.json_normalize().
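Since the response shown above is a list of lists whose first cell is a dict, building the frame by hand may be simpler than normalizing it. A rough sketch, assuming the column order matches the table header from the question; history_to_frame is an illustrative helper name, and sample copies two rows from the output above instead of calling the API:

```python
import pandas as pd

# Column names taken from the table header in the question (assumed order).
COLUMNS = ["DATA", "ABERTURA", "FECHAMENTO", "VARIAÇÃO", "MÍNIMO", "MÁXIMO", "VOLUME"]


def history_to_frame(rows):
    """Flatten the list-of-lists response into a DataFrame."""
    records = []
    for first, *rest in rows:
        # The first cell is a dict; keep only its human-readable date.
        records.append([first["display"], *rest])
    frame = pd.DataFrame(records, columns=COLUMNS)
    frame["DATA"] = pd.to_datetime(frame["DATA"], format="%d/%m/%Y")
    return frame


# Two rows copied from the r.json() output above.
sample = [
    [{"display": "01/03/2023", "timestamp": "1677628800"},
     "9.824", "9.687", "-1,40", "9.564", "9.832", "n/d"],
    [{"display": "19/12/2022", "timestamp": "1671408000"},
     "9.341", "9.686", "3,69", "9.341", "9.737", "n/d"],
]
df = history_to_frame(sample)
print(df)
```

The numeric columns stay as strings here; with a real response you would pass r.json() instead of sample.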
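Using dates is as simple as adding these fields to the payload, for example:
from datetime import date
date_format = '%d/%m/%Y'  # renamed to avoid shadowing the built-in format()
payload = {'page': 0, 'numberItems': 50,
           'initialDate': date(2023, 1, 1).strftime(date_format),
           'finalDate': date(2023, 1, 8).strftime(date_format),
           'symbol': 'IFNC'}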
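To tie the pieces together, here is a small sketch that wraps the payload construction and the POST in two functions. The names build_payload and fetch_history are made up for illustration, the timeout value is arbitrary, and the endpoint's behavior beyond what is shown above is not something I have verified:

```python
from datetime import date

API_URL = "https://www.infomoney.com.br/wp-json/infomoney/v1/quotes/history"
DATE_FORMAT = "%d/%m/%Y"


def build_payload(symbol, start, end, page=0, number_items=50):
    """Build the JSON body, using the field names from the payload above."""
    if start > end:
        raise ValueError("start must not be after end")
    return {
        "page": page,
        "numberItems": number_items,
        "initialDate": start.strftime(DATE_FORMAT),
        "finalDate": end.strftime(DATE_FORMAT),
        "symbol": symbol,
    }


def fetch_history(symbol, start, end, **kwargs):
    """POST the payload and return the decoded JSON (performs a network call)."""
    import requests  # imported lazily so build_payload works without it

    response = requests.post(
        API_URL, json=build_payload(symbol, start, end, **kwargs), timeout=30
    )
    response.raise_for_status()
    return response.json()


payload = build_payload("IFNC", date(2023, 1, 1), date(2023, 1, 8))
print(payload)
```

Calling fetch_history("IFNC", date(2023, 1, 1), date(2023, 1, 8)) would then send the request; raise_for_status() makes a non-200 response fail loudly instead of silently yielding an empty result.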