Pandas read_html returns only the header

Question (votes: 0, answers: 2)

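The following code finds the table but returns only the header row, even though the page shows many rows of data when I open the URL in a browser.

Also, if anyone knows how to interact with the date parameters (see image), I would appreciate it :). I tried passing the parameters with the request, but it did not work.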

import pandas as pd
import requests

df = pd.read_html('https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/')
df[0]

# returns only the header row:
# DATA    ABERTURA    FECHAMENTO  VARIAÇÃO    MÍNIMO  MÁXIMO  VOLUME

# I tried with requests, but also got an empty DataFrame:

url = 'https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/'
params = dict(page=0, numberItems=99999, initialDate='01/01/2022',
              finalDate='31/12/2022', symbol='IFNC')
r = requests.post(url=url, data=params)
df = pd.read_html(r.text)
df[0]

pandas dataframe post python-requests-html
2 Answers
1
vote

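The table contents only become visible once the page has been loaded in a browser, so you should use a web driver with selenium. You can even find elements by id/XPath and interact with them (enter text, click, ...):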

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')
# `options=` replaces the deprecated `chrome_options=` keyword (removed in recent Selenium 4 releases)
browser = webdriver.Chrome(options=chrome_options)
url = 'https://www.infomoney.com.br/cotacoes/b3/indice/ifnc/historico/'

browser.get(url)

browser.find_element(By.ID, 'dateMin').send_keys("02/03/2022")
browser.find_element(By.ID, 'dateMax').send_keys("02/03/2023")

# wait until button is clickable
button = WebDriverWait(browser, 10).until(
        expected_conditions.element_to_be_clickable((By.ID, 'see_all_quotes_history'))
    )
button.click()

# wait until all table rows are visible
WebDriverWait(browser, 10).until(
        expected_conditions.visibility_of_all_elements_located((By.XPATH, "//tbody/tr/td[@class='sorting_1']"))
    )

df = pd.read_html(browser.page_source, attrs = {'id': 'quotes_history'})[0]

Output:

            DATA ABERTURA  FECHAMENTO  VARIAÇÃO  MÍNIMO  MÁXIMO VOLUME
0    01/03/2023    9.824       9.687      -140   9.564   9.832    n/d
1    01/03/2023    9.824       9.687      -140   9.564   9.832    n/d
2    28/02/2023    9.821       9.824         4   9.779   9.952    n/d
3    28/02/2023    9.821       9.824         4   9.779   9.952    n/d
4    27/02/2023    9.882       9.821       -62   9.792   9.927    n/d
..          ...      ...         ...       ...     ...     ...    ...
296  08/03/2022   10.045      10.037        -8   9.954  10.225    n/d
297  07/03/2022   10.390      10.045      -332  10.001  10.391    n/d
298  04/03/2022   10.624      10.390      -219  10.283  10.624    n/d
299  03/03/2022   10.515      10.623       103  10.475  10.748    n/d
300  02/03/2022   10.640      10.515      -118  10.447  10.703    n/d

[301 rows x 7 columns]
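
One thing to watch in the output above: the VARIAÇÃO column shows -140, while the API response quoted in the other answer reports '-1,40' for the same day. That suggests read_html applied its default separators (thousands=',' and decimal='.') to numbers written the Brazilian way. The sketch below illustrates the difference on a hypothetical one-row table in the same format; it assumes an HTML parser such as lxml is installed. The `decimal` and `thousands` arguments are existing read_html parameters, while the sample table is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical one-row table using the same pt-BR number format as the page.
html = """
<table id="quotes_history">
  <thead>
    <tr><th>DATA</th><th>ABERTURA</th><th>FECHAMENTO</th><th>VARIAÇÃO</th></tr>
  </thead>
  <tbody>
    <tr><td>01/03/2023</td><td>9.824</td><td>9.687</td><td>-1,40</td></tr>
  </tbody>
</table>
"""

# Default separators: ',' is treated as a thousands separator, '.' as the decimal point.
default = pd.read_html(StringIO(html))[0]

# pt-BR separators: '.' groups thousands, ',' marks the decimal point.
localized = pd.read_html(StringIO(html), decimal=",", thousands=".")[0]

print(default)
print(localized)
```

The same two keyword arguments could be added to the `pd.read_html(browser.page_source, ...)` call in the answer, if the goal is to keep the values as the page displays them.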

0
votes

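Are you sure you used the right URL when you tried requests?

Following the same path and checking the "Network" tab in the browser inspector led me to the API endpoint below. It is surprisingly different from yours, even though it should be exactly the URL shown for the history request in your screenshot.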

import requests

payload = {'page':0, 'numberItems':50, 'symbol':'IFNC'}
r = requests.post('https://www.infomoney.com.br/wp-json/infomoney/v1/quotes/history', json=payload)
r.status_code

Output:

200

r.json()

Output:

[[{'display': '01/03/2023', 'timestamp': '1677628800'},
  '9.824',
  '9.687',
  '-1,40',
  '9.564',
  '9.832',
  'n/d'],
...
 [{'display': '19/12/2022', 'timestamp': '1671408000'},
  '9.341',
  '9.686',
  '3,69',
  '9.341',
  '9.737',
  'n/d']]

You can then certainly use pandas.json_normalize() to create your DataFrame.
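
As a rough illustration of that last step: json_normalize is designed for records made of dicts, whereas each row here is a list whose first item is a dict, so a plain DataFrame constructor may be simpler. The sketch below uses the two rows shown above. The column names and their order are an assumption taken from the table header in the question, and `to_number` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Two rows copied from the response shown above.
rows = [
    [{"display": "01/03/2023", "timestamp": "1677628800"},
     "9.824", "9.687", "-1,40", "9.564", "9.832", "n/d"],
    [{"display": "19/12/2022", "timestamp": "1671408000"},
     "9.341", "9.686", "3,69", "9.341", "9.737", "n/d"],
]

# Assumed column order, taken from the table header on the page.
columns = ["DATA", "ABERTURA", "FECHAMENTO", "VARIAÇÃO", "MÍNIMO", "MÁXIMO", "VOLUME"]


def to_number(text):
    """Convert a pt-BR number such as '9.824' or '-1,40'; NaN for values like 'n/d'."""
    try:
        return float(text.replace(".", "").replace(",", "."))
    except ValueError:
        return float("nan")


df = pd.DataFrame(rows, columns=columns)

# The first cell of each row is a dict; keep its display date and parse it.
df["DATA"] = pd.to_datetime(
    df["DATA"].map(lambda cell: cell["display"]), format="%d/%m/%Y"
)

# The remaining cells are strings in pt-BR number format.
for name in columns[1:]:
    df[name] = df[name].map(to_number)

print(df)
```

With a live response, `rows` would be replaced by `r.json()`, assuming the endpoint keeps returning the shape shown above.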

Using dates is as simple as adding those fields to the payload, for example:

from datetime import date

date_format = '%d/%m/%Y'  # day/month/year
payload = {'page': 0, 'numberItems': 50, 'symbol': 'IFNC',
           'initialDate': date(2023, 1, 1).strftime(date_format),
           'finalDate': date(2023, 1, 8).strftime(date_format)}
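
If several date ranges are needed, the payload construction can be wrapped in a small function. This is only a sketch: `build_payload` is a hypothetical helper, and the field names are the ones used in the payload above. Whether the endpoint accepts other page sizes or enforces a maximum range is not something this thread confirms.

```python
from datetime import date

DATE_FORMAT = "%d/%m/%Y"  # day/month/year, matching the payload above


def build_payload(symbol, start, end, page=0, number_items=50):
    """Build the request body for one page of quote history."""
    return {
        "page": page,
        "numberItems": number_items,
        "initialDate": start.strftime(DATE_FORMAT),
        "finalDate": end.strftime(DATE_FORMAT),
        "symbol": symbol,
    }


payload = build_payload("IFNC", date(2023, 1, 1), date(2023, 1, 8))
print(payload)

# The payload would then be sent as in the answer above, for example:
#   r = requests.post(
#       "https://www.infomoney.com.br/wp-json/infomoney/v1/quotes/history",
#       json=payload, timeout=30)
#   r.raise_for_status()
```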