Scraping data from a website that uses Power BI - retrieving data from Power BI on a website

Question

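I want to scrape data from this page (and similar pages): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, it is hard to find a way to scrape Power BI, because everyone wants to scrape with or into Power BI, not from it. The closest thing I found was a related question, and even that was not relevant.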

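First, I used Apache Tika, and soon realized that the table data is loaded only after the page itself has loaded. I need the rendered version of the page.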

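So I used Selenium. I wanted to start by selecting everything (sending the Ctrl+A key combination), but it does not work. Maybe it is blocked by the page's event handlers (I also tried removing all of the events using the developer tools, but Ctrl+A still does not work).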

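I also tried reading the HTML content, but Power BI places div elements on the screen using position:absolute, and working out where a div sits in the table (both row and column) is laborious.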

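Since Power BI uses JSON, I tried to read the data from there. However, it is so complicated that I could not work out the rules. It seems to store the keywords somewhere and refer to them in the table by index.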
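The pattern described here, where values are stored once and table cells refer to them by index, is commonly called dictionary encoding. A minimal sketch of decoding such a payload follows. The keys `dictionary` and `rows`, the function name `decode_rows`, and the sample values are all hypothetical; this is not Power BI's actual JSON schema, which would have to be inspected in the browser's network tab.

```python
import json

# Hypothetical payload: strings are stored once in "dictionary" and the
# first cell of each row holds an index into it; other cells are literals.
payload = json.loads("""
{
  "dictionary": ["Wheat", "Barley"],
  "rows": [[0, 152.5], [1, 139.0], [0, 150.25]]
}
""")


def decode_rows(data):
    """Replace the index in the first column with the string it refers to."""
    lookup = data["dictionary"]
    return [[lookup[row[0]]] + row[1:] for row in data["rows"]]


table = decode_rows(payload)
```

If the real responses follow a similar idea, the same lookup step would apply once the actual key names and nesting are known.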

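Note: I realized that not all of the data is loaded, or even displayed, at the same time. A div of class scroll-bar-part-bar acts as the scroll bar, and moving it loads/shows other parts of the data.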
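Because only part of the table exists on the page at any moment, one possible approach is to read what is visible, scroll, and repeat until nothing new appears. The sketch below shows only that loop. It takes two caller-supplied functions so that it does not assume any particular Selenium calls; `collect_while_scrolling`, `read_visible`, `scroll_once`, and `max_steps` are hypothetical names.

```python
def collect_while_scrolling(read_visible, scroll_once, max_steps=100):
    """Accumulate unique items across scroll steps, keeping first-seen order."""
    seen = set()
    collected = []
    for _ in range(max_steps):
        found_new = False
        for item in read_visible():
            if item not in seen:
                seen.add(item)
                collected.append(item)
                found_new = True
        if not found_new:
            break  # this view showed nothing new, so assume the end was reached
        scroll_once()
    return collected
```

Items must be hashable and should identify a row uniquely (for example, a whole row as a tuple), since identical items are merged. With Selenium, `read_visible` might wrap the element lookup and `scroll_once` might move the scroll-bar-part-bar element.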

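The code I used to read the data is below. As mentioned, the order of the resulting data differs from the order rendered in the browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# Selenium 3 style; newer Selenium versions take a Service object instead of executable_path
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")

driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
# The container of the table visual; every cell is one of its descendants
parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')
# Each cell carries its text in the title attribute
values = [child.get_attribute('title') for child in children]

I would appreciate a solution to any of the problems above. What interests me most is the convention Power BI uses to store its data in JSON format.

Answer 1

Setting the scrolling and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (as done in the question):

parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')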

Then sort them using their location. To put what was read into row order, this code may help:

import numpy as np

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
# np.lexsort treats the last key as primary: order by y (row), then by x (column)
index = np.lexsort((x, y))
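Once the elements are ordered by position, they still have to be split into rows. A minimal sketch of that step follows, assuming each element has already been reduced to an (x, y, title) tuple and that cells in the same row share nearly the same y. The function name `group_into_rows` and the `tolerance` value are hypothetical and are not part of the original answer.

```python
def group_into_rows(cells, tolerance=2):
    """Group (x, y, title) tuples into rows, top to bottom, left to right."""
    rows = []
    current = []      # (x, title) pairs of the row being built
    current_y = None  # y of the first cell in the current row

    def flush():
        # Order the finished row by x so columns come out left to right
        rows.append([title for _, title in sorted(current, key=lambda p: p[0])])

    for x, y, title in sorted(cells, key=lambda c: (c[1], c[0])):
        if not title:
            continue  # skip container elements that carry no text
        if current and abs(y - current_y) > tolerance:
            flush()
            current = []
        if not current:
            current_y = y
        current.append((x, title))
    if current:
        flush()
    return rows
```

With Selenium, the input could be built as `[(c.location['x'], c.location['y'], c.get_attribute('title')) for c in children]`. The tolerance absorbs small pixel differences between cells of the same row and may need adjusting for a given page.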
