Scraping a paginated table where the URL stays the same


I am trying to scrape the paginated table at "https://tariffs.ib-net.org/sites/IBNET/TariffTable#", where the URL stays the same across all pages.

I have looked at the following posts for help but could not find a solution:

https://stackoverflow.com/questions/70180819/scrape-a-table-data-from-a-pagulated-webpage-where-the-url-does-not-change-but-t

https://stackoverflow.com/questions/42635015/using-selenium-to-scrape-a-table-across-multiple-pages-when-the-url-doesnt-chan

https://stackoverflow.com/questions/73362475/using-selenium-to-scrape-pagulated-table-data-python

https://stackoverflow.com/questions/75479097/scrape-multiple-pages-with-the-same-url-using-python-selenium

Interestingly, according to "Scraping a dynamic data table over many pages with the same URL", the data can apparently be pulled as a JSON file, but I do not know how to do that.

Any help would be much appreciated.

My attempts:

Code 1:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait

options = Options()
# options.add_argument('--headless')
#options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(options=options)

url = 'https://tariffs.ib-net.org/sites/IBNET/TariffTable#'
driver.get(url)
time.sleep(10)
wait = WebDriverWait(driver, 10)
x=driver.find_element(By.XPATH,'//*[@id="datatab_length"]/label/select')
drop=Select(x)

drop.select_by_visible_text("100")
time.sleep(10)

data_list = []
while True:
    # Re-locate the table rows on every pass; the previous references go
    # stale once DataTables redraws the table after a page change
    wait.until(EC.presence_of_element_located((By.ID, "datatab")))
    rows = driver.find_elements(By.XPATH, "//table[@id='datatab']/tbody/tr")

    for row in rows:
        columns = row.find_elements(By.TAG_NAME, "td")
        data_list.append([col.text.strip() for col in columns])
    print(data_list)

    # On the last page DataTables adds the 'disabled' class to the "next"
    # control, so this exact-class match finds nothing and the loop ends
    next_button = driver.find_elements(By.XPATH, "//*[@class='paginate_button next']/a")
    if not next_button:
        break
    next_button[0].click()
    time.sleep(10)

Code 2:

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')
options.add_argument("disable-gpu")
browser = webdriver.Chrome(options=options)  # pass the options so they actually take effect
browser.maximize_window()
actions=ActionChains(browser)
browser.get("https://tariffs.ib-net.org/sites/IBNET/TariffTable#")
time.sleep(5)

table_header= browser.find_elements(By.XPATH,"//table[@id='datatab']/thead")
header_row = []
for header in table_header:
    header_row.append(header.text)
#print(header_row)

utility=[]
city=[]
service=[]
date=[]
fifteenm3=[]
fiftym3=[]
hundredm3=[]

while True:
    # Each <tr> in the table body is one data row; read its cells directly
    all_rows = browser.find_elements(By.XPATH, "//table[@id='datatab']/tbody/tr")
    for row in all_rows:
        cells = row.find_elements(By.TAG_NAME, "td")
        utility.append(cells[0].text)
        city.append(cells[1].text)
        service.append(cells[2].text)
        date.append(cells[3].text)
        fifteenm3.append(cells[4].text)
        fiftym3.append(cells[5].text)
        hundredm3.append(cells[6].text)

    # On the last page the "next" control gains the 'disabled' class, so the
    # exact-class match below returns nothing and the loop can terminate
    next_button = browser.find_elements(By.XPATH, "//*[@class='paginate_button next']/a")
    if not next_button:
        break
    next_button[0].click()
    time.sleep(5)
        
df=pd.DataFrame()
df['Utility']=utility
df['service']=service
df['city']=city
df['date']=date
df['15m3']=fifteenm3
df['50m3']=fiftym3
df['100m3']=hundredm3

df.to_csv('data.csv')

The code above either runs through a few pages and then times out (even with sleeps added in between), or runs normally for a while and then keeps looping over the same page.

python-3.x selenium-webdriver web-scraping selenium-chromedriver
1 Answer

The answer given in "Scraping a dynamic data table over many pages with the same URL" applies here as well. At a high level, to get the exact URL and parameter values, you need to do the following:

  1. Go to the Network tab in the developer console and clear all current logs.
  2. Click the table's next-page button on the web page and find the request the browser sends in the Network tab.
  3. Right-click that request and copy it as fetch or curl (whichever you prefer).
  4. Now that you have the request and its header information, go to your code and paste the request in (converting it, of course, to use whatever HTTP library is available in your language).
  5. Change the request parameters to get the behaviour you want. In this case, if the length parameter is large enough, you should get all the data in one go. If that is too slow or unreliable, you can fetch the entries in batches of 100 and retry any pages whose requests get dropped.

Here is Python code generated from the request my browser sent: https://pym.dev/p/2mbnx/. You can simplify it, since much of that header information does not need to be sent.
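For illustration only, here is a minimal sketch of steps 4 and 5, assuming the copied request turns out to be a standard DataTables server-side call that returns JSON. The AJAX_URL value and the parameter names (draw, start, length) are assumptions/placeholders; the real URL, method, and parameter names must come from the request you copied in step 3, not from this sketch.

import requests
import pandas as pd

# Placeholder: replace with the exact URL copied from the Network tab in step 3
AJAX_URL = "https://tariffs.ib-net.org/..."

session = requests.Session()

def fetch_batch(start, length=100):
    # DataTables-style server-side endpoints typically take paging parameters
    # like these; check the copied request for the names it actually uses
    params = {"draw": 1, "start": start, "length": length}
    response = session.get(AJAX_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

all_rows = []
start = 0
while True:
    payload = fetch_batch(start)
    batch = payload.get("data", [])  # DataTables responses usually put the rows under "data"
    if not batch:
        break
    all_rows.extend(batch)
    start += len(batch)

pd.DataFrame(all_rows).to_csv("tariffs.csv", index=False)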

I am not sure how familiar you are with the developer tools, so if you get stuck on any of these steps, feel free to ask in the comments under this answer.
