I'm scraping a website that uses infinite scroll and I'm wondering what the best approach is:
Option 1: scrape and scroll (repeat)
Issues:
Option 2: scroll all the way, then scrape (everything at once)
Issues:
I already have working code for Option 2, and I'm curious whether Option 1 would also work, and what the pros/cons of each are.
Thanks.
I tried Option 2 and it works:
Option 2: scroll all the way, then scrape (everything at once)
Additional information:
The scroll function:
import time

def scroll_to_bottom(driver):
    # Scroll to the bottom of the page using JavaScript
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # Adjust the sleep time as needed
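A fixed sleep works, but a common refinement is to keep scrolling until the page height stops growing, so you neither under-scroll nor wait longer than needed. A minimal sketch (the helper name and round limit are my own, not from the code above; it only uses the same `execute_script` calls):

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=30):
    """Scroll down until the document height stops growing, or until
    max_rounds is reached. Returns the final page height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content was loaded
            break
        last_height = new_height
    return last_height
```

The `max_rounds` cap guards against feeds that genuinely never end.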
The extraction function:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def Site_Extract_Cal_Event(local_webdriver: webdriver, local_event_webdriver: webdriver, url_site: str, iteration: int):
    global df
    local_webdriver.get(url_site)
    logger.debug('BKRM: Extract Data from current page')
    logger.debug('BKRM: Extract URLs')
    # Scroll the infinite page a few times
    # (note: range(1, iteration) scrolls iteration - 1 times)
    for index in range(1, iteration):
        logger.debug('BKRM: scrolling [' + str(index) + ']')
        scroll_to_bottom(local_webdriver)
    # Search for the WebElement of each event
    xpath_event = '//*[starts-with(@id, "ep-")]'
    div_elements = local_webdriver.find_elements("xpath", xpath_event)
    element_index = 1
    for element in div_elements:
        element_index_text = "{:02d}".format(element_index)
        logger.debug("ELEMENT [" + element_index_text + "] : ")
        # Search for the time of the event
        xpath_time = ".//time"
        run_time = element.find_element("xpath", xpath_time)
        logger.debug("BKRM: " + "Time: " + run_time.text)
        # Search for the URL of the full event description
        # (used to extract the organiser)
        xpath_meetup_url_link = ".//a[@class='flex h-full flex-col justify-between space-y-5 outline-offset-8 hover:no-underline']"
        run_meetup_url_link = element.find_element("xpath", xpath_meetup_url_link)
        run_meetup_url_link_text = run_meetup_url_link.get_attribute("href")
        logger.debug("BKRM: " + "Meetup url link: " + run_meetup_url_link_text)
        # Search for the event title
        xpath_title = './/span[@class="ds-font-title-3 block break-words leading-7 utils_cardTitle__lbnC_ text-gray6"]'
        run_title = element.find_element("xpath", xpath_title)
        logger.debug("BKRM: " + "Title: " + run_title.text)
        # Search for the attendee number
        # (some events list no attendees)
        xpath_attendee_number = ".//span[@class='hidden sm:inline']"
        try:
            run_attendee_number = element.find_element("xpath", xpath_attendee_number)
            run_attendee_number_text_temp = run_attendee_number.text
            an_text = run_attendee_number_text_temp.split()
            run_attendee_number_text = an_text[0]
            logger.debug("BKRM: " + "Attendee Number: " + run_attendee_number_text)
        except NoSuchElementException:
            run_attendee_number_text = "0"
        # Search for the organizer name in the detailed event description
        run_organizer = Site_Extract_Event_Details(local_event_webdriver, run_meetup_url_link_text)
        # run_organizer = "BKK RUNNERS"
        element_index = element_index + 1
        # Create a new record to add
        new_record = {'event_site': BKRM_SITE,
                      'event_date': run_time.text,
                      'event_title': run_title.text,
                      'event_organizer': run_organizer,
                      'event_attendee_number': run_attendee_number_text,
                      'event_url': run_meetup_url_link_text}
        # Append the new record to the DataFrame
        # (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
        df = pd.concat([df, pd.DataFrame([new_record])], ignore_index=True)
        logger.debug('Adding Event: ' + run_time.text + " / " + run_title.text)
Could I scrape the same data twice?
That depends entirely on the site you are scraping: some sites render each item only once, while others repeat items as you scroll. So the safest approach is to deduplicate the scraped results as a post-processing step, to end up with only unique records.
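If duplicates can appear, a pandas post-processing step keyed on the event URL (assuming each event has a unique URL, as the records in the code above do) could look like this:

```python
import pandas as pd

def dedupe_events(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate events, keeping the first occurrence of each event_url."""
    return df.drop_duplicates(subset=["event_url"], keep="first").reset_index(drop=True)
```

Keying on the URL rather than the whole row also catches rows where, say, the attendee count changed between two scrapes of the same event.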
Will I lose data?
That depends on how many scrolls you perform. If you do a limited number of scrolls and then extract, the chance of losing data is small. If you scroll many times before extracting, the Selenium-driven browser may become sluggish or crash after a while, and you risk losing data.
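For Option 1 (scrape, then scroll, repeat), the loop can dedupe as it goes with a seen-set and stop once a scroll yields nothing new. A driver-agnostic sketch; `extract_cards` and `scroll` are hypothetical callbacks standing in for your own extraction and scrolling code:

```python
def scrape_and_scroll(driver, extract_cards, scroll, max_rounds=10):
    """Option 1 sketch: scrape the visible cards, scroll once, repeat.

    extract_cards(driver) -> list of (url, record) pairs currently visible
    scroll(driver)        -> performs one scroll step
    (both are hypothetical helpers supplied by the caller)
    """
    seen, records = set(), []
    for _ in range(max_rounds):
        new = 0
        for url, record in extract_cards(driver):
            if url not in seen:  # dedupe by URL across rounds
                seen.add(url)
                records.append(record)
                new += 1
        if new == 0:  # a scroll produced no fresh cards -> assume end of feed
            break
        scroll(driver)
    return records
```

The upside over Option 2 is that partial results survive a mid-run browser crash; the downside is re-walking already-seen cards on every round.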
I'm scraping a website that uses infinite scroll and I'm wondering what the best approach is:
If you have the option of not using Selenium, it is better to identify the background requests the site issues as you scroll, and make those requests directly. The reason I don't recommend Selenium is that the browser or the network can fail, and managing Selenium at scale is a hassle; after many scrolls the browser may well crash. If we know the background requests, we can issue them one at a time instead.
To find those background requests, open the browser's network debugger (the Network tab in DevTools) and work out which request returns the data you need.
Since the target website has not been disclosed, I can't give more specific information about its background requests.
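Once a paginated JSON endpoint is spotted in the Network tab, replaying it can be as simple as the sketch below. The URL shape and the `page` / `events` / `has_next` field names are pure assumptions for illustration; the real names depend on the site. The `fetch` parameter exists so the HTTP call can be swapped out:

```python
import json
from urllib.request import Request, urlopen  # stdlib; the requests library also works

def fetch_pages(base_url, max_pages=5, fetch=None):
    """Replay a hypothetical paginated JSON endpoint discovered in DevTools.

    Assumes the endpoint takes a ?page=N query parameter and returns
    {"events": [...], "has_next": bool} -- adjust to what you actually see.
    """
    if fetch is None:
        def fetch(url):
            req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urlopen(req) as resp:
                return json.load(resp)
    events = []
    for page in range(1, max_pages + 1):
        data = fetch(f"{base_url}?page={page}")
        events.extend(data.get("events", []))
        if not data.get("has_next"):  # server says there is no next page
            break
    return events
```

Compared with driving a browser, this issues one small request per page and fails loudly (an HTTP error) instead of silently losing scrolled-past content.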