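I'm trying to automate reading data from a website that uses tags with unusual names, such as <c-practitioner-search ... >. I tried treating them as ordinary tags and using find_elements(), but that didn't work.
Here is part of the HTML: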
<div data-item-id="256b3a7e-2963-49b0-9841-1397b03c7f02">
  <div data-priority class="ui-widget">
    <c-practitioner-search data-data-rendering-service-uid="40">
      <div c-practitionersearch_practitionersearch="">
        <div c-practitionersearch_practitionersearch="">
          <lightning-layout c-practitionersearch_practitionersearch="" class="bread-crumbs-style slds-grid">
            <slot lwc-4p3ig5mhdla class="slds-slot">
              <lightning-layout-item c-practitionersearch_practitionersearch>
                <slot>
                  <a c-practitionersearch_practitionersearch class="search-result-name-text-style"> Title </a>
                  <small c-practitionersearch_practitionersearch>
                    <ul c-practitionersearch_practitionersearch class="browser-default">
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 1 </li>
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 2 </li>
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 3 </li>
                    </ul>
                  </small>
                </slot>
              </lightning-layout-item>
            </slot>
          </lightning-layout>
        </div>
      </div>
    </c-practitioner-search>
  </div>
</div>
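Tags such as <c-practitioner-search> or <div c-practitionersearch_practitionersearch> are not returned and cannot be used.
So how can I get information out of this kind of markup?
In case you need the original website: https://bams.vba.vic.gov.au/bams/s/practitioner-search
Because the site is dynamic, tools such as bs4 are of no use.
I tried looping over every element in the body, but I only see about 20% of the elements. Where are the other elements that I can see in the browser inspector?
Here is the part of my code that lists all the elements: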
import pathlib
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class PractitionerScraper:
    def __init__(self, url: str):
        # chromedriver.exe is expected in the current working directory
        main_path = pathlib.Path().absolute()
        driver_path = main_path / "chromedriver.exe"
        service = Service(executable_path=str(driver_path))
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=chrome_options, service=service)
        self.driver.implicitly_wait(4)
        self.driver.get(url)

    def search(self, time_to_load: int):
        # Give the page time to render, then print the tag name of every descendant
        time.sleep(time_to_load)
        main_div = self.driver.find_element(By.CSS_SELECTOR, "body div[class*='main']")
        all_tags = main_div.find_elements(By.CSS_SELECTOR, "*")
        for tag in all_tags:
            print(tag.tag_name)
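In my case, those elements only show up after the "Search" button has been clicked.
Here is code that uses webdriver to extract all the titles; you can extend it to get all the other details.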
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://bams.vba.vic.gov.au/bams/s/practitioner-search'
driver = webdriver.Chrome()
driver.get(url)

# The results are only rendered after the Search button is clicked
elem_search = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[text()="Search"]')))
elem_search.click()

all_items = []

# Getting items from the first page
elem_item = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="search-result-name-text-style"]')))
for item in elem_item:
    all_items.append(item.text)

# Iterating over the other 9 pages
for i in range(2, 11):
    elem_next_page = driver.find_element(By.XPATH, f'//button[text()="{i}"]')
    elem_next_page.click()
    elem_item = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="search-result-name-text-style"]')))
    for item in elem_item:
        all_items.append(item.text)
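However, for tasks like this I usually look at the API calls the browser makes, because sometimes the required data can be fetched directly with a POST request. That turns out to work for this website. Take a look at this code: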
import requests
import json

# Aura action payload; %s is replaced with the page number
msg = ('{"actions":[{"id":"130;a","descriptor":"aura://ApexActionController/ACTION$execute",'
       '"callingDescriptor":"UNKNOWN","params":{"namespace":"","classname":"PractitionerSearchUtil",'
       '"method":"getPractitioners","params":{"searchParamWrapper":{"practitionerName":"",'
       '"registrationCategory":"","registrationClass":"","accreditationType":"Building","pageNumber":%s}},'
       '"cacheable":false,"isContinuation":false}}]}')

# Aura context, sent as a single string
context = ('{"mode":"PROD","fwuid":"ZDROWDdLOGtXcTZqSWZiU19ZaDJFdzk4bkk0bVJhZGJCWE9mUC1IZXZRbmcyNDguMTAuNS01LjAuMTA",'
           '"app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"7766VpxH8B5ZgC8Vrgi-bQ",'
           '"COMPONENT@markup://instrumentation:o11ySecondaryLoader":"nSN3-Xh18FbrdCVGqsWZnw"},"dn":[],"globals":{},"uad":false}')

query_url = 'https://bams.vba.vic.gov.au/bams/s/sfsites/aura?r=10&aura.ApexAction.execute=1'

output = []
# Request pages 1 to 10 and collect the practitioner records from each response
for i in range(1, 11):
    data = {'message': msg % i,
            'aura.context': context,
            'aura.pageURI': '/bams/s/practitioner-search',
            'aura.token': 'null'}
    req = requests.post(query_url, data=data)
    result = json.loads(req.text)['actions'][0]['returnValue']['returnValue']['PractitionerDetailList']
    output = output + result
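As a side note on the msg template in the code above: the same payload can be assembled from Python dicts and serialized with json.dumps, which makes the searchParamWrapper fields easier to change than editing a long string. This is only a sketch. build_message and FILTER_KEYS are names made up for this example, the field names and default values are copied from the msg string, and whether the server accepts the extra whitespace that json.dumps emits by default has not been verified here.

```python
import json

# Field names copied from the searchParamWrapper in the msg string.
FILTER_KEYS = ("practitionerName", "registrationCategory",
               "registrationClass", "accreditationType")


def build_message(page_number, **filters):
    """Return the JSON string for the 'message' form field (hypothetical helper)."""
    unknown = set(filters) - set(FILTER_KEYS)
    if unknown:
        raise ValueError(f"unknown filter(s): {sorted(unknown)}")

    # Start from the same defaults as the msg string, then apply overrides.
    wrapper = {key: "" for key in FILTER_KEYS}
    wrapper["accreditationType"] = "Building"
    wrapper.update(filters)
    wrapper["pageNumber"] = page_number

    action = {
        "id": "130;a",
        "descriptor": "aura://ApexActionController/ACTION$execute",
        "callingDescriptor": "UNKNOWN",
        "params": {
            "namespace": "",
            "classname": "PractitionerSearchUtil",
            "method": "getPractitioners",
            "params": {"searchParamWrapper": wrapper},
            "cacheable": False,
            "isContinuation": False,
        },
    }
    return json.dumps({"actions": [action]})
```

With this helper, 'message': msg % i in the request loop could become 'message': build_message(i), and a filtered search could be requested with, for example, build_message(1, practitionerName="SMITH").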
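The output list looks like the sample below. Note, however, that the search results are capped at 500 items, so you will not get the full result set. Filtering through the searchParamWrapper in msg may be one way to work around this.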
[{'accountId': '0015m000005zlweAAA', 'accreditationType': 'Building', 'detailURL': 'https://bams.vba.vic.gov.au/bams/s/practitioner-detail?inputParams=4kJrFwhH8NkDASSlvhB1OLm0bPtCHoI89qyIP085njkBEls14yPsUa2Rz31qgadG', 'haveFilteredVICAccreditations': False, 'haveNoAMRAccreditations': False, 'haveNoVICAccreditations': False, 'haveOnlySuspendedEndedAMRAccreditations': False, 'haveUnfilteredVICAccreditations': False, 'isADR': False, 'practitionerId': '#067547', 'practitionerName': '1 HOMES PTY LTD', 'registrationCategoryWithClass': 'Domestic Builder Company - Domestic Builder - Unlimited', 'registrationClass': 'Domestic Builder - Unlimited', 'registrationId': 'a1O5m000000Y1YuEAK', 'registrationNumber': 'CDB-U 73774', 'registrationType': 'Victorian practitioner', 'status': 'Current', 'statusStyleClassName': 'Current'}]
...
...
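One way the filtering workaround could be organized is to run several narrower searches and merge the results, dropping records that show up more than once. The sketch below is an assumption-heavy illustration, not a tested scraper: collect_all and fake_fetch are made-up names, registrationId is assumed to be unique per record (it looks like one in the sample above), and how the server matches practitionerName (prefix, substring, or exact) is not stated in the answer. The real fetch function would wrap the requests.post call and return the PractitionerDetailList for one page.

```python
def collect_all(fetch_page, filter_sets, max_pages=10):
    """Run one paged search per filter set and merge results without duplicates.

    fetch_page(filters, page_number) must return a list of result dicts,
    or an empty list when there is nothing on that page.
    """
    seen = set()
    merged = []
    for filters in filter_sets:
        for page in range(1, max_pages + 1):
            records = fetch_page(filters, page)
            if not records:
                break  # no more pages for this filter set
            for record in records:
                key = record["registrationId"]  # assumed unique per record
                if key not in seen:
                    seen.add(key)
                    merged.append(record)
    return merged


# Stand-in for the real POST request, so the sketch runs offline.
def fake_fetch(filters, page_number):
    pages = {
        ("A", 1): [{"registrationId": "r1"}, {"registrationId": "r2"}],
        ("B", 1): [{"registrationId": "r2"}, {"registrationId": "r3"}],
    }
    return pages.get((filters["practitionerName"], page_number), [])


results = collect_all(fake_fetch, [{"practitionerName": "A"},
                                   {"practitionerName": "B"}])
print([r["registrationId"] for r in results])  # ['r1', 'r2', 'r3']
```

The default of 10 pages mirrors the range used in the answer's loop; any single filter set that still hits the 500-item cap would need to be split further.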