Python Selenium cannot find DOM elements


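I am trying to automatically read data from a website that uses custom-named tags such as <c-practitioner-search ... >. I tried treating them as ordinary tags and calling find_elements(), but that did not work.

Here is part of the HTML: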

<div data-item-id="256b3a7e-2963-49b0-9841-1397b03c7f02">
  <div data-priority class="ui-widget"> 
    <c-practitioner-search data-data-rendering-service-uid="40">
      <div c-practitionersearch_practitionersearch="">
        <div c-practitionersearch_practitionersearch="">
          <lightning-layout c-practitionersearch_practitionersearch="" class="bread-crumbs-style slds-grid">
            <slot lwc-4p3ig5mhdla class="slds-slot">
              <lightning-layout-item c-practitionersearch_practitionersearch>
                <slot>
                  <a c-practitionersearch_practitionersearch class="search-result-name-text-style">  Title </a>
                  <small c-practitionersearch_practitionersearch>
                    <ul c-practitionersearch_practitionersearch class="browser-default">
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 1 </li>
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 2 </li>
                      <li c-practitionersearch_practitionersearch class="search-result-text-style"> Info 3 </li>
                    </ul>
                  </small>
                </slot>
              </lightning-layout-item>
            </slot>
          </lightning-layout>
        </div>
      </div>
    </c-practitioner-search>
  </div>
</div>

Tags such as <c-practitioner-search> and <div c-practitionersearch_practitionersearch> are never returned, so I cannot work with them.

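So how can I extract information from markup like this?

In case you need the original website: https://bams.vba.vic.gov.au/bams/s/practitioner-search

Because the site is dynamic, tools such as bs4 are of no use on their own.

I tried looping over every element in the body, but I only see about 20% of them. Where are the other elements that I can see in the browser inspector?

Here is the part of my code that lists all the elements: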

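import pathlib
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class PractitionerScraper:
    def __init__(self, url: str):
        main_path = pathlib.Path().absolute()

        # chromedriver.exe is expected in the current working directory
        driver_path = main_path / "chromedriver.exe"
        service = Service(executable_path=str(driver_path))

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])

        self.driver = webdriver.Chrome(options=chrome_options, service=service)
        self.driver.implicitly_wait(4)
        self.driver.get(url)

    def search(self, time_to_load: int):
        # Fixed pause to give the page time to render
        time.sleep(time_to_load)
        main_div = self.driver.find_element(By.CSS_SELECTOR, "body div[class*='main']")

        # Print the tag name of every descendant of the main container
        all_tags = main_div.find_elements(By.CSS_SELECTOR, "*")
        for tag in all_tags:
            print(tag.tag_name)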

Tags: python selenium-webdriver web-scraping

1 answer

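In my case, these elements only appear after the "Search" button has been clicked.

Here is code that uses webdriver to extract all the titles; you can extend it to collect the other details as well.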

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://bams.vba.vic.gov.au/bams/s/practitioner-search'
driver = webdriver.Chrome()
driver.get(url)

# The result elements are only rendered after "Search" is clicked
elem_search = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[text()="Search"]')))
elem_search.click()

all_items = []

# Getting items from the first page
elem_item = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="search-result-name-text-style"]')))
for item in elem_item:
    all_items.append(item.text)

# Iterating over the other 9 pages
for i in range(2, 11):
    # Wait until the numbered page button can be clicked
    elem_next_page = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, f'//button[text()="{i}"]')))
    elem_next_page.click()
    # Caveat: if the page has not re-rendered yet, this wait may still match the previous page's links
    elem_item = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="search-result-name-text-style"]')))
    for item in elem_item:
        all_items.append(item.text)

driver.quit()
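
As a rough sketch of how the loop could be extended to the other details, the helper below pairs each title with the list items that follow it. It assumes every result keeps the structure shown in the question's HTML (an <a> followed by a sibling <small> containing <li> entries), which may not hold on the live site. The name describe_result is hypothetical, and the locator strategy is passed as the plain string "xpath", which is the value of Selenium's By.XPATH.

```python
def describe_result(anchor):
    """Return the title and info lines for one search result.

    `anchor` is expected to behave like the Selenium WebElement for an
    <a class="search-result-name-text-style"> node (assumed structure).
    """
    # "xpath" is the string value of selenium's By.XPATH
    items = anchor.find_elements("xpath", "./following-sibling::small//li")
    return {
        "title": anchor.text.strip(),
        "info": [item.text.strip() for item in items],
    }
```

Inside the loops above, all_items.append(describe_result(item)) would then store one dictionary per result instead of only the title text.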

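However, for tasks like this I usually look at the API calls the browser makes, because the required data can sometimes be fetched directly with a POST request. That turns out to work for this site. Take a look at this code: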

import requests

# Request body as captured from the browser; %s is replaced by the page number
msg = ('{"actions":[{"id":"130;a","descriptor":"aura://ApexActionController/ACTION$execute",'
       '"callingDescriptor":"UNKNOWN","params":{"namespace":"","classname":"PractitionerSearchUtil",'
       '"method":"getPractitioners","params":{"searchParamWrapper":{"practitionerName":"",'
       '"registrationCategory":"","registrationClass":"","accreditationType":"Building","pageNumber":%s}},'
       '"cacheable":false,"isContinuation":false}}]}')

# Context as captured from the browser; the fwuid and "loaded" hashes may change when the site is updated
context = ('{"mode":"PROD","fwuid":"ZDROWDdLOGtXcTZqSWZiU19ZaDJFdzk4bkk0bVJhZGJCWE9mUC1IZXZRbmcyNDguMTAuNS01LjAuMTA",'
           '"app":"siteforce:communityApp","loaded":{"APPLICATION@markup://siteforce:communityApp":"7766VpxH8B5ZgC8Vrgi-bQ",'
           '"COMPONENT@markup://instrumentation:o11ySecondaryLoader":"nSN3-Xh18FbrdCVGqsWZnw"},"dn":[],"globals":{},"uad":false}')

query_url = 'https://bams.vba.vic.gov.au/bams/s/sfsites/aura?r=10&aura.ApexAction.execute=1'

output = []
for i in range(1, 11):
    data = {'message': msg % i,
            'aura.context': context,
            'aura.pageURI': '/bams/s/practitioner-search',
            'aura.token': 'null'}

    req = requests.post(query_url, data=data)
    req.raise_for_status()  # stop early on an HTTP error

    result = req.json()['actions'][0]['returnValue']['returnValue']['PractitionerDetailList']
    output.extend(result)
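
The chained lookup on the response raises a KeyError or TypeError as soon as one level is missing, for example if the server answers with an error payload. A minimal sketch of a more forgiving lookup is shown below; extract_practitioners is a hypothetical helper name, and the key path is taken from the code above rather than from any documented API.

```python
def extract_practitioners(payload):
    """Return the PractitionerDetailList from a decoded response, or [] if it is missing."""
    node = payload
    # Key path as used in the request loop: actions[0].returnValue.returnValue.PractitionerDetailList
    for key in ("actions", 0, "returnValue", "returnValue", "PractitionerDetailList"):
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            return []
    return node if isinstance(node, list) else []
```

With this in place, the loop body could call extract_practitioners(req.json()) and skip pages that come back empty instead of failing.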

The output looks like this:

[{'accountId': '0015m000005zlweAAA', 'accreditationType': 'Building', 'detailURL': 'https://bams.vba.vic.gov.au/bams/s/practitioner-detail?inputParams=4kJrFwhH8NkDASSlvhB1OLm0bPtCHoI89qyIP085njkBEls14yPsUa2Rz31qgadG', 'haveFilteredVICAccreditations': False, 'haveNoAMRAccreditations': False, 'haveNoVICAccreditations': False, 'haveOnlySuspendedEndedAMRAccreditations': False, 'haveUnfilteredVICAccreditations': False, 'isADR': False, 'practitionerId': '#067547', 'practitionerName': '1 HOMES PTY LTD', 'registrationCategoryWithClass': 'Domestic Builder Company - Domestic Builder - Unlimited', 'registrationClass': 'Domestic Builder - Unlimited', 'registrationId': 'a1O5m000000Y1YuEAK', 'registrationNumber': 'CDB-U 73774', 'registrationType': 'Victorian practitioner', 'status': 'Current', 'statusStyleClassName': 'Current'}]
...
...

Note, however, that the search results are limited to 500 items, so you will not get every result. Filtering with the searchParamWrapper in msg may be one way around this.
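
To experiment with filtering around the 500-item limit, the message can be built from a dictionary instead of a format string. The sketch below reuses the field names from msg; build_message is a hypothetical helper, and which filter values the server actually accepts is an assumption that would need to be checked against the browser's requests.

```python
import json


def build_message(page_number, **filters):
    """Build the 'message' form field with optional searchParamWrapper filters."""
    # Defaults copied from the msg string in the answer
    search_params = {
        "practitionerName": "",
        "registrationCategory": "",
        "registrationClass": "",
        "accreditationType": "Building",
        "pageNumber": page_number,
    }
    # Reject names that do not appear in the original searchParamWrapper
    unknown = set(filters) - set(search_params)
    if unknown:
        raise ValueError(f"unknown search parameters: {sorted(unknown)}")
    search_params.update(filters)

    action = {
        "id": "130;a",
        "descriptor": "aura://ApexActionController/ACTION$execute",
        "callingDescriptor": "UNKNOWN",
        "params": {
            "namespace": "",
            "classname": "PractitionerSearchUtil",
            "method": "getPractitioners",
            "params": {"searchParamWrapper": search_params},
            "cacheable": False,
            "isContinuation": False,
        },
    }
    return json.dumps({"actions": [action]})
```

For example, build_message(1, practitionerName="HOMES") could be passed as the 'message' value in place of msg % i, running one query per name fragment so that each query stays under the limit.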