I tried to scrape this website, but it doesn't work


It keeps telling me: [error page]

DevTools listening on ws://127.0.0.1:53501/devtools/browser/028c6371-d9c3-4a13-83e1-2d7f598da093
Attempting static scraping for https://secure.ethicspoint.com/domain/en/default_reporter.asp...
No company names found in static content.
Static scraping failed, attempting dynamic scraping for https://secure.ethicspoint.com/domain/en/default_reporter.asp...
Error in dynamic scraping: Message: no such element: Unable to locate element: {"method":"tag name","selector":"select"}
  (Session info: chrome=129.0.6668.101); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
        GetHandleVerifier [0x00AA5523+24195]
        (No symbol) [0x00A3AA04]
        (No symbol) [0x00932093]
        (No symbol) [0x00976ED2]
        (No symbol) [0x0097711B]
        (No symbol) [0x009B76F2]
        (No symbol) [0x0099AB84]
        (No symbol) [0x009B5280]
        (No symbol) [0x0099A8D6]
        (No symbol) [0x0096BA27]
        (No symbol) [0x0096C43D]
        GetHandleVerifier [0x00D6CE13+2938739]
        GetHandleVerifier [0x00DBEC69+3274185]
        GetHandleVerifier [0x00B309C2+594722]
        GetHandleVerifier [0x00B37EDC+624700]
        (No symbol) [0x00A437CD]
        (No symbol) [0x00A40528]
        (No symbol) [0x00A406C5]
        (No symbol) [0x00A32CA6]
        BaseThreadInitThunk [0x7648FCC9+25]
        RtlGetAppContainerNamedObjectPath [0x779C80CE+286]
        RtlGetAppContainerNamedObjectPath [0x779C809E+238]

No companies found on https://secure.ethicspoint.com/domain/en/default_reporter.asp.

Attempting static scraping for https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization...

Companies found on https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization:
- Select your location
- Albania
- Andorra
- Angola
- Antigua and Barbuda
- Argentina

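These are the two websites: (https://secure.ethicspoint.com/domain/en/default_reporter.asp) (https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization)

I tried static scraping with requests and dynamic scraping with Selenium, but I keep running into the same problem: either I get an error page, or I cannot extract the elements I need (for example, the dropdown with the company names).

Here is the code I used: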

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

def static_scrape(url):
   
    try:
       
        response = requests.get(url)

        if response.status_code == 200:
            
            soup = BeautifulSoup(response.text, 'html.parser')

            options = soup.find_all('option')

            if options:
                companies = [option.text.strip() for option in options if option.text.strip()]
                return companies
            else:
                print("No company names found in static content.")
                return None
        else:
            print(f"Failed to retrieve webpage. Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error in static scraping: {e}")
        return None

def dynamic_scrape(url):
   
    try:
        
        driver.get(url)

        
        time.sleep(5)

        dropdown = driver.find_element(By.TAG_NAME, 'select')

        
        options = dropdown.find_elements(By.TAG_NAME, 'option')

       
        companies = [option.text.strip() for option in options if option.text.strip()]

        return companies

    except Exception as e:
        print(f"Error in dynamic scraping: {e}")
        return None

def scrape_companies(url):
    
    print(f"Attempting static scraping for {url}...")
    companies = static_scrape(url)

    if companies is None:
        print(f"Static scraping failed, attempting dynamic scraping for {url}...")
        companies = dynamic_scrape(url)

    return companies


urls = [
    'https://secure.ethicspoint.com/domain/en/default_reporter.asp',
    'https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization'
]


for url in urls:
    companies = scrape_companies(url)

    if companies:
        print(f"\nCompanies found on {url}:")
        for company in companies:
            print(f"- {company}")
    else:
        print(f"No companies found on {url}.\n")


driver.quit()
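
Before choosing a selector, it can help to check which form controls the fetched HTML actually contains. In the log above, collecting every `<option>` on the Convercent page returned country names rather than companies, and Selenium found no `<select>` on the EthicsPoint page. The following is a minimal sketch using only the standard library's `html.parser`. `SAMPLE_HTML`, `FormControlLister`, and `list_form_controls` are hypothetical names for illustration, and the sample markup is an assumption that does not reflect the real markup of either site.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; not taken from either site.
SAMPLE_HTML = """
<form>
  <input type="text" name="CompanyName">
  <select name="country">
    <option value="">Select your location</option>
    <option value="AL">Albania</option>
  </select>
</form>
"""


class FormControlLister(HTMLParser):
    """Collect <input> names and the <option> texts of each <select>."""

    def __init__(self):
        super().__init__()
        self.inputs = []    # names of <input> elements, in document order
        self.selects = {}   # <select> name -> list of its option texts
        self._current_select = None
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            self.inputs.append(attrs.get("name", ""))
        elif tag == "select":
            self._current_select = attrs.get("name", "")
            self.selects[self._current_select] = []
        elif tag == "option" and self._current_select is not None:
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "select":
            # Leaving the dropdown also ends any option with an omitted end tag.
            self._current_select = None
            self._in_option = False
        elif tag == "option":
            self._in_option = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_option and text:
            self.selects[self._current_select].append(text)


def list_form_controls(html):
    """Return (input names, {select name: option texts}) for an HTML string."""
    parser = FormControlLister()
    parser.feed(html)
    return parser.inputs, parser.selects


inputs, selects = list_form_controls(SAMPLE_HTML)
print("inputs:", inputs)
print("selects:", selects)
```

Passing `response.text` from `static_scrape` to `list_form_controls` would show whether a page offers a company dropdown at all, or only a text box that has to be submitted as a search.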
python html selenium-webdriver web-scraping
1 Answer

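After reading your comment (you said you want the names of the companies that use this service), I found a lead for getting the company names. Both sites expose a company search that can be queried with a short text fragment, so the code sends a list of fragments and collects the names that come back.

Note: I used the requests library to speed things up (in this case it is faster than Selenium).

Sample code: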

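import requests
from bs4 import BeautifulSoup

# A set drops duplicates that show up for several keywords or on both sites.
companies = set()

# Target of the EthicsPoint company search form.
url = 'https://secure.ethicspoint.com/domain/en/report_company.asp'

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0"
}

# Short search fragments; each one is sent to both sites.
keywords = ['ar', 'br', 'cr', 'dr', 'ag', 'bc', 'bg', 'er', 'fr', 'ge', 'fa', 'ba', 'ca', 'da']
for keyword in keywords:
    # Convercent: the search endpoint returns JSON entries with a 'name' field.
    results = requests.get(f'https://app.convercent.com/en-us/Anonymous/IssueIntake/GetLegalEntity?searchText={keyword}').json()
    for entry in results:
        companies.add(entry['name'])

    # EthicsPoint: submit the search form and read the matches from the <label> elements.
    data = f'CompanyName={keyword}&submit=Submit+Information&command=submit'
    soup = BeautifulSoup(requests.post(url, data=data, headers=headers).text, 'lxml')
    for label in soup.find_all('label'):
        # Skip the label of the search box itself; .get() avoids a KeyError
        # on labels that have no 'for' attribute.
        if 'CompanyName' not in label.get('for', ''):
            companies.add(label.text)

for n, name in enumerate(companies):
    print(f"No: {n} | Company: {name}")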

Output:

No: 0 | Company: The Hongkong and Shanghai Hotels Limited
No: 1 | Company: VICSA (Vidrieras Canaria)
No: 2 | Company: Bahama & Co.
No: 3 | Company: Apotex (including Apobiologix)
No: 4 | Company: MTE - Crossville
No: 5 | Company: Canichiddeusi Wind S.r.l.
No: 6 | Company: A.E. Finley YMCA (YMCA of the Triangle Area)
No: 7 | Company: SAP Labs, LLC, Palo Alto, CA, United States 
No: 8 | Company: Canal Analytics
No: 9 | Company: WSI (BGIS Global Integrated Solutions)
No: 10 | Company: Mobile Data Technologies Ltd.
No: 11 | Company: Pluralsight
No: 12 | Company: Faradyne Motors LLC
No: 13 | Company: Art & Commerce
No: 14 | Company: A Fabrica
No: 15 | Company: Breeze-Eastern
No: 16 | Company: Frick
No: 17 | Company: Cadia Valley Operations
No: 18 | Company: Dolby Laboratories, Inc.
No: 19 | Company: DZ-4 Errichtungsgesellschaft mbH
No: 20 | Company: CorePower Yoga
No: 21 | Company: Charles Drew University of Med. and Scs. 
No: 22 | Company: EMS   (Southern California Edison)
No: 23 | Company: SCI FJ PART INVEST France 
No: 24 | Company: ABX Converting Acquisition, LLC
No: 25 | Company: Sybase, Inc., San Ramon, CA, United States 

Let me know whether this is what you wanted or something else.
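
The keyword list in the answer is hand-picked, so it only covers a small part of the possible matches, and the form body is assembled by string formatting, which would break for a keyword containing `&` or a space. The sketch below shows one possible way to generate every two-letter fragment and to URL-encode the same three form fields with the standard library. `two_letter_keywords` and `build_form_body` are hypothetical helper names, and whether either site accepts all of these fragments, or tolerates that many requests, is an assumption that has not been verified.

```python
from itertools import product
from string import ascii_lowercase
from urllib.parse import urlencode


def two_letter_keywords():
    """Return every two-letter lowercase combination, from 'aa' to 'zz'."""
    return ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]


def build_form_body(keyword):
    """URL-encode the three form fields used in the answer's POST request."""
    return urlencode({
        "CompanyName": keyword,
        "submit": "Submit Information",
        "command": "submit",
    })


keywords = two_letter_keywords()
print(len(keywords), keywords[:3], keywords[-1])
print(build_form_body("ar"))
print(build_form_body("a&b"))
```

With these helpers, the answer's loop could iterate over `two_letter_keywords()` and pass `build_form_body(keyword)` as `data`. Adding a short `time.sleep` between requests would keep the load on both sites low.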
