It keeps showing me: [error page]
DevTools listening on ws://127.0.0.1:53501/devtools/browser/028c6371-d9c3-4a13-83e1-2d7f598da093
Attempting static scraping for https://secure.ethicspoint.com/domain/en/default_reporter.asp...
No company names found in static content.
Static scraping failed, attempting dynamic scraping for https://secure.ethicspoint.com/domain/en/default_reporter.asp...
Error in dynamic scraping: Message: no such element: Unable to locate element: {"method":"tag name","selector":"select"}
(Session info: chrome=129.0.6668.101); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
No companies found on https://secure.ethicspoint.com/domain/en/default_reporter.asp.
Attempting static scraping for https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization...
Companies found on https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization:
- Select your location
- Albania
- Andorra
- Angola
- Antigua and Barbuda
- Argentina
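Note that the values above ("Select your location", country names) come from the site's location dropdown, not a company list: `soup.find_all('option')` collects options from every `<select>` on the page. A small filter that drops empty and placeholder entries might help; the `placeholder_prefixes` default here is my own assumption, not something from the original code:

```python
def filter_placeholder_options(options, placeholder_prefixes=("Select",)):
    """Drop empty strings and placeholder entries like 'Select your location'."""
    return [
        text.strip()
        for text in options
        if text.strip() and not text.strip().startswith(placeholder_prefixes)
    ]

print(filter_placeholder_options(["Select your location", "Albania", "", "Andorra"]))
# → ['Albania', 'Andorra']
```

A more reliable fix is to target the specific dropdown (by `id` or `name`) rather than the first `<select>`/all `<option>` tags on the page.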
These are the two websites: (https://secure.ethicspoint.com/domain/en/default_reporter.asp) (https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization)
I tried static scraping with requests and dynamic scraping with Selenium, but I keep running into the same problem: either I get an error page, or I can't extract the elements I need (e.g., the dropdown with the company names).
Here is the code I used:
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

def static_scrape(url):
    """Fetch the page with requests and read <option> texts from the raw HTML."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            options = soup.find_all('option')
            if options:
                return [option.text.strip() for option in options if option.text.strip()]
            print("No company names found in static content.")
            return None
        print(f"Failed to retrieve webpage. Status code: {response.status_code}")
        return None
    except Exception as e:
        print(f"Error in static scraping: {e}")
        return None

def dynamic_scrape(url):
    """Render the page in Chrome and read the first <select> dropdown."""
    try:
        driver.get(url)
        time.sleep(5)  # crude wait for JavaScript to finish rendering
        dropdown = driver.find_element(By.TAG_NAME, 'select')
        options = dropdown.find_elements(By.TAG_NAME, 'option')
        return [option.text.strip() for option in options if option.text.strip()]
    except Exception as e:
        print(f"Error in dynamic scraping: {e}")
        return None

def scrape_companies(url):
    print(f"Attempting static scraping for {url}...")
    companies = static_scrape(url)
    if companies is None:
        print(f"Static scraping failed, attempting dynamic scraping for {url}...")
        companies = dynamic_scrape(url)
    return companies

urls = [
    'https://secure.ethicspoint.com/domain/en/default_reporter.asp',
    'https://app.convercent.com/en-us/Anonymous/IssueIntake/IdentifyOrganization',
]

for url in urls:
    companies = scrape_companies(url)
    if companies:
        print(f"\nCompanies found on {url}:")
        for company in companies:
            print(f"- {company}")
    else:
        print(f"No companies found on {url}.\n")

driver.quit()
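One common reason a plain `requests.get` comes back with an error page is that the site rejects the default `python-requests` user agent. A minimal sketch of sending browser-like headers (shown with stdlib `urllib` so it is self-contained; the same `headers` dict can be passed to `requests.get(url, headers=headers)`) — the header values are just an example, not known requirements of these sites:

```python
import urllib.request

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

# Build the request without sending it; get_header shows the value that
# would go out on the wire.
req = urllib.request.Request(
    "https://secure.ethicspoint.com/domain/en/default_reporter.asp",
    headers=headers,
)
print(req.get_header("User-agent"))
```

If the dropdown still can't be found even in Selenium, it is likely rendered late by JavaScript or sits inside an iframe; an explicit `WebDriverWait` with `expected_conditions.presence_of_element_located` (instead of `time.sleep(5)`) and `driver.switch_to.frame(...)` are worth trying before giving up.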
After seeing your comment (you said you wanted to know the names of the companies using these services), I found a way to get the company names; here is sample code:
Note: I used the requests library to speed things up (it is faster than Selenium in this case)
import requests
from bs4 import BeautifulSoup

company = set()
url = 'https://secure.ethicspoint.com/domain/en/report_company.asp'
header = {
    "Content-Type": "application/x-www-form-urlencoded",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0",
}
list_of_keyword = ['ar', 'br', 'cr', 'dr', 'ag', 'bc', 'bg', 'er', 'fr', 'ge', 'fa', 'ba', 'ca', 'da']

for keyword in list_of_keyword:
    # Convercent exposes a JSON search endpoint for legal entities.
    site_2 = requests.get(
        f'https://app.convercent.com/en-us/Anonymous/IssueIntake/GetLegalEntity?searchText={keyword}'
    ).json()
    for entity in site_2:
        company.add(entity['name'])

    # EthicsPoint returns matches as <label> elements after a form POST.
    data = f'CompanyName={keyword}&submit=Submit+Information&command=submit'
    soup = BeautifulSoup(requests.post(url, data=data, headers=header).text, 'lxml')
    for label in soup.find_all('label'):
        if 'CompanyName' not in label['for']:
            company.add(label.text)

for n, name in enumerate(company):
    print(f"No: {n} | Company: {name}")
No: 0 | Company: The Hongkong and Shanghai Hotels Limited
No: 1 | Company: VICSA (Vidrieras Canaria)
No: 2 | Company: Bahama & Co.
No: 3 | Company: Apotex (including Apobiologix)
No: 4 | Company: MTE - Crossville
No: 5 | Company: Canichiddeusi Wind S.r.l.
No: 6 | Company: A.E. Finley YMCA (YMCA of the Triangle Area)
No: 7 | Company: SAP Labs, LLC, Palo Alto, CA, United States
No: 8 | Company: Canal Analytics
No: 9 | Company: WSI (BGIS Global Integrated Solutions)
No: 10 | Company: Mobile Data Technologies Ltd.
No: 11 | Company: Pluralsight
No: 12 | Company: Faradyne Motors LLC
No: 13 | Company: Art & Commerce
No: 14 | Company: A Fabrica
No: 15 | Company: Breeze-Eastern
No: 16 | Company: Frick
No: 17 | Company: Cadia Valley Operations
No: 18 | Company: Dolby Laboratories, Inc.
No: 19 | Company: DZ-4 Errichtungsgesellschaft mbH
No: 20 | Company: CorePower Yoga
No: 21 | Company: Charles Drew University of Med. and Scs.
No: 22 | Company: EMS (Southern California Edison)
No: 23 | Company: SCI FJ PART INVEST France
No: 24 | Company: ABX Converting Acquisition, LLC
No: 25 | Company: Sybase, Inc., San Ramon, CA, United States
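The hand-picked `list_of_keyword` only probes a few prefixes, so the result set above is partial. To sweep every two-letter prefix you could generate them all; this is only a sketch — whether the endpoints tolerate ~676 requests is an assumption on my part, so add delays and stop early if you get rate-limited:

```python
import itertools
import string

# All 676 two-letter search prefixes ('aa', 'ab', ..., 'zz'); feed each one
# to the searchText / CompanyName queries in place of list_of_keyword.
prefixes = ["".join(pair) for pair in itertools.product(string.ascii_lowercase, repeat=2)]

print(len(prefixes))   # 676
print(prefixes[:3])    # ['aa', 'ab', 'ac']
```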
Let me know if this is what you wanted, or if you need something else.