Is there a faster way to do this web scraping without using Selenium?


I have a piece of Python code that feeds a snippet of English text into a website (https://edu.visl.dk/visl/en/parsing/automatic/trees.php) and extracts the syntax tree from the result. So far I have been doing this with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup as bs

# Run Chrome headless so no browser window opens.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
service = Service(executable_path=r"c:\Program Files (x86)\chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)

# Fill in the parsing form and submit it.
driver.get("https://edu.visl.dk/visl/en/parsing/automatic/trees.php")
form = driver.find_element(By.NAME, "theform")
dropdown = form.find_element(By.NAME, "visual")
drop = Select(dropdown)
drop.select_by_visible_text("Vertical")
search = form.find_element(By.NAME, "text")
search.send_keys("John killed the cat with a hammer.")
submit = form.find_element(By.TAG_NAME, "input")
submit.click()

# The parse tree is rendered inside a <pre> element.
results = driver.find_element(By.TAG_NAME, "pre")
soup = str(bs(results.get_attribute("innerHTML"), "html.parser"))

However, I need to do this over and over again (in a loop), and this is too slow for my purposes. Is there a faster way to do it?

python selenium-webdriver web-scraping
1 Answer

Use requests instead:

import requests
from bs4 import BeautifulSoup

# The fields submitted by the page's <form name="theform">.
payload = {
    'text': 'John killed the cat with a hammer.',
    'export': 'Export and Download',
    'parser': 'tree',
    'visual': 'vertical',
    'symbol': 'default',
}

url = 'https://edu.visl.dk/visl/en/parsing/automatic/trees.php'
response = requests.post(url, data=payload)

# The parse tree is returned inside a <pre> element.
soup = BeautifulSoup(response.text, 'html.parser')
result = soup.body.pre.get_text(strip=True)

print(result)
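
If you are calling this in a loop, reusing a single requests.Session can speed things up a little more, since it keeps the HTTP connection to the server open between requests instead of reconnecting every time. A minimal sketch, assuming the same payload fields as above; the sentences list is a hypothetical stand-in for your own input:

import requests
from bs4 import BeautifulSoup

url = 'https://edu.visl.dk/visl/en/parsing/automatic/trees.php'

# Hypothetical input; replace with the sentences you actually need to parse.
sentences = [
    'John killed the cat with a hammer.',
    'The cat sat on the mat.',
]

# A single Session reuses the underlying connection for every request in the loop.
with requests.Session() as session:
    for sentence in sentences:
        payload = {
            'text': sentence,
            'export': 'Export and Download',
            'parser': 'tree',
            'visual': 'vertical',
            'symbol': 'default',
        }
        response = session.post(url, data=payload)
        soup = BeautifulSoup(response.text, 'html.parser')
        print(soup.body.pre.get_text(strip=True))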