How can I use Selenium to get table contents without actually opening a browser? (Python)


I am trying to run a Python program against a web database to pull the results table. My question is: how can I get the results without actually opening a browser? Is there an easier way to get the table contents? I want to get the values from the bottom table (e.g., BCS class, solubility, dose, etc.). (The bottom part of the results has no td.text, so I can't use the find_next_sibling approach.) Thanks! Here is my code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver= webdriver.Chrome()
driver.get('http://www.ddfint.net/search.cfm')
search_form=driver.find_element_by_name('compoundName')
search_form.send_keys('Abacavir')
search_form.submit()

page= BeautifulSoup(driver.page_source, 'html.parser')
page.find("td", text="Lowest Solubility (mg/ml):").find_next_sibling("td").text

Tags: python, selenium, browser
3 Answers

Answer (3 votes):

You can try adding the --headless argument to ChromeOptions, which tells Chrome to run without opening a visible browser window:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
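
Putting this together with the code from the question, a minimal sketch might look like the following; the form field name, the td label, and the Selenium 3 style find_element_by_name call are taken from the original post and are assumed to still match the page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# Configure headless Chrome so no browser window is shown
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Same flow as in the question: submit the search form and parse the result page
driver.get('http://www.ddfint.net/search.cfm')
search_form = driver.find_element_by_name('compoundName')
search_form.send_keys('Abacavir')
search_form.submit()

page = BeautifulSoup(driver.page_source, 'html.parser')
print(page.find("td", text="Lowest Solubility (mg/ml):").find_next_sibling("td").text)
driver.quit()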

Answer (2 votes):

You can use the requests module instead. There is one problem when getting the BCS Class (logP) text: the HTML is broken in both BCS class table cells. The fix is to use html5lib as the parser.

import requests
from bs4 import BeautifulSoup
import re

headers = {
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Origin': 'http://www.ddfint.net',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/77.0.3865.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3',
    'Referer': 'http://www.ddfint.net/search.cfm',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}

data = {
    'compoundName': 'Abacavir',
    'category': '',
    'subcategory': '',
    'Submit': 'Search'
}

response = requests.post('http://www.ddfint.net/results.cfm', headers=headers, data=data, verify=False)
page = BeautifulSoup(response.text, 'html5lib')
print(page.find("td", text="Lowest Solubility (mg/ml):").find_next_sibling("td").text)

# get header row
header_row = page.find("td", text="Country List:").find_parent("tr")
# get columns names, remove : and *
header_data = [re.sub("[:*]", "", td.text.strip()) for td in header_row.find_all("td")]

country_index = header_data.index("Country List")
solubility_index = header_data.index("Solubility")
bcs_class_clogp_index = header_data.index("BCS Class (cLogP)")
bcs_class_logp_index = header_data.index("BCS Class (logP)")

row = header_row
while True:
    # check if next row exist
    row = row.find_next_sibling("tr")
    if not row:
        break

    # collect row data
    row_data = [td.text.strip() for td in row.find_all("td")]
    print(row_data[country_index], row_data[solubility_index],
          row_data[bcs_class_clogp_index], row_data[bcs_class_logp_index])

Code with pandas:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

headers = {
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Origin': 'http://www.ddfint.net',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/77.0.3865.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3',
    'Referer': 'http://www.ddfint.net/search.cfm',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}

data = {
    'compoundName': 'Abacavir',
    'category': '',
    'subcategory': '',
    'Submit': 'Search'
}

response = requests.post('http://www.ddfint.net/results.cfm', headers=headers, data=data, verify=False)
page = BeautifulSoup(response.text, 'html5lib')
print(page.find("td", text="Lowest Solubility (mg/ml):").find_next_sibling("td").text)

# get header row
header_row = page.find("td", text="Country List:").find_parent("tr")
# get columns names, and remove : and *
header_data = [re.sub("[:*]", "", td.text.strip()) for td in header_row.find_all("td")]

# loop while there's row after header row
data = []
row = header_row
while True:
    # check if next row exist
    row = row.find_next_sibling("tr")
    if not row:
        break

    # collect row data
    row_data = [td.text.strip() for td in row.find_all("td")]
    data.append(row_data)

# create data frame
df = pd.DataFrame(data, columns=header_data)
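
If you only need the columns mentioned in the question, you can then slice the resulting DataFrame; a small sketch, assuming the cleaned header names produced above:

# Keep only the columns of interest; the names assume the cleaned header row above
subset = df[["Country List", "Solubility", "BCS Class (cLogP)", "BCS Class (logP)"]]
print(subset.to_string(index=False))

# Optionally write the whole table to CSV for later use
df.to_csv("abacavir_ddfint.csv", index=False)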

Answer (1 vote):

You can make Selenium headless by adding the --headless option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

As for iterating over a table, you can select all of the table rows and take only the relevant ones, like this:

from selenium.webdriver.common.by import By

# table_id here is the WebElement for the table you located earlier
rows = table_id.find_elements(By.TAG_NAME, "tr")
myrow = rows[5]
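
To read the cell values out of the rows, you can iterate over the td elements of each row; a minimal sketch, assuming driver already holds the results page and that the XPath locator below (purely illustrative) points at the results table:

from selenium.webdriver.common.by import By

# Locate the results table; the XPath is only an example, adjust it to the page
table = driver.find_element(By.XPATH, "//table")
rows = table.find_elements(By.TAG_NAME, "tr")

for row in rows:
    cells = [td.text.strip() for td in row.find_elements(By.TAG_NAME, "td")]
    if cells:  # skip rows without td cells (e.g. header rows using th)
        print(cells)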