我正在尝试抓取以下页面中包含的表格:https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=1&groups_id=
在表格顶部,我想将模型“1”更改为“-全部-”。 我正在编写以下代码行:
link = f"https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=- All -&groups_id="
browser.get(link)
但这不起作用。
当我用
model=- All -
替换 model=1
时,代码有效,所以我怀疑我的 - All - 选项出了问题,但我不知道是什么。
下面的完整代码,循环遍历所有目标和模型选项(上面的版本已简化):
from bs4 import BeautifulSoup,NavigableString, Tag
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv
import os
import numpy as np
os.chdir('THE DIRECTORY WHERE YOUR CHROMEDRIVER IS')
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(executable_path='THE DIRECTORY WHERE YOUR CHROMEDRIVER IS/chromedriver')
browser.get("https://predictioncenter.org/") #open page in browser
df = pd.DataFrame()
x = browser.find_elements(By.XPATH, "//a[contains(@id, 'ygtvlabelel6')]")[0].click()
x = browser.find_elements(By.XPATH, "//a[contains(@href, 'results.cgi')]")[0].click()
x = browser.find_elements(By.XPATH, "//a[contains(@id, 'a_T1024')]")[0].click()
content = browser.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
#Get all possible options
options = soup.find("select",{"name":"target"}).findAll("option")
list_prot = []
for i in options:
name = i.text
list_prot.append(name)
type_model = soup.find("select",{"name":"model"}).findAll("option")
model_t=[]
for i in type_model:
name = i.text
model_t.append(name)
mod=model_t[0]
i=0
final=pd.DataFrame()
for target in list_prot:
print(i)
link = f"https://predictioncenter.org/casp14/results.cgi?view=tables&target={target}&model={mod}&groups_id="
browser.get(link)
这是获取包含
-All-
结果的表格的一种方法:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
headers = {
'Origin': 'https://predictioncenter.org',
'Referer': 'https://predictioncenter.org/casp14/results.cgi',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
payload = {
'target': 'T1024',
'groups_id': '',
'model': '',
'submit': 'Submit',
'order': '',
'field': '',
'view': 'results',
'lga_4_view': 'brief',
'lga_5_view': 'brief',
'dsc_view': 'brief',
'dali_view': 'brief',
'molprb_view': 'brief'
}
url = 'https://predictioncenter.org/casp14/results.cgi'
r = requests.post(url, headers=headers, data=payload)
results_table = bs(r.text, 'html.parser').select_one('table[class="table_results"]')
df = pd.read_html(str(results_table))[0]
print(df)
终端结果:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 General General General General General LGA Sequence Dependent (4Å) Full LGA Sequence Dependent (4Å) Full LGA Sequence Dependent (4Å) Full LGA Sequence Independent (4Å) Full LGA Sequence Independent (4Å) Full MAMMOTH Dali Full Molprobity Full lDDT SphGr CAD CAD RPF QCS QCS SOV CE CoDM DFM Handed. TM TM FlexE ASE
1 # Model GR# GR Name Charts GDT_TS NP_P Z-M1-GDT AL0_P AL4_P Z-score Z-Score MP-Score Global score SG AA SS RPF QCS contS SOV CE CoDM DFM Handed. TMscore TMalign FlexE ASE
2 1. T1024TS427_3 427 AlphaFold2 A D I G 79.22 100.00 NaN 82.61 96.16 13.16 58.6 1.05 0.89 99.62 0.82 0.69 0.90 96.35 96.86 82.30 7.84 0.99 0.04 0.97 0.93 0.93 0.35 90.02
3 2. T1024TS427_5 427 AlphaFold2 A D I G 71.67 100.00 NaN 69.82 86.70 10.97 54.8 1.10 0.88 99.62 0.82 0.68 0.88 94.42 96.61 79.70 7.74 0.98 0.06 0.95 0.88 0.88 0.35 83.46
4 3. T1024TS226_5 226 s Zhang-TBM A D I G 65.73 100.00 NaN 71.10 85.42 11.57 41.8 1.94 0.64 84.27 0.65 0.39 0.71 82.88 81.55 83.80 7.64 0.92 0.33 0.92 0.87 0.87 3.40 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
509 508. T1024TS170_4 170 s BhageerathH-Plus A D I G 9.46 100.00 NaN 1.02 6.65 0.66 2.2 0.97 0.20 12.40 0.40 0.07 0.24 25.99 41.23 40.90 4.74 0.21 2.02 0.53 0.23 0.42 201.20 NaN
510 509. T1024TS063_5 063 s ACOMPMOD A D I G 9.02 100.00 NaN 0.26 0.51 -0.07 0 4.98 0.06 10.10 0.30 0.05 0.16 21.49 29.85 37.60 3.70 0.08 2.17 0.50 0.17 0.27 746.50 92.81
511 510. T1024TS305_1 305 s CAO-SERVER A D I G 8.76 100.00 -3.36 0.00 0.00 -0.80 0 4.74 0.11 18.67 0.41 0.07 0.23 26.36 48.22 45.30 3.50 0.09 2.32 0.48 0.22 0.30 151.85 18.11
512 511. T1024TS342_2 342 CUTSP A D I G 8.76 100.00 NaN 0.00 0.00 0.75 4.4 2.98 0.20 12.66 0.42 0.08 0.26 24.47 43.00 42.50 5.33 0.15 2.23 0.49 0.15 0.28 410.17 NaN
513 512. T1024TS217_5 217 CAO-QA1 A D I G 8.44 100.00 NaN 1.02 5.37 -1.15 0 NaN 0.08 36.92 0.42 0.00 0.10 23.51 48.36 56.20 3.70 0.21 2.00 0.50 0.20 0.32 167.41 17.24
514 rows × 29 columns
您可能想为表标题选择另一行 - 请参阅相关的 pandas 文档此处
您可以使用 selenium 从模型下拉列表中选择“- all -”,因为它需要单击下拉列表,然后使用
select_by_index()
方法选择所需的值,它应该按预期工作。
完整工作代码:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
URL = 'https://predictioncenter.org/casp14/results.cgi?view=tables&target=T1024&model=1&groups_id='
driver.get(URL)
driver.maximize_window()
time.sleep(5)
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.table > tbody tr:first-child > td:nth-child(3)'))).click()
time.sleep(2)
dropdown=Select(driver.find_element(By.CSS_SELECTOR,'select#model'))
time.sleep(2)
dropdown.select_by_index(0)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, "html.parser")
table= soup.select_one('.table_results')
df = pd.read_html(str(table))[0]
print(df)
driver.quit() # close browser
输出:
0 1 2 3 4 ... 24 25 26 27 28
0 General General General General General ... Handed. TM TM FlexE ASE
1 # Model GR# GR Name Charts ... Handed. TMscore TMalign FlexE ASE
2 # NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
3 NaN Model NaN NaN NaN ... NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ...
592 508. T1024TS170_4 170 s BhageerathH-Plus A D I G ... 0.53 0.23 0.42 201.20 NaN
593 509. T1024TS063_5 063 s ACOMPMOD A D I G ... 0.50 0.17 0.27 746.50 92.81
594 510. T1024TS305_1 305 s CAO-SERVER A D I G ... 0.48 0.22 0.30 151.85 18.11
595 511. T1024TS342_2 342 CUTSP A D I G ... 0.49 0.15 0.28 410.17 NaN
596 512. T1024TS217_5 217 CAO-QA1 A D I G ... 0.50 0.20 0.32 167.41 17.24
[597 rows x 29 columns]