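I need to extract specific information from a web page using BeautifulSoup and/or Selenium. I am trying to extract the information related to a particular organism from the page, but I am running into difficulties.
This is what I tried: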
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"
# Open a Chrome browser
driver = webdriver.Chrome()
# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"
# Navigate to the search URL
driver.get(search_url)
# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")
if elements:
    print("Text 'JCM 5058' found on the webpage.")
    # Loop through elements and extract text
    text_to_print = ""
    for element in elements:
        text_to_print += element.text + "\n"  # Add newline for readability
    # Print the extracted text
    print(text_to_print)
else:
    print("Text 'JCM 5058' not found on the webpage.")
This is the output I got:
Text 'JCM 5058' found on the webpage.
JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)
But the matching section looks like this on the web page:
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
I want to extract all of this information and print it, ideally as a table.
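To make the goal concrete, here is a minimal sketch of turning text in the shape shown above into a table, assuming the record has already been extracted as one "Label: value" pair per line. The function names `parse_record` and `format_table` and the label "Assembly name" (used for the first line, which has no label of its own) are hypothetical names chosen for illustration.

```python
def parse_record(text):
    """Turn 'Label: value' lines into a dict, splitting on the first ': ' only."""
    record = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, value = line.partition(": ")
        if sep:
            record[label] = value
        else:
            # Hypothetical label for the unlabeled first line (the assembly name).
            record["Assembly name"] = line
    return record


def format_table(record):
    """Render the dict as a two-column plain-text table."""
    width = max(len(label) for label in record)
    return "\n".join(f"{label.ljust(width)} | {value}" for label, value in record.items())


sample = """
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
"""

record = parse_record(sample)
print(format_table(record))
```

Splitting on the first ": " only keeps a value such as "Strain: JCM 5058" intact under the "Infraspecific name" label.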
While working around the problem I arrived at the following, but I don't know whether it is the right approach:
from selenium import webdriver
from bs4 import BeautifulSoup
# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"
# Open a Chrome browser
driver = webdriver.Chrome()
# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"
# Navigate to the search URL
driver.get(search_url)
# Get the page source after Selenium waits for the page to fully load
page_source = driver.page_source
# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')
# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")
# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Print the assembly information
        print(div.get_text().strip())
        break
else:
    # for/else: runs only if the loop finished without hitting break
    print("No matched section found on the webpage.")
# Close the browser
driver.quit()
This prints:
Select item 81211415.ASM1465115v1Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)Infraspecific name: Strain: JCM 5058Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)Date: 2020/09/12Assembly level: ScaffoldGenome representation: fullRelation to type material: assembly from type materialGenBank assembly accession: GCA_014651155.1 (latest) RefSeq assembly accession: GCF_014651155.1 (latest) IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
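The fields run together in that output because `get_text()` joins every text node with no separator. A possible way to keep them apart, as a minimal sketch: read each field on its own, assuming every label/value pair sits in its own `dl` element with a `dt` label and a `dd` value. The HTML fragment below is a hypothetical, simplified stand-in for one result, not the real NCBI markup, and `extract_fields` is a name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one search result.
html = """
<div class="rprt">
  <dl><dt>Organism: </dt><dd>Streptomyces anthocyanicus (high G+C Gram-positive bacteria)</dd></dl>
  <dl><dt>Infraspecific name: </dt><dd>Strain: JCM 5058</dd></dl>
  <dl><dt>GenBank assembly accession: </dt><dd>GCA_014651155.1 (latest)</dd></dl>
</div>
"""


def extract_fields(div):
    """Collect one label/value pair per <dl> element inside the given div."""
    fields = {}
    for dl in div.find_all("dl"):
        # Join the text nodes with a space so the label and value stay separated.
        text = dl.get_text(" ", strip=True)
        label, sep, value = text.partition(": ")
        if sep:
            fields[label] = value
    return fields


soup = BeautifulSoup(html, "html.parser")
fields = extract_fields(soup.find("div", class_="rprt"))
for label, value in fields.items():
    print(f"{label}: {value}")
```

With the real page, the same function could be called on the matching `div` from `assembly_divs`, provided the markup follows the assumed `dl`/`dt`/`dd` layout.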
Another simple approach is:
from selenium import webdriver
from selenium.webdriver.common.by import By
# Open a Chrome browser
driver = webdriver.Chrome()
# Load the webpage
driver.get("https://www.ncbi.nlm.nih.gov/assembly/?term=Streptomyces+anthocyanicus+JCM+5058")
# Find the element containing the GenBank assembly accession using XPath
genbank_element = driver.find_element(By.XPATH, "//dl[contains(., 'JCM 5058')]/following-sibling::dl[6]")
# Extract the GenBank assembly accession text
genbank_accession = genbank_element.text.split(": ")[1]
# Print the GenBank assembly accession
print(genbank_accession)
# Close the browser
driver.quit()
This prints:
GCA_014651155.1 (latest)