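I need to extract specific information from a web page using BeautifulSoup and/or Selenium. I am trying to pull the details for one particular organism from the page, but I am stuck.
This is what I tried: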
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"
# Open a Chrome browser
driver = webdriver.Chrome()
# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"
# Navigate to the search URL
driver.get(search_url)
# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")
if elements:
    print("Text 'JCM 5058' found on the webpage.")
    # Loop through elements and extract text
    text_to_print = ""
    for element in elements:
        text_to_print += element.text + "\n"  # Add newline for readability
    # Print the extracted text
    print(text_to_print)
else:
    print("Text 'JCM 5058' not found on the webpage.")
This is the output I got:
Text 'JCM 5058' found on the webpage.
JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)
But the matching entry looks like this on the web page:
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
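So I would like to extract all of this information and either print it or put it in a table.
Please help.
While experimenting I arrived at the following solution, but I don't know whether it is the right approach: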
from selenium import webdriver
from bs4 import BeautifulSoup
# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"
# Open a Chrome browser
driver = webdriver.Chrome()
# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"
# Navigate to the search URL
driver.get(search_url)
# Get the page source after Selenium waits for the page to fully load
page_source = driver.page_source
# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')
# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")
# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Print the assembly information
        print(div.get_text().strip())
        break
else:
    # Runs only if the loop finished without hitting break
    print("No matched section found on the webpage.")
# Close the browser
driver.quit()
This prints:
Select item 81211415.ASM1465115v1Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)Infraspecific name: Strain: JCM 5058Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)Date: 2020/09/12Assembly level: ScaffoldGenome representation: fullRelation to type material: assembly from type materialGenBank assembly accession: GCA_014651155.1 (latest) RefSeq assembly accession: GCF_014651155.1 (latest) IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
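Since the fields come out run together, one possible way to split that text back into a table is sketched below. This is only a sketch under assumptions: the label list is copied from the output above and assumed to appear in exactly this form, `LABELS` and `parse_assembly_text` are hypothetical names, and the page's actual HTML may well offer a cleaner route than splitting the flattened text with a regex.

```python
import re

# Field labels copied from the printed output; assumed to be stable on the page.
LABELS = [
    "Organism",
    "Infraspecific name",
    "Submitter",
    "Date",
    "Assembly level",
    "Genome representation",
    "Relation to type material",
    "GenBank assembly accession",
    "RefSeq assembly accession",
    "IDs",
]


def parse_assembly_text(text):
    """Split the flattened text of one search result into a {label: value} dict."""
    pattern = re.compile("(" + "|".join(re.escape(label) for label in LABELS) + r"):\s*")
    # With one capturing group, split() returns [prefix, label1, value1, label2, value2, ...]
    parts = pattern.split(text)
    record = {"Header": parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label] = value.strip()
    return record


# The text printed by the script above, pasted in as sample input.
sample = (
    "Select item 81211415.ASM1465115v1"
    "Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)"
    "Infraspecific name: Strain: JCM 5058"
    "Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)"
    "Date: 2020/09/12"
    "Assembly level: Scaffold"
    "Genome representation: full"
    "Relation to type material: assembly from type material"
    "GenBank assembly accession: GCA_014651155.1 (latest) "
    "RefSeq assembly accession: GCF_014651155.1 (latest) "
    "IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]"
)

record = parse_assembly_text(sample)

# Print as a simple two-column table
for label, value in record.items():
    print(f"{label:<28}{value}")
```

In the script above, `div.get_text().strip()` would be passed to `parse_assembly_text` in place of `sample`. The approach breaks if a value ever contains one of the labels followed by a colon, so it should be treated as a stopgap rather than a robust parser.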