Python - Google Search - How to set up flexible result selection

Question

I'm trying to scrape some pages by reaching them through Google Search, and I need to add a list of restricted words.

Let's say the top 4 Google results for Python are:

Welcome to Python.org https://www.python.org/
Python (programming language) - Wikipedia https://en.wikipedia.org/wiki/Python_(programming_language)
Python Tutorial - W3Schools https://www.w3schools.com/python/
Learn Python - Free Interactive Python Tutorial https://www.learnpython.org/

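Then I would like to open the first result that does not contain words such as [".org", "wikipedia"] in its search description and/or link (so in this case the script would open w3schools).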

I have tried different selectors, and also tried fetching the whole Google results page, but nothing has worked so far:

search = driver.find_element_by_name('q')
search.send_keys("Gran Hotel La Florida G.L Monumento")
search.send_keys(Keys.RETURN)  # hit return after you enter search text
time.sleep(5)
driver.find_element_by_class_name('LC20lb').click()

This opens the first non-ad result.

Answer 1

You can update your selector to click the required link:

driver.find_element_by_xpath('//h3[@class="LC20lb" and not(contains(text(), "org")) and not(contains(text(), "wikipedia"))]').click()

This excludes results whose h3 title text contains the substring "org" or "wikipedia".
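
If the list of excluded words changes often, the XPath can be generated rather than written by hand. The following is a minimal sketch, not part of the original answer: `build_title_xpath` is a hypothetical helper name, the class name `LC20lb` is taken from the question, and the words are assumed not to contain double quotes.

```python
def build_title_xpath(excluded, css_class="LC20lb"):
    """Build an XPath for h3 result titles that contain none of the excluded substrings."""
    # Start with the class test, then add one not(contains(...)) test per word.
    conditions = [f'@class="{css_class}"']
    conditions += [f'not(contains(text(), "{word}"))' for word in excluded]
    return "//h3[" + " and ".join(conditions) + "]"


xpath = build_title_xpath(["org", "wikipedia"])
print(xpath)
```

The resulting string can be passed to `driver.find_element_by_xpath(xpath)` as in the line above (in Selenium 4 the equivalent call is `driver.find_element(By.XPATH, xpath)`). Keep in mind that XPath 1.0 `contains()` is case-sensitive and that this expression only looks at the title text, not the link, so a title ending in "Wikipedia" would not be excluded by "wikipedia".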

Answer 2

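CSS:

Perhaps something like the following, which excludes results based on href (it also restricts matches to hrefs that start with http and drops elements with the class .fl). The :not pseudo-class is passed a list of conditions, in this case mainly substrings to exclude via the contains operator (*=).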

.r [href^=http]:not(.fl,[href*=\.org],[href*=wikipedia])
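
The same selector can be assembled from a word list. This is only a sketch under assumptions: `build_link_selector` is a hypothetical helper name, the `.r` container and `.fl` class are the ones used in this answer, and the attribute values are quoted so that the dot in ".org" needs no backslash escape.

```python
def build_link_selector(excluded, container=".r", skip_classes=(".fl",)):
    """Build a CSS selector for http links whose href contains none of the excluded substrings."""
    # Each excluded word becomes a substring test ([href*="..."]) inside :not(...).
    parts = list(skip_classes) + [f'[href*="{word}"]' for word in excluded]
    joined = ", ".join(parts)
    return f'{container} [href^="http"]:not({joined})'


selector = build_link_selector([".org", "wikipedia"])
print(selector)
```

The printed selector is `.r [href^="http"]:not(.fl, [href*=".org"], [href*="wikipedia"])`, which is the selector above with quoted attribute values.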

The following test case was tested with Google searches from several different countries:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://www.google.com/')
d.find_element_by_css_selector('[title=Search]').send_keys('python')
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[type=submit]'))).click()
WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,'.r')))
soup = bs(d.page_source, 'lxml')
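# Raw string so that the backslash in \.org reaches the CSS parser without an invalid-escape warning
links = [link['href'] for link in soup.select(r'.r [href^=http]:not(.fl,[href*=\.org],[href*=wikipedia])')]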
print(links)
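
The selector in this answer filters on href only, while the question also mentions the result description. Below is a minimal sketch of doing the filtering in Python instead, assuming the titles and links have already been extracted into (title, href) pairs. `first_allowed` is a hypothetical helper name, and the sample data is simply the four results listed in the question.

```python
def first_allowed(results, banned):
    """Return the first (title, href) pair that contains none of the banned substrings."""
    lowered = [word.lower() for word in banned]
    for title, href in results:
        # Check both the visible text and the link, ignoring case.
        haystack = f"{title} {href}".lower()
        if not any(word in haystack for word in lowered):
            return title, href
    return None  # every result was excluded


# Sample data: the four results from the question.
results = [
    ("Welcome to Python.org", "https://www.python.org/"),
    ("Python (programming language) - Wikipedia",
     "https://en.wikipedia.org/wiki/Python_(programming_language)"),
    ("Python Tutorial - W3Schools", "https://www.w3schools.com/python/"),
    ("Learn Python - Free Interactive Python Tutorial", "https://www.learnpython.org/"),
]

match = first_allowed(results, [".org", "wikipedia"])
print(match)
```

With this sample data the W3Schools entry is returned, as the question expects. The chosen href could then be opened with `d.get(href)`.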
