I'm trying to scrape data from Google Patents and am finding that it takes far too long to execute. How can I speed it up? Going through 8,000 patents has already taken 7 hours...
Here is an example of one of the patents.
I need to get data from the tables on the page below and write them to a csv file. I think the bottleneck is WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']")))
Is this wait necessary, or can I use find_elements_by_css_selector and check whether it returns anything?
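For reference, the find_elements_* methods return an empty list rather than raising when nothing matches, so the check could look like this minimal sketch (the helper name `table_present` is hypothetical, and it assumes a live `driver`):

```python
def table_present(driver):
    # find_elements_* returns [] (no exception) when nothing matches,
    # so the truthiness of the list serves as the presence check
    elems = driver.find_elements_by_css_selector(
        "div.table.style-scope.patent-result")
    return bool(elems)
```

Note that without an explicit wait this check can run before the page's JavaScript has rendered the table, so some form of wait may still be needed.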
#...
from selenium.webdriver.support import expected_conditions as EC
#...
## read file of patent numbers and initiate chrome
url = "https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1"
driver.set_page_load_timeout(20)  # set once, before the first get()
for x in patent_number:
    #url = new url with new patent number similar to above
    try:
        driver.get(url)
    except Exception:
        #--write to csv
        continue
    if "404" in driver.title: #patent number not found
        #--write to csv
        continue
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='table style-scope patent-result']")))
    except Exception:
        #--write to csv
        continue
    ## rest of code to get data from tables and write to csv
Is there a more efficient way to find out whether these tables exist on a patent page? Or would it make a difference if I used BeautifulSoup?
I'm new to web scraping, so any help would be much appreciated :)
Not sure which tables you are using, but consider that you can grab the tables with requests and pandas, and that a Session lets you re-use the connection.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

codes = ['US6403086B1','US6403086B1'] #patent numbers to come from file
with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        tables = pd.read_html(r.text)  # r.text rather than str(r.content), so the bytes are decoded properly
        print(tables) #example only. Remove later
        #here would add some tidying up to tables e.g. drop all-NaN rows, replace NaN with '' ...
        #rather than print... whatever steps to store info you want until write out
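A minimal sketch of the tidy-and-save step hinted at in the comments above, using a small stand-in DataFrame (the real one would come out of the `tables` list; the filename `patent_tables.csv` is just an example):

```python
import pandas as pd

# Stand-in for one DataFrame from the `tables` list; the second row is
# entirely empty so the tidy-up steps have something to remove.
df = pd.DataFrame({"Cited By": ["US1234567A", None],
                   "Publication date": ["2001-01-01", None]})

df = df.dropna(how="all")   # drop rows that are completely NaN
df = df.fillna("")          # replace any remaining NaN with ''
df.to_csv("patent_tables.csv", index=False)
```

From there you could append each patent's rows to one csv, or write one file per patent, depending on how you want the data organised.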