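I am trying to write a program that runs a chemical search on https://echa.europa.eu/ and retrieves the results. The "Search for Chemicals" field is in the middle of the home page. I want to search for each chemical by its CAS number (for example, 67-56-1) and get the URL of the results page. However, the URL I get does not seem to contain the CAS number I supplied.
I tried inserting a different CAS number (71-23-8) into the "p_p_id" field, but that did not return the expected search results: https://echa.europa.eu/search-for-chemicals?p_p_id=71-23-8
I also checked the headers of the GET request made by Chrome, and they do not contain the CAS number either.
Does the website store the input query in a variable? Is there a method or tool for obtaining a results URL that includes the searched CAS number?
Once I have figured this out, I will use Python to fetch the data and save it as an Excel file.
Thanks.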
You need to request the main URL once to obtain the JSESSIONID cookie, and then send a POST to https://echa.europa.eu/search-for-chemicals. The request also needs some required URL parameters:
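# CAS number to search for
query="71-23-8"
# Current time in milliseconds for the formDate field (%N requires GNU date)
millis=$(($(date +%s%N)/1000000))
# First request: store the session cookie (JSESSIONID) in cookie.txt
curl -s -I -c cookie.txt 'https://echa.europa.eu/search-for-chemicals'
# Second request: POST the search form, sending the stored cookie
curl -s -L -b cookie.txt 'https://echa.europa.eu/search-for-chemicals' \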
--data-urlencode "p_p_id=disssimplesearch_WAR_disssearchportlet" \
--data-urlencode "p_p_lifecycle=1" \
--data-urlencode "p_p_state=normal" \
--data-urlencode "p_p_col_id=column-1" \
--data-urlencode "p_p_col_count=2" \
--data-urlencode "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action=doSearchAction" \
--data-urlencode "_disssimplesearch_WAR_disssearchportlet_backURL=https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2" \
--data-urlencode "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId=" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_formDate=$millis" \
--data "_disssimplesearch_WAR_disssearchportlet_searchOccurred=true" \
--data "_disssimplesearch_WAR_disssearchportlet_sskeywordKey=$query" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer=on" \
--data "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox=on"
Using Python, and scraping the result with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
import time

url = 'https://echa.europa.eu/search-for-chemicals'
query = '71-23-8'

# The Session keeps the cookie from the first GET and sends it with the POST
s = requests.Session()
s.get(url)
r = s.post(url,
    params = {
        "p_p_id": "disssimplesearch_WAR_disssearchportlet",
        "p_p_lifecycle": "1",
        "p_p_state": "normal",
        "p_p_col_id": "column-1",
        "p_p_col_count": "2",
        "_disssimplesearch_WAR_disssearchportlet_javax.portlet.action": "doSearchAction",
        "_disssimplesearch_WAR_disssearchportlet_backURL": "https://echa.europa.eu/home?p_p_id=disssimplesearchhomepage_WAR_disssearchportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=2",
        "_disssimplesearchhomepage_WAR_disssearchportlet_sessionCriteriaId": ""
    },
    data = {
        "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(round(time.time() * 1000)),
        "_disssimplesearch_WAR_disssearchportlet_searchOccurred": "true",
        "_disssimplesearch_WAR_disssearchportlet_sskeywordKey": query,
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimer": "on",
        "_disssimplesearchhomepage_WAR_disssearchportlet_disclaimerCheckbox": "on"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find("table")

# One tuple per result row; rows without a link in the first cell are skipped
data = [
    (
        t[0].find("a").text.strip(),
        t[0].find("a")["href"],
        t[0].find("div", {"class":"substanceRelevance"}).text.strip(),
        t[1].text.strip(),
        t[2].text.strip(),
        t[3].find("a")["href"] if t[3].find("a") else "",
        t[4].find("a")["href"] if t[4].find("a") else "",
    )
    for t in (t.find_all('td') for t in table.find_all("tr"))
    if len(t) > 0 and t[0].find("a") is not None
]
print(data)
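As a small offline illustration of why the result URL never shows the CAS number: in the Python request, only the portlet routing parameters go into the query string, while the search term travels in the POST body. The sketch below rebuilds both parts with the standard library's urllib.parse. The helper name build_search_request is made up for this example, and the parameter list is shortened (backURL, sessionCriteriaId and the disclaimer fields are left out), so treat it as a way to look at the encoding rather than as a complete request.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://echa.europa.eu/search-for-chemicals"
PORTLET = "_disssimplesearch_WAR_disssearchportlet_"


def build_search_request(cas_number):
    """Return (url, body) for the search POST. Hypothetical helper, no network access."""
    # Routing parameters: these end up in the URL query string.
    params = {
        "p_p_id": "disssimplesearch_WAR_disssearchportlet",
        "p_p_lifecycle": "1",
        "p_p_state": "normal",
        "p_p_col_id": "column-1",
        "p_p_col_count": "2",
        PORTLET + "javax.portlet.action": "doSearchAction",
    }
    # Form fields: these are sent in the POST body, including the CAS number.
    data = {
        "_disssimplesearchhomepage_WAR_disssearchportlet_formDate": int(time.time() * 1000),
        PORTLET + "searchOccurred": "true",
        PORTLET + "sskeywordKey": cas_number,
    }
    return BASE_URL + "?" + urlencode(params), urlencode(data)


url, body = build_search_request("71-23-8")
print(url)   # query string: portlet parameters only
print(body)  # form body: contains the CAS number
```

Printing both strings makes it visible that the CAS number appears only in the body, which is consistent with it being absent from the address bar and from the GET request headers.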
Note that I have set the timestamp parameter (the formDate parameter) in case it is actually checked on the server.
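For the last step mentioned in the question (running several CAS numbers and saving the results), one possible sketch follows. It assumes the scraping code has been wrapped in a function that takes a CAS number and returns the list of tuples; that function is passed in as `search`. The names collect_results, write_csv and fake_search are invented here, and the column headers are placeholders because the scraping code does not name its fields. It writes CSV with the standard library only, which Excel can open directly; if a real .xlsx file is needed, pandas' DataFrame.to_excel is one option (it needs an Excel writer such as openpyxl installed).

```python
import csv
import io

# Placeholder headers: the scraper yields 7 fields per row but does not name them.
HEADER = ["query", "name", "url", "relevance",
          "col2_text", "col3_text", "col4_link", "col5_link"]


def collect_results(cas_numbers, search):
    """Run `search` for every CAS number and prefix each row with the query used."""
    rows = []
    for cas in cas_numbers:
        for row in search(cas):
            rows.append((cas, *row))
    return rows


def write_csv(rows, fileobj):
    """Write the header and all rows to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)


def fake_search(cas):
    """Stand-in for the real scraper so the example runs offline; values are dummies."""
    return [("name for " + cas, "https://example.invalid/" + cas, "", "", "", "", "")]


rows = collect_results(["67-56-1", "71-23-8"], fake_search)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open("results.csv", "w", newline="", encoding="utf-8"), and replace fake_search with the real scraping function.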