我想废弃一个网站,当我到达任何
<a>
标签时,链接是“job/undefined”,我使用post请求从页面获取数据。
在此代码中使用 postdata 发布请求:
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
a = soup1.select('div.MuiGrid-root MuiGrid-grid-xs-12 ')
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:',b)
输出样本:
<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
<a href="job/undefined" style="color:#413E52;text-decoration:none">
Network and Security engineer
</a>
</span>
部分内容是动态提供的,因此,您必须通过 api 获取作业 hashid,然后自己创建链接或使用 JSON 响应中的数据:
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'
jobs = requests.get(url, headers=headers).json()['included']['jobs']
['https://www.trustme.work/job/' + v['hashid'] for k,v in jobs.items()]
要从每个职位发布中获取链接,请更改
css selector
以选择更具体的元素,还可以尝试在类上使用静态标识符或 HTML 结构:
.select('h2 a')
要获取所有链接的
list
,请使用列表理解:
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
from bs4 import BeautifulSoup
import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
postData = {
'search': 'search',
'facets[camp_type]':'day_camp',
'open[choices-made-content]': 'true'}
url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)
soup1 = BeautifulSoup(html_1.text, 'lxml')
['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]