BeautifulSoup: href link is "undefined"

Question (votes: 0, answers: 1)

I want to scrape a website, but whenever I get to any <a> tag, the link is "job/undefined". I am using a POST request to fetch the data from the page.

The POST request is sent with postData in this code:

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}

postData = {
  'search': 'search',
  'facets[camp_type]': 'day_camp',
  'open[choices-made-content]': 'true'}

url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)

soup1 = BeautifulSoup(html_1.text, 'lxml')
# Both classes belong to the same <div>, so they must be chained with dots
a = soup1.select('div.MuiGrid-root.MuiGrid-grid-xs-12')
# Select the <span> elements that wrap each job title link
b = soup1.select('span[class="MuiTypography-root MuiTypography-h2"]')
print('soup:', b)

Sample output:

<span class="MuiTypography-root MuiTypography-h2" style="cursor:pointer">
    <a href="job/undefined" style="color:#413E52;text-decoration:none">
    Network and Security engineer
    </a>
</span>
Tags: python, selenium, web-scraping, beautifulsoup
1 Answer (3 votes)

Edit

Part of the content is provided dynamically, so you have to fetch the job hashids via the API and then build the links yourself, or use the data from the JSON response directly:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'

# The job objects live under 'included' -> 'jobs' in the JSON response
jobs = requests.get(url, headers=headers).json()['included']['jobs']

# Build the public job URL from each job's hashid
links = ['https://www.trustme.work/job/' + v['hashid'] for v in jobs.values()]
print(links)
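If the endpoint or the response shape ever changes, the chained lookups above will raise a KeyError. A slightly more defensive variant (a minimal sketch, assuming the same 'included' -> 'jobs' layout as above) might look like this:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
url = 'https://api.trustme.work/api/job_offers?include=technologies%2Cjob%2Ccompany%2Ccontract_type%2Clevel'

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # surface HTTP errors instead of parsing an error page

# .get() with defaults avoids a KeyError if the layout differs
jobs = resp.json().get('included', {}).get('jobs', {})
for job in jobs.values():
    hashid = job.get('hashid')
    if hashid:
        print('https://www.trustme.work/job/' + hashid)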

To get the link from each job posting, change the CSS selector to pick a more specific element, and prefer static identifiers or the HTML structure over the generated class names:

.select('h2 a')
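For illustration, here is how that selector behaves on a minimal, made-up fragment (not the live page):

from bs4 import BeautifulSoup

# Hypothetical markup just to demonstrate the 'h2 a' descendant selector
html = '<h2><a href="/job/abc123">Network and Security engineer</a></h2>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('h2 a').get('href'))  # /job/abc123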

To get a list of all the links, use a list comprehension:

['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]

Example

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}

postData = {
 'search': 'search',
 'facets[camp_type]': 'day_camp',
 'open[choices-made-content]': 'true'}

url = 'https://www.trustme.work/en'
html_1 = requests.post(url, headers=headers, data=postData)

soup1 = BeautifulSoup(html_1.text, 'lxml')
# Concatenate the site root with each href found under an <h2>
links = ['https://www.trustme.work' + a.get('href') for a in soup1.select('h2 a')]
print(links)
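One caveat: the href values in the sample output are relative and have no leading slash ('job/undefined'), so plain string concatenation can produce a malformed URL such as 'https://www.trustme.workjob/undefined'. The standard library's urllib.parse.urljoin resolves relative hrefs against a base URL more robustly (a small sketch, reusing soup1 from above):

from urllib.parse import urljoin

# urljoin resolves relative hrefs like 'job/undefined' against the base URL
links = [urljoin('https://www.trustme.work/', a.get('href')) for a in soup1.select('h2 a')]
print(links)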