I am trying to scrape a website for job posting data, and the output looks like this:
[{'job_title': 'Junior Data Scientist', 'company': '\n\n BBC', 'summary': "\nWe're now seeking a Junior Data Scientist to come and work with our marketing and audiences team in London. The data science team are responsible for designing...", 'link': 'www.jobsite.com', 'summary_text': 'Job Introduction\nImagine if Netflix, The Huffington Post, ESPN, and Spotify were all rolled into one...'} ... and so on.
I want to create a dataframe, or a CSV, that looks like this:

Right now, this is the loop I'm using:
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')

    results = []
    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
After the loop, I just print the results.

What would be a good way to get the output into a dataframe? Thanks!
I think you need to define the number of pages and add that into your URL (ensure you have a placeholder for the value, which I don't think your code, nor the other answers, have). I have done this by extending your URL to include a page parameter in the query string which incorporates a placeholder.
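For example (a minimal sketch; the query-string layout follows the URL used in the code further down):

# '{}' is the placeholder that receives the page number on each iteration
url_template = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'
for page in range(1, 4):
    url = url_template.format(page)  # yields ...page=1, ...page=2, ...page=3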
Is your selector of class result correct? You could certainly also use for job in soup.select('.job'):. You would then need to define the appropriate selectors to populate the values. I think it is easier to grab all of the job links for each page, then visit each page and extract the values from a JSON-like string in the page. I add in a Session to re-use the connection. Explicit waits are required to prevent being blocked.
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 3

with requests.Session() as s:
    for page in range(1, pages + 1):
        try:
            url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
            source = s.get(url, headers=headers).text
            soup = bs(source, 'lxml')
            links.append([link['href'] for link in soup.select('.job-title a')])
        except Exception as e:
            print(e, url)
        finally:
            time.sleep(2)

    final_list = [item for sublist in links for item in sublist]  # flatten the per-page link lists

    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # JSON-like string containing all the info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
Sample of results:
I can't imagine you want all pages, but in case you do, you could use something similar to:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
results = []
links = []
pages = 0

def get_links(url, page):
    page_links = []  # ensure a defined return value even if the request fails
    try:
        source = s.get(url, headers=headers).text
        soup = bs(source, 'lxml')
        page_links = [link['href'] for link in soup.select('.job-title a')]
        if page == 1:
            global pages
            pages = int(soup.select_one('.page-title span').text.replace(',', ''))
    except Exception as e:
        print(e, url)
    finally:
        time.sleep(1)
    return page_links

with requests.Session() as s:
    links.append(get_links('https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page=1', 1))
    for page in range(2, pages + 1):
        url = 'https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(page)
        links.append(get_links(url, page))

    final_list = [item for sublist in links for item in sublist]

    for link in final_list:
        source = s.get(link, headers=headers).text
        soup = bs(source, 'lxml')
        data = soup.select_one('#jobPostingSchema').text  # JSON-like string containing all the info
        item = json.loads(data)
        result = {
            'Title': item['title'],
            'Company': item['hiringOrganization']['name'],
            'Url': link,
            'Summary': bs(item['description'], 'lxml').text
        }
        results.append(result)
        time.sleep(1)

df = pd.DataFrame(results, columns=['Title', 'Company', 'Url', 'Summary'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig', index=False)
Take a look at the pandas DataFrame API. There are several ways to initialize a dataframe; one convenient option is to build it from a list of dictionaries.
You just need to append your dictionaries to a global list and you should be good to go.
import requests
import pandas
from bs4 import BeautifulSoup

results = []
for page in pages:
    source = requests.get('https://www.jobsite.co.uk/jobs?q=data+scientist&start='.format()).text
    soup = BeautifulSoup(source, 'lxml')
    for jobs in soup.findAll(class_='result'):
        result = {
            'job_title': '',  # assuming this has a value like the one you shared in the example in your question
            'company': '',
            'summary': '',
            'link': '',
            'summary_text': ''
        }
        results.append(result)

# results is now a list of dictionaries
df = pandas.DataFrame(results)
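Since your question also asks for a CSV, the dataframe can be written out in one more line (the file name jobs.csv is just an illustration):

df.to_csv('jobs.csv', index=False)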
One other suggestion: don't dump the data into a dataframe in the same program. First dump all the HTML files into a folder, then parse them in a separate pass. That way, if you need more information from the pages that you hadn't considered earlier, or if the program terminates because of some parsing error or a timeout, the work isn't lost. Keep your parsing separate from your crawling logic.
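A minimal sketch of that two-pass structure, re-using the jobsite URLs from above (the pages folder and the file-naming scheme are just illustrations):

import os
import requests
from bs4 import BeautifulSoup

urls = ['https://www.jobsite.co.uk/jobs?q=data+scientist&start=1&page={}'.format(p) for p in range(1, 4)]

# Pass 1: crawl - save the raw HTML to disk so nothing is lost on a crash
os.makedirs('pages', exist_ok=True)
for i, url in enumerate(urls):
    html = requests.get(url).text
    with open('pages/page_{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(html)

# Pass 2: parse - re-read the saved files; re-runnable without re-crawling
for name in sorted(os.listdir('pages')):
    with open(os.path.join('pages', name), encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    # ... extract whatever fields you need from soup here ...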