I want to scrape the "link", "title", and "abstract" from this page.
How can I scrape it?
I tried:
import requests
import json
url = 'http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all'
res = requests.get(url)
text = res.text
# print(text)
d = json.loads(text)
print(d['title'], d['link'], d['abstract'])
but I got:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
That URL returns HTML, not a JSON response, so you can't JSON-decode it directly.
Use BeautifulSoup:
import requests
import json
from bs4 import BeautifulSoup as bs

url = 'http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all'
res = requests.get(url)
text = res.text
soup = bs(text, "html.parser")

# The paper data is embedded as a JS array in one of the page's <script> tags
extract = soup.select('script')[6]
target = extract.decode().split('var papers = ')[1]
# Turn the JS array into individually parseable JSON objects:
# mark the boundaries between objects, then strip the surrounding [ ... ];
target2 = target.replace("}, {", "}xxx{").replace('[{', '{').replace('}];', '}')
final = target2.split('xxx')
for i in range(len(final)):
    if i == len(final) - 1:
        # The last chunk still has trailing JS ("var pid ...") to cut off
        last = final[i].split('var pid')[0]
        d = json.loads(last)
        print(d['title'], d['link'], d['abstract'])
    else:
        d = json.loads(final[i])
        print(d['title'], d['link'], d['abstract'])
Sample output:
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
http://arxiv.org/abs/1810.04805v2
We introduce a new language representation model called BERT, which stands
for Bidirectional Encoder Representations from Transformers. Unlike recent
language representation models, BERT is designed to pre-train deep
bidirectional representations from unlabeled text by jointly conditioning on
both left and right context in all layers...
and so on.
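As a side note, relying on a fixed script index like `soup.select('script')[6]` and on manual brace surgery is fragile. A more robust sketch (using a hypothetical, shortened sample of the page's inline script — the real page embeds a much longer array the same way) locates the `var papers = ` assignment with a regex and parses the whole array in a single `json.loads` call:

```python
import json
import re

# Hypothetical sample of the inline <script> content; the real page
# embeds a much larger array in the same "var papers = [...];" form.
script_text = """
var papers = [{"title": "Paper A", "link": "http://arxiv.org/abs/0000.00001",
               "abstract": "First abstract."},
              {"title": "Paper B", "link": "http://arxiv.org/abs/0000.00002",
               "abstract": "Second abstract."}];
var pid = null;
"""

# Grab everything between "var papers = " and the closing "];",
# then parse the array in one go -- no per-object string splitting needed.
match = re.search(r"var papers = (\[.*?\]);", script_text, re.DOTALL)
papers = json.loads(match.group(1))

for d in papers:
    print(d['title'], d['link'], d['abstract'])
```

On the live page you would search every `script` tag's text for the pattern instead of hard-coding an index, so the scraper survives small layout changes.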
The arxiv-sanity response seems to have changed since then. Here is an updated version of jack-fleeting's answer:
import requests
import json
from bs4 import BeautifulSoup as bs

url = 'http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all'
res = requests.get(url)
text = res.text
soup = bs(text, "html.parser")

# The script index changed: the paper data now sits in the second <script> tag
extract = soup.select('script')[1]
target = extract.decode().split('var papers = ')[1]
target2 = target.replace("}, {", "}xxx{").replace('[{', '{').replace('}];', '}')
final = target2.split('xxx')
for i in range(len(final)):
    if i == len(final) - 1:
        # The trailing JS after the array is now "var tags", not "var pid"
        last = final[i].split('var tags')[0]
        d = json.loads(last)
        # It doesn't give 'abstract' or 'link' anymore;
        # instead it gives 'id', which can be used to build the link
        link = "https://arxiv.org/abs/" + d['id']
        print(d['title'], link, d['summary'])
    else:
        d = json.loads(final[i])
        # Same here: build the link from 'id', read the text from 'summary'
        link = "https://arxiv.org/abs/" + d['id']
        print(d['title'], link, d['summary'])
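The per-item handling in both branches is identical, so it can be factored into a small helper. A minimal sketch, assuming each parsed paper dict carries `id`, `title`, and `summary` keys as in the updated response (the sample item below is hypothetical):

```python
import json

def to_record(d):
    """Return (title, link, summary) for one parsed paper dict,
    building the arxiv link from the 'id' field."""
    return (d['title'], "https://arxiv.org/abs/" + d['id'], d['summary'])

# Hypothetical sample item mirroring the shape of the updated response
sample = json.loads('{"id": "1810.04805", "title": "BERT", "summary": "..."}')
title, link, summary = to_record(sample)
print(title, link)
```

With this helper, the loop body shrinks to a single `print(*to_record(d))` call regardless of which chunk is being processed.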