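I'm trying to automatically find some URLs in the HTML of an Instagram page. I'm a Python beginner, and I can't find a way to search the HTML source for the URLs that follow "display_url": http..., as in the example further down.
I want my script to find every URL that appears after "display_url" and download it. A URL has to be extracted once for each time it appears in the source code.
With bs4 I tried: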
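import urllib.request
from bs4 import BeautifulSoup as bs

# fileURL holds the address of the Instagram post page
f = urllib.request.urlopen(fileURL)
htmlSource = f.read()
soup = bs(htmlSource, 'html.parser')
# This only finds the og:image meta tag, not the "display_url" entries
metaTag = soup.find_all('meta', {'property': 'og:image'})
imgURL = metaTag[0]['content']
urllib.request.urlretrieve(imgURL, 'fileName.jpg')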
But I can't get soup.find_all(... to search for the "display_url" entries. Is there a way to find this part of the page with bs4?
Many thanks for your help.
Here is an example of the page source my script currently works with:
<!––cropped...............-->
<body class="">
<span id="react-root"><svg width="50" height="50" viewBox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7">
<path
d="
<!––deleted part for privacy -->
" />
</svg></span>
<script type="text/javascript">
window._sharedData = {
"config": {
"csrf_token": "",
"viewer": {
<!––deleted part for privacy -->
"viewerId": ""
},
"supports_es6": true,
"country_code": "FR",
"language_code": "fr",
"locale": "fr_FR",
"entry_data": {
"PostPage": [{
"graphql": {
"shortcode_media": {
"__typename": "GraphSidecar",
<!––deleted part for privacy -->
"dimensions": {
"height": 1080,
"width": 1080
},
"gating_info": null,
"media_preview": null,
<--This is the important part, which has to be extracted as many times as it appears in the source code-->
"display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"display_resources": [{
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 640,
"config_height": 640
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 750,
"config_height": 750
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 1080,
"config_height": 1080
}],
"is_video": false,
<!––cropped...............-->
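You can find the appropriate script tag and pull the information out with a regex. I have assumed that the first script tag containing window._sharedData = is the appropriate one. You can adjust this as needed.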
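As a minimal sketch of the regex step on its own: assuming the script text has already been extracted into a string, a non-greedy capture group keeps each match limited to a single URL. The helper name `find_display_urls` and the `sample` string are made up for illustration; real pages will contain different data.

```python
import re

# Non-greedy group: stop at the first closing quote after the key.
DISPLAY_URL = re.compile(r'"display_url":\s*"(.*?)"')


def find_display_urls(script_text):
    """Return every display_url value in script_text, in order of appearance."""
    return DISPLAY_URL.findall(script_text)


# Made-up sample, minified onto one line as a served page might be.
sample = (
    '{"display_url":"https://example.com/a.jpg","is_video":false,'
    '"edges":[{"display_url": "https://example.com/b.jpg","id":"1"}]}'
)
print(find_display_urls(sample))
```

A greedy `(.*)` followed by `",` would instead run to the last `",` on the line, which on a single-line script swallows far more than one URL.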
The full version comes first; thanks to @t.h.adam, it can be shortened to the compact version given after it:
from bs4 import BeautifulSoup as bs
import re
html = '''
<html>
<head></head>
<body class="">
<span id="react-root">
<svg width="50" height="50" viewbox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7">
<path d="
<!––deleted part for privacy -->
" />
</svg></span>
<script type="text/javascript">
window._sharedData = {
"config": {
"csrf_token": "",
"viewer": {
<!––deleted part for privacy -->
"viewerId": ""
},
"supports_es6": true,
"country_code": "FR",
"language_code": "fr",
"locale": "fr_FR",
"entry_data": {
"PostPage": [{
"graphql": {
"shortcode_media": {
"__typename": "GraphSidecar",
<!––deleted part for privacy -->
"dimensions": {
"height": 1080,
"width": 1080
},
"gating_info": null,
"media_preview": null,
<--This is the important part, which has to be extracted as many times as it appears in the source code-->
"display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"display_resources": [{
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 640,
"config_height": 640
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 750,
"config_height": 750
}, {
"src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
"config_width": 1080,
"config_height": 1080
}],
"is_video": false,</script>
</body>
</html>
'''
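soup = bs(html, 'lxml')
scripts = soup.select('script[type="text/javascript"]')
data = ''
for script in scripts:
    # Use the first script tag that assigns window._sharedData
    if 'window._sharedData =' in script.text:
        data = script.text
        break
# Non-greedy group captures only the URL between the quotes
r = re.compile(r'"display_url":\s*"(.*?)"')
print(r.findall(data))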
The shortened version looks like this:
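soup = bs(html, 'lxml')
# Non-greedy group captures only the URL between the quotes
r = re.compile(r'"display_url":\s*"(.*?)"')
# string=r selects the first script tag whose content matches the pattern
data = soup.find('script', string=r).text
print(r.findall(data))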
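An alternative to the regex, offered only as a sketch: if the script on the real page contains a complete JSON object after `window._sharedData =` (the cropped sample above does not, so this is an assumption), the text can be parsed with `json` and searched recursively for every "display_url" key. The helper names and the `script_text` value are hypothetical, and the "children" key in it is made up for illustration.

```python
import json


def shared_data_from_script(script_text):
    """Parse the object literal assigned to window._sharedData as JSON."""
    start = script_text.index('{')
    end = script_text.rindex('}') + 1
    return json.loads(script_text[start:end])


def collect_values(node, key):
    """Recursively collect every value stored under key, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up script text for illustration only.
script_text = (
    'window._sharedData = {"entry_data": {"PostPage": [{"graphql": '
    '{"shortcode_media": {"display_url": "https://example.com/a.jpg", '
    '"children": [{"display_url": "https://example.com/b.jpg"}]}}}]}};'
)
data = shared_data_from_script(script_text)
print(collect_values(data, "display_url"))
```

Walking the parsed structure avoids hard-coding the nesting path, and JSON parsing also decodes any escape sequences inside the URL strings.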
My script has moved on since then. This is the code I now use to download multiple images from an Instagram URL with Pythonista 3 on iOS:
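import urllib.request
from bs4 import BeautifulSoup

# html holds the URL of the Instagram post
thepage = urllib.request.urlopen(html)
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)
# On this page the fourth text/javascript script tag was the one needed
txt = soup.select('script[type="text/javascript"]')[3]
texte = txt.get_text()
# Keep a copy of the script text on disk for inspection
with open("tet.txt", 'w') as f1:
    f1.write(texte)
with open('tet.txt', 'r') as f:
    data = f.read()
# Print the first "display_url" entry, up to the start of "display_resources"
start = data.index('"display_url":"')
end = data.index('","display_resources":') + 1
print(data[start:end])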
It's a bit finicky, but it works, and quickly too. Thanks for your help.
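The slicing above prints only the first "display_url" entry. Once a list of all the URLs is available, saving each one could look roughly like the sketch below. The function names, the numbered file-name scheme, and the .jpg extension are my assumptions (the extension follows the 'fileName.jpg' used in the question); `urllib.request.urlretrieve` is the same call the question already uses.

```python
import urllib.request


def numbered_names(count, prefix="image", ext=".jpg"):
    """Build file names such as image_1.jpg, image_2.jpg, ..."""
    return [f"{prefix}_{i}{ext}" for i in range(1, count + 1)]


def download_all(urls, prefix="image", retrieve=urllib.request.urlretrieve):
    """Save each URL to its own numbered file and return the file names."""
    names = numbered_names(len(urls), prefix)
    for url, name in zip(urls, names):
        retrieve(url, name)
    return names


# Example call (performs network requests, so it is left commented out):
# download_all(["https://example.com/a.jpg", "https://example.com/b.jpg"])
```

The `retrieve` parameter exists only so the function can be tried without network access by passing a stand-in callable.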