Finding content in an Instagram HTML page with BeautifulSoup

Question (votes: 1, answers: 3)

I have a problem finding something with bs4.

I'm trying to automatically find some URLs in an Instagram HTML page (and, I know, I'm a Python noob): I can't work out how to automatically search the HTML source for the URLs that come after "display_url": in the example below.

I'd like my script to search for the URL that appears several times as "display_url" and download each one. They have to be extracted as many times as they occur in the source code.


With bs4 I tried:

import urllib.request
from bs4 import BeautifulSoup as bs

f = urllib.request.urlopen(fileURL)
htmlSource = f.read()
soup = bs(htmlSource, 'html.parser')
# This only finds the single og:image preview tag, not every display_url
metaTag = soup.find_all('meta', {'property': 'og:image'})
imgURL = metaTag[0]['content']
urllib.request.urlretrieve(imgURL, 'fileName.jpg')

But I can't get soup.find_all(... to work on / search for that part. Is there a way for me to find this part of the page with bs4?

Thank you very much for your help.

Here is an example of my current little (Python) code:

<!--cropped...............-->
<body class="">
<span id="react-root"><svg width="50" height="50" viewBox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7">
<path d=" <!--deleted part for privacy --> " />
</svg></span>
<script type="text/javascript">
window._sharedData = {
  "config": {
    "csrf_token": "",
    "viewer": {
      <!--deleted part for privacy -->
      "viewerId": ""
    },
    "supports_es6": true,
    "country_code": "FR",
    "language_code": "fr",
    "locale": "fr_FR",
    "entry_data": {
      "PostPage": [{
        "graphql": {
          "shortcode_media": {
            "__typename": "GraphSidecar",
            <!--deleted part for privacy -->
            "dimensions": { "height": 1080, "width": 1080 },
            "gating_info": null,
            "media_preview": null,
            <!-- Here is the important part, which has to be extracted as many times as it appears in the source code -->
            "display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
            "display_resources": [{
              "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
              "config_width": 640,
              "config_height": 640
            }, {
              "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
              "config_width": 750,
              "config_height": 750
            }, {
              "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
              "config_width": 1080,
              "config_height": 1080
            }],
            "is_video": false,
<!--cropped...............-->

python web-scraping beautifulsoup find instagram
3 Answers

Answer (1 vote)

You can find the appropriate script tag and regex the info out of it. I'm assuming the first script tag containing the string below is the right one; you can fiddle with that as required.

window._sharedData =

Thanks to @t.h.adam, this can be shortened; see the condensed version in the follow-up below.

from bs4 import BeautifulSoup as bs
import re

html = '''
<html>
 <head></head>
 <body class=""> 
  <span id="react-root">
   <svg width="50" height="50" viewbox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7"> 
    <path d="

        <!--deleted part for privacy -->

         " /> 
   </svg></span> 
  <script type="text/javascript">
    window._sharedData = {
      "config": {
        "csrf_token": "",
        "viewer": {

        <!--deleted part for privacy -->

        "viewerId": ""
      },
      "supports_es6": true,
      "country_code": "FR",
      "language_code": "fr",
      "locale": "fr_FR",
      "entry_data": {
        "PostPage": [{
          "graphql": {
            "shortcode_media": {
              "__typename": "GraphSidecar",

     <!--deleted part for privacy -->

              "dimensions": {
                "height": 1080,
                "width": 1080
              },
              "gating_info": null,
              "media_preview": null,

<!-- Here is the important part, which has to be extracted as many times as it appears in the source code -->

              "display_url": "https://scontent-cdt1-1.cdninstagram.com/vp/",
              "display_resources": [{
                "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
                "config_width": 640,
                "config_height": 640
              }, {
                "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
                "config_width": 750,
                "config_height": 750
              }, {
                "src": "https://scontent-cdt1-1.cdninstagram.com/vp/",
                "config_width": 1080,
                "config_height": 1080
              }],
              "is_video": false,</script>
 </body>
</html>
'''

soup = bs(html, 'lxml')
scripts = soup.select('script[type="text/javascript"]')
for script in scripts:
    if ' window._sharedData =' in script.text:
        data = script.text
        break
# Capture only the URL itself; the original r'"display_url":(.*)",' also
# grabbed the surrounding space and quote characters
r = re.compile(r'"display_url":\s*"([^"]+)"')
print(r.findall(data))
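Once the matching script text is in hand, the same capturing pattern returns every URL, and each match can then be saved. A minimal, self-contained sketch: the sample string and example.com URLs below are stand-ins for the real extracted data, and the download loop is only indicated in a comment since it needs network access.

```python
import re

# Stand-in for the script text extracted above; real pages carry
# full cdninstagram URLs instead of these example.com placeholders.
data = '''window._sharedData = {"graphql": {"shortcode_media": {
  "display_url": "https://example.com/a.jpg",
  "display_resources": [{"src": "https://example.com/a_640.jpg"}],
  "edge": {"display_url": "https://example.com/b.jpg"}}}}'''

# Capture only the URL itself, tolerating optional whitespace after the colon.
pattern = re.compile(r'"display_url":\s*"([^"]+)"')
urls = pattern.findall(data)
print(urls)  # ['https://example.com/a.jpg', 'https://example.com/b.jpg']

# Each match could then be downloaded with a numbered filename, e.g.:
# for i, url in enumerate(urls):
#     urllib.request.urlretrieve(url, 'img%d.jpg' % i)
```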

Answer (0 votes)

The program has moved forward; it now looks like this:

soup = bs(html, 'lxml')
r = re.compile(r'"display_url":\s*"([^"]+)"')
data = soup.find('script', text=r).text
print(r.findall(data))

But now some new questions have come up:

  • How can I make the URL-finding part of the program (lines 10-11) repeat for as long as '"display_url":"…","display_resources":' appears in the tet.txt file?
  • A while loop could be used, but how do I make it repeat the process?
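One possible answer to the repetition question: a plain while loop over str.find can walk the text and pull out every URL sitting between the two markers. A sketch, using a hypothetical two-URL sample string in place of the real tet.txt contents:

```python
# Walk the text with str.find, extracting the URL between
# '"display_url":"' and '","display_resources":' on every pass.
text = ('"display_url":"https://example.com/1.jpg","display_resources":'
        '... "display_url":"https://example.com/2.jpg","display_resources":')

urls = []
start = 0
while True:
    i = text.find('"display_url":"', start)
    if i == -1:
        break  # no more occurrences
    i += len('"display_url":"')                   # jump past the key
    j = text.find('","display_resources":', i)    # end of the URL
    urls.append(text[i:j])
    start = j + 1                                 # continue after this match
print(urls)  # ['https://example.com/1.jpg', 'https://example.com/2.jpg']
```

re.findall would do the same in one line, but the loop shows explicitly how the process repeats until find returns -1.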

Answer (0 votes)

Problem solved!

Here is the code for downloading multiple images from an Instagram URL, using Pythonista 3 on iOS:

import urllib.request
from bs4 import BeautifulSoup

thepage = urllib.request.urlopen(html)  # html holds the Instagram post URL
soup = BeautifulSoup(thepage, "html.parser")
print(soup.title.text)
# The fourth <script> tag on this page holds window._sharedData
txt = soup.select('script[type="text/javascript"]')[3]
texte = txt.get_text()
with open("tet.txt", 'w') as f1:
    f1.write(texte)
with open('tet.txt', 'r') as f:
    data = f.read()
# Print from the first "display_url" key up to and including its closing quote
print(data[data.index('"display_url":"'):data.index('","display_resources":')+1])

It's a bit finicky, but it also works fast. Thanks for your help.
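The final print above extracts only the first occurrence. A hedged sketch of how the same saved text could instead be looped over to number and save every image; an inline sample stands in for the tet.txt contents here, and the urlretrieve call is left commented out since it needs network access:

```python
import re

# Stand-in for the script text saved to tet.txt above.
data = ('"display_url": "https://example.com/p1.jpg", "display_resources": [], '
        '"display_url": "https://example.com/p2.jpg", "display_resources": []')

# Pull out every display_url value, then save each under a numbered name.
found = re.findall(r'"display_url":\s*"([^"]+)"', data)
for i, url in enumerate(found, start=1):
    print('image%d.jpg' % i, '<-', url)
    # urllib.request.urlretrieve(url, 'image%d.jpg' % i)  # uncomment to download
```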

© www.soinside.com 2019 - 2024. All rights reserved.