Want to scrape image URLs from JSON data


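Hi, I'm trying to scrape only the image URLs of each product from this JSON structure: only those with a .jpg extension, together with the name available in "alt". For example (also shown below), the path is "attributes" > "media_map" > ("b", "c", "d", "e", whichever are available) > "src", and then "medium", "lg", "xl", "xxl".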

```
              "a218": {
                "label": "Shape",
                "field_type": "button_select",
                "value_order": [
                  "v766",
                  "v767"
                ],
                "values": {
                  "v766": {
                    "label": "Round",
                    "value": "S6CBRO",
                    "price": 35
                  },
                  "v767": {
                    "label": "Rectangle",
                    "value": "S6CBRE",
                    "price": 35,
                    "hypotheticalPrice": 24.5
                  }
                }
              }
            },
            "inventory": {
              "stock": 0,
              "sold": 0,
              "total": 0
            },
            "optional": {},
            "media_map": {
              "b": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/xVx1vleu7iLcR79ZkRZKqQiSzZE/w_125/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "lg": "https://ctl.s6img.com/society6/img/W-ESMqUtC_oOEUjx-1E_SyIdueI/w_550/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xl": "https://ctl.s6img.com/society6/img/z90VlaYwd8cxCqbrZ1ttAxINpaY/w_700/artwork/~artwork/s6-0041/a/18613683_5971445",
                  "xxl": null
                },
                "type": "image",
                "alt": "I'M NOT ALWAYS A BITCH (Red) Cutting Board",
                "meta": null
              },
              "c": {
                "src": {
                  "xs": "https://ctl.s6img.com/society6/img/KQJbb4jG0gBHcqQiOCivLUbKMxI/w_125/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "lg": "https://ctl.s6img.com/society6/img/ztGrxSpA7FC1LfzM3UldiQkEi7g/w_550/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xl": "https://ctl.s6img.com/society6/img/PHjp9jDic2NGUrpq8k0aaxsYZr4/w_700/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg",
                  "xxl": "https://ctl.s6img.com/society6/img/m-1HhSM5CIGl6DY9ukCVxSmVDIw/w_1500/cutting-board/rectangle/lifestyle/~artwork,fw_1572,fh_2500,fx_93,fy_746,iw_1386,ih_2142/s6-0041/a/18613725_13086827/~~/im-not-always-a-bitch-red-cutting-board.jpg"
```
Below is my code. I'm able to access "media_map", but I don't know how to get the URLs with a .jpg extension.

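```
import csv
import json
from urllib.request import urlopen

from bs4 import BeautifulSoup

contents = []
with open('urls.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url row to list contents

newlist = []
for url in contents:
    try:
        page = urlopen(url[0]).read()
        soup = BeautifulSoup(page, 'html.parser')
        # Take the 8th <script> tag and drop its first 24 characters to leave the JSON
        scripts = soup.find_all('script')[7].text.strip()[24:]
        data = json.loads(scripts)
        link = data['product']['response']['product']['data']['attributes']['media_map']
    except Exception as exc:  # the posted snippet was cut off before its except clause
        print(url[0], exc)
```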

Every product has "b", "c", "d" or "b", "c", "d", "e", "f", and some products have only "b" and "c".
I'm new to scraping and I'm stuck at this point.
python arrays web-scraping beautifulsoup
1 Answer

Instead of

link = data['product']['response']['product']['data']['attributes']['media_map']

use

mediaMap = data['product']['response']['product']['data']['attributes']['media_map']

Then you can extract whatever you want from mediaMap.

If you want the alts:

mediaAlts = [m['alt'] for m in mediaMap.values() if 'alt' in m]

(If you only want the first one, take mediaAlts[0].)
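
Note that `mediaAlts[0]` raises an `IndexError` when no entry has an alt. Here is a minimal sketch of a variant that falls back to `None` instead. The `mediaMap` literal is made-up sample data shaped like the question's JSON, and `firstAlt` and `noAlt` are only illustrative names:

```python
# Made-up sample shaped like the question's media_map (URLs are placeholders).
mediaMap = {
    "b": {"src": {"xs": "https://example.com/a/1"}, "type": "image",
          "alt": "Red Cutting Board", "meta": None},
    "c": {"src": {"xs": "https://example.com/a/2.jpg"}, "type": "image",
          "alt": "Red Cutting Board (lifestyle)", "meta": None},
}

# next() with a default returns the first alt, or None if there is none.
firstAlt = next((m["alt"] for m in mediaMap.values() if "alt" in m), None)
noAlt = next((m["alt"] for m in {}.values() if "alt" in m), None)

print(firstAlt)  # Red Cutting Board
print(noAlt)     # None
```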

Or, if you only want the alts of images:

imgAlts = [
    m['alt'] for m in mediaMap.values() if 'alt' in m 
    and 'type' in m and m['type'] == 'image'
]

If you want all the src links from the first object in media_map:

m1srcs = list(list(mediaMap.values())[0]['src'].values())

To filter these down to only jpg:

m1srcs = [s for s in m1srcs if isinstance(s, str) and s.endswith('.jpg')]
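
`endswith('.jpg')` is case-sensitive and would miss a URL that carries a query string after the file name. Here is a hedged sketch of a slightly more tolerant check built on the standard library's `urllib.parse.urlparse`. The helper name `isJpg` and the sample URLs are assumptions for illustration, not part of the data in the question:

```python
from urllib.parse import urlparse

def isJpg(s):
    """True if s is a string URL whose path ends in .jpg (any letter case)."""
    if not isinstance(s, str):
        return False  # e.g. a null "xxl" entry, which json.loads turns into None
    return urlparse(s).path.lower().endswith(".jpg")

# Placeholder values mimicking one "src" object from the question.
srcs = {
    "xs": "https://example.com/w_125/artwork/18613683_5971445",
    "lg": "https://example.com/w_550/board.jpg",
    "xl": "https://example.com/w_700/board.JPG?v=2",
    "xxl": None,
}

m1srcs = [s for s in srcs.values() if isJpg(s)]
print(m1srcs)
```

With this sample, only the "lg" and "xl" URLs survive the filter. If the site's URLs always end in a lowercase `.jpg`, the plain `endswith` check is enough.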


Edit:

For all jpg images that have alts:

altJpgs = [
    s
    for mv in mediaMap.values()
    if isinstance(mv, dict) and 'src' in mv
    and 'alt' in mv  # has alt
    and 'type' in mv and mv['type'] == 'image'  # has type listed as image
    for s in mv['src'].values()
    if isinstance(s, str) and s.endswith('.jpg')
]

Or, in this case, a for loop may be more readable than a list comprehension:

altJpgs = []

for mv in mediaMap.values():
    if type(mv) != dict or 'src' not in mv: continue 
    if 'alt' not in mv: continue 
    if 'type' not in mv or mv['type'] != 'image': continue

    for s in mv['src'].values():
        if type(s) == str and s.endswith('.jpg'):
            altJpgs.append(s)

(Edit or remove any of the `if ...` lines to adjust the filter.)
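
Since the question asks for the alt name along with the jpg URLs, the loop above could be adapted to keep the two together. This is a minimal sketch, assuming each entry looks like the sample in the question. `altToJpgs` and the placeholder data are illustrative and not from the original answer:

```python
def altToJpgs(mediaMap):
    """Map each image's alt text to the list of its .jpg src URLs."""
    result = {}
    for mv in mediaMap.values():
        if not isinstance(mv, dict) or mv.get("type") != "image":
            continue
        if "alt" not in mv or not isinstance(mv.get("src"), dict):
            continue
        jpgs = [s for s in mv["src"].values()
                if isinstance(s, str) and s.endswith(".jpg")]
        if jpgs:  # skip entries such as "b" whose URLs have no .jpg extension
            result.setdefault(mv["alt"], []).extend(jpgs)
    return result

# Placeholder data shaped like the question's media_map.
sample = {
    "b": {"src": {"xs": "https://example.com/a/1", "xxl": None},
          "type": "image", "alt": "Cutting Board", "meta": None},
    "c": {"src": {"xs": "https://example.com/c/xs.jpg",
                  "lg": "https://example.com/c/lg.jpg"},
          "type": "image", "alt": "Cutting Board", "meta": None},
}

for alt, urls in altToJpgs(sample).items():
    print(alt, urls)
```

Several entries can share the same alt text, which is why the URLs are collected into one list per alt with `setdefault` rather than overwritten.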
