Python和BeautifulSoup 4 - 循环返回重复的结果

问题描述 投票:0回答:2

我正试图从6pm.com刮掉而且我遇到了一个问题 - 我的循环似乎正在返回重复的结果,例如当不同的产品只出现一次时,它会多次重复使用同一产品。

这是我的代码:

url_list1 = ['https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=1',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=2',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=3',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=4',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=5',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=6',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=7',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=8',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=9',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=10',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=11',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=12',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=13',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=14',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=15',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=16',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=17',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=18',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=19',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=20',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=21',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=22',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=23',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=24',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=25',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=26',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=27',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=28',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=29',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=30',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=31',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=32',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=33',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=34',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=35',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=36',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=37',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=38',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=39',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=40',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=41',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=42',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=43',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=44',
         'https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?&p=45'
]


url_list2 = []

for url1 in url_list1:
    data1 = requests.get(url1)
    soup1 = BeautifulSoup(data1.text, 'html.parser')


    productUrls = soup1.findAll('article')


    for url2 in productUrls:
        get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)

print(url_list2)

所以第一部分(url_list1)基本上是一个链接列表。每个链接指向包含所选品牌的100种产品的页面。当我点击每个链接并在浏览器中打开时,每个页面包含不同的产品,并且没有重复项(我知道)。

接下来,我初始化一个空列表(url_list2),我正在尝试存储所有实际的产品URL(因此该列表应包含46个页面* 100个产品=大约4600个产品URL)。

第一个“for”循环遍历url_list1中的每个链接。 productUrls变量是一个列表,它应该在46个页面中的每一个上存储所有“article”元素。

第二个嵌套的“for”循环遍历productUrls列表并构造实际的产品URL。然后应该将构造的产品URL附加到我之前初始化的空列表url_list2。

使用print语句测试结果,我注意到产品是重复的而不是不同的。

如果在我的浏览器中在url_list1中手动打开每个网址,为什么会发生这种情况?我可以在每个页面上看到不同的产品而不注意任何重复项?

任何和所有的帮助非常感谢。

python web-scraping beautifulsoup
2个回答
1
投票

你可以为这个场景做得更好。你不需要把所有urls列在列表中。请尝试下面的代码,这是你可以实现结果的简单方法。

from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent':
       'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso"

url_list2 = []
page_num = 1
session = requests.Session()
while page_num <47:
    pageTree = session.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    productUrls = pageSoup.findAll('article')
    for url2 in productUrls:
        get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
        url_list2.append(get_urls)

    page = "https://www.6pm.com/filters/men-shoes/CK_XAVpbwAOoCe4VJIQl5AeABqkZoB4_9RFN-hasCYIahQlrmQO7H4gkb7sa5RV9uQ-LIf8DgAG2BqcfmAGBD-IEoAGTBqcHoALTEqsBjBsBgw_SBs0Z4QbsHK0UyiHvJMABAuICAwELGA.zso?p={}".format(page_num)
    page_num +=1

print(url_list2)
print(len(url_list2))

如果有帮助,请告诉我。


1
投票

发生的情况是,您在浏览器中看到的页面与请求所获得的页面不同。要解决此问题,您必须使会话(请求)保持活动状态。

试试这个,它对我有用。用以下方法替换你的big for循环:

with requests.Session() as s:    # <--- here we create a session that stays alive
        for url1 in url_list1:
            data1 = s.get(url1)  # <--- here we call the links with the same session
            soup1 = BeautifulSoup(data1.text, 'html.parser')

            productUrls = soup1.findAll('article')

            for url2 in productUrls:
                get_urls = "https://www.6pm.com"+url2.find('a', attrs={'itemprop': 'url'})['href']
                url_list2.append(get_urls)

祝好运 !

© www.soinside.com 2019 - 2024. All rights reserved.