我需要过滤掉 Grailed 上包含 0 个列表的所有网页,我有超过 500k 个 URL 需要浏览。
我正在使用Python和Selenium,我的问题是对于每个新网页,脚本需要单击cookie和用户登录弹出窗口来访问列表数量。结果是每个网页大约需要 13 秒来处理。对于 50 万个 URL,这将需要 75 天,而我没有。
这是我第一次进行网络抓取/编码/使用Python,所以我可能错过了很多明显的调整。
示例链接为:https://www.grailed.com/designers/acne-studios/casual-pants
所有 50 万个链接是:https://www.grailed.com/designers/designer-name/category-name
目前我正在考虑两种可能的方法:
尝试阻止 cookie 和用户登录弹出窗口。然而,我不确定如果不保存某种用户配置文件是否可以实现这一点,之后我担心我会被 Grailed 阻止。
同时运行多个实例,最好在 13(约 2 周)到 130(约 14 小时)之间。但是我不确定这会产生什么影响,它的成本会很高,以及如何避免被阻止,我是否需要为此使用代理?
虽然这两种方法对我来说似乎是显而易见的,但如果我遗漏了什么,请告诉我。
我的代码如下:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
import os
import time
# Update the PATH environment variable
os.environ['PATH'] += r";C:\Users\rafme\Desktop\Selenium Drivers"
# Read the CSV file
BrandCategoryLinks = pd.read_csv('C:/Users/rafme/Downloads/Test Brands & Categories.csv')
FilteredCategoryLink = []
# Loop through each link in the DataFrame
for index, link in BrandCategoryLinks.iterrows():
driver = None
try:
base_url = link['Links']
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-gpu") # Disable GPU usage
chrome_options.add_argument("--no-sandbox") # Disable sandboxing
chrome_options.add_argument("--disable-dev-shm-usage") # Disable shared memory usage
chrome_options.add_argument("--window-size=1920x1080") # Set the window size
chrome_options.add_argument("--headless")
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
service = Service(r"C:\Users\rafme\Desktop\Selenium Drivers\chromedriver.exe")
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.get(base_url)
timeout = 60 # Increase timeout
try:
WebDriverWait(driver, timeout).until(EC.presence_of_element_located((By.ID, "onetrust-reject-all-handler")))
reject_button = driver.find_element(By.ID, "onetrust-reject-all-handler")
# Scroll the element into view using JavaScript
driver.execute_script("arguments[0].scrollIntoView(true);", reject_button)
time.sleep(2) # Wait for the scrolling to complete
# Click the element
reject_button.click()
time.sleep(1)
reject_button.click()
time.sleep(1)
except (NoSuchElementException, ElementClickInterceptedException):
pass
except Exception as e:
print(f"Error occurred: {e}")
continue
# Close the user login modal if it exists
try:
elem = driver.find_element(By.XPATH, "//div[@class='Modal-Content']")
ac = ActionChains(driver)
ac.move_to_element(elem).move_by_offset(250, 0).click().perform() # clicking away from login window
except NoSuchElementException:
pass
except Exception as e:
print(f"Error clicking 'User Authentication' button: {e}")
continue
# Check listing count
try:
listing_count = driver.find_elements(By.XPATH,
"//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
if len(listing_count) > 1:
print(f"Found {len(listing_count)} listings on {base_url}")
FilteredCategoryLink.append(base_url)
else:
print(f"Found {len(listing_count)} listings on {base_url}, not enough to keep.")
except Exception as e:
print(f"Error finding listings: {e}")
continue
except Exception as e:
print(f"Error processing link {link}: {e}")
finally:
if driver:
driver.quit()
# Save the filtered categories to CSV
filtered_categories = pd.DataFrame(FilteredCategoryLink, columns=['Link'])
filtered_categories.to_csv('filtered_categories.csv', index=False)
非常感谢大家抽出时间来解决我的问题!
正如评论中所建议的,最好使用 Python 的
requests
库通过 API 提取数据。
该网站目前约有 12k 设计师和 128 子类别,这将导致多达 1.5M 数据点。以下是显着加快速度的 3 个步骤:
requests
,每个请求只需约 0.3 秒,与 13 秒相比,这又增加了 43x 的改进。将这些东西放在一起,比原来的实现速度提高了 ~180,000x。换句话说,提取所有数据需要稍微多于一分钟。
希望这能提供一些有用的见解。
import requests
import json
from urllib.parse import quote
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'X-Algolia-Api-Key': 'bc9ee1c014521ccf312525a4ef324a16',
'X-Algolia-Application-Id': 'MNRWEFSS2Q'}
url_designers = 'https://www.grailed.com/api/designers'
req_designers = requests.get(url_designers, headers=headers)
designers = json.loads(req_designers.text)['data']
url_api = 'https://mnrwefss2q-dsn.algolia.net/1/indexes/*/queries'
data = []
for des in designers:
facetFilters = quote(f'[["designers.name:{des['name']}"]]')
facets = quote('["category_path"]')
payload = '{"requests":[{"indexName": "Listing_by_low_price_production", "params": "maxValuesPerFacet=200&hitsPerPage=0&facetFilters=%s&facets=%s"}]}' % (facetFilters, facets)
req = requests.post(url_api, headers=headers, data=payload)
listings = json.loads(req.text)['results'][0]['facets']['category_path']
data.append({des['name']: listings})
data
输出如下所示:
[
...
{'Acne Studios': {'bottoms.denim': 4644, 'tops.sweaters_knitwear': 2266, 'tops.sweatshirts_hoodies': 1658, 'womens_bottoms.jeans': 1122, 'tops.short_sleeve_shirts': 1087, 'bottoms.casual_pants': 1078, 'tops.button_ups': 960, 'outerwear.light_jackets': 591, 'womens_tops.sweaters': 557, 'footwear.lowtop_sneakers': 331, 'tops.long_sleeve_shirts': 295, 'outerwear.heavy_coats': 289, 'bottoms.shorts': 288, 'outerwear.denim_jackets': 279, 'outerwear.bombers': 257, 'womens_bottoms.pants': 211, 'outerwear.leather_jackets': 207, 'accessories.hats': 188, 'womens_tops.sweatshirts': 160, 'womens_tops.short_sleeve_shirts': 159, 'tailoring.blazers': 142, 'womens_footwear.boots': 140, 'outerwear.parkas': 122, 'womens_dresses.midi': 116, 'bottoms.sweatpants_joggers': 108, 'tops.polos': 107, 'accessories.gloves_scarves': 102, 'footwear.hitop_sneakers': 94, 'womens_outerwear.jackets': 91, 'womens_tops.blouses': 90, 'womens_outerwear.coats': 86, 'footwear.boots': 83, 'womens_tops.button_ups': 80, 'bottoms.cropped_pants': 76, 'tops.sleeveless': 74, 'womens_dresses.mini': 65, 'womens_footwear.lowtop_sneakers': 65, 'accessories.bags_luggage': 64, 'womens_tops.long_sleeve_shirts': 64, 'womens_outerwear.denim_jackets': 60, 'accessories.sunglasses': 57, 'womens_outerwear.blazers': 55, 'footwear.leather': 53, 'womens_accessories.scarves': 53, 'womens_outerwear.leather_jackets': 52, 'womens_bottoms.mini_skirts': 47, 'womens_bottoms.midi_skirts': 44, 'tailoring.suits': 41, 'womens_dresses.maxi': 41, 'womens_accessories.hats': 40, 'womens_tops.hoodies': 40, 'womens_tops.tank_tops': 38, 'womens_bottoms.shorts': 37, 'outerwear.vests': 35, 'womens_outerwear.bombers': 31, 'footwear.formal_shoes': 29, 'womens_footwear.heels': 29, 'accessories.jewelry_watches': 25, 'tailoring.formal_trousers': 24, 'womens_tops.crop_tops': 22, 'womens_tops.polos': 22, 'outerwear.raincoats': 19, 'womens_outerwear.down_jackets': 18, 'outerwear.cloaks_capes': 17, 'womens_accessories.miscellaneous': 17, 'womens_bags_luggage.shoulder_bags': 17, 'accessories.misc': 16, 'accessories.wallets': 16, 'footwear.slip_ons': 15, 'womens_footwear.sandals': 14, 'womens_accessories.sunglasses': 13, 'womens_bags_luggage.tote_bags': 12, 'womens_bottoms.joggers': 12, 'accessories.belts': 11, 'accessories.glasses': 11, 'womens_footwear.flats': 11, 'footwear.sandals': 10, 'tops.jerseys': 10, 'womens_footwear.hitop_sneakers': 10, 'womens_footwear.platforms': 9, 'womens_bottoms.leggings': 8, 'womens_bottoms.maxi_skirts': 8, 'accessories.socks_underwear': 7, 'bottoms.swimwear': 7, 'womens_accessories.belts': 7, 'womens_outerwear.vests': 7, 'bottoms.jumpsuits': 6, 'womens_footwear.slip_ons': 6, 'womens_bags_luggage.crossbody_bags': 5, 'womens_bottoms.sweatpants': 5, 'tailoring.vests': 4, 'womens_accessories.socks_intimates': 4, 'womens_accessories.wallets': 4, 'womens_bags_luggage.handle_bags': 4, 'womens_dresses.gowns': 4, 'accessories.periodicals': 3, 'accessories.ties_pocketsquares': 3, 'bottoms.leggings': 3, 'tailoring.formal_shirting': 3, 'womens_bags_luggage.clutches': 3, 'womens_bags_luggage.mini_bags': 3, 'womens_bags_luggage.other': 3, 'womens_jewelry.necklaces': 3, 'womens_outerwear.fur_faux_fur': 3, 'bottoms': 2, 'womens_bags_luggage.backpacks': 2, 'womens_bags_luggage.bucket_bags': 2, 'womens_bottoms.jumpsuits': 2, 'womens_footwear.mules': 2, 'womens_jewelry.bracelets': 2, 'womens_jewelry.earrings': 2, 'tailoring.tuxedos': 1, 'womens_accessories.glasses': 1, 'womens_accessories.hair_accessories': 1, 'womens_jewelry.body_jewelry': 1, 'womens_jewelry.rings': 1, 'womens_outerwear.rain_jackets': 1, 'womens_tops.bodysuits': 1}},
{'A.Coba.Lt': {'footwear.boots': 1, 'tops.sweatshirts_hoodies': 1}},
{'A Cold Wall': {'tops.short_sleeve_shirts': 293, 'tops.sweatshirts_hoodies': 280, 'footwear.lowtop_sneakers': 187, 'bottoms.sweatpants_joggers': 183, 'outerwear.light_jackets': 148, 'tops.long_sleeve_shirts': 133, 'accessories.bags_luggage': 129, 'bottoms.casual_pants': 108, 'tops.sweaters_knitwear': 71, 'accessories.hats': 61, 'outerwear.vests': 61, 'footwear.boots': 56, 'bottoms.shorts': 54, 'footwear.hitop_sneakers': 52, 'outerwear.heavy_coats': 48, 'tops.button_ups': 47, 'outerwear.raincoats': 30, 'accessories.belts': 24, 'bottoms.denim': 21, 'accessories.misc': 18, 'outerwear.denim_jackets': 18, 'outerwear.parkas': 17, 'outerwear.bombers': 14, 'accessories.gloves_scarves': 13, 'footwear.leather': 11, 'tops.polos': 11, 'footwear.slip_ons': 10, 'womens_bottoms.midi_skirts': 10, 'accessories.jewelry_watches': 8, 'accessories.sunglasses': 7, 'accessories.socks_underwear': 6, 'accessories.wallets': 6, 'tops.sleeveless': 6, 'bottoms.cropped_pants': 5, 'footwear.sandals': 5, 'bottoms.leggings': 4, 'outerwear.cloaks_capes': 4, 'tops.jerseys': 4, 'tailoring.blazers': 3, 'womens_bottoms.jeans': 3, 'womens_footwear.boots': 3, 'womens_footwear.hitop_sneakers': 3, 'accessories.periodicals': 2, 'footwear.formal_shoes': 2, 'womens_bottoms.shorts': 2, 'womens_outerwear.rain_jackets': 2, 'womens_tops.sweaters': 2, 'accessories.glasses': 1, 'bottoms.jumpsuits': 1, 'bottoms.swimwear': 1, 'tailoring.suits': 1, 'womens_bottoms.leggings': 1, 'womens_outerwear.vests': 1, 'womens_tops.button_ups': 1}}
...
]