Beautiful Soup: how to scrape multiple URLs and save the data to a csv file

Question (votes: 0, answers: 2)

So I'm wondering how to scrape multiple websites/URLs and save the data to a csv file. Right now I can only save the first page. I've tried many different approaches, but none of them seem to work. How do I get all five pages into the csv file, instead of just one?

import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time

urls = ['https://store.steampowered.com/search/?specials=1&page=1',
        'https://store.steampowered.com/search/?specials=1&page=2',
        'https://store.steampowered.com/search/?specials=1&page=3',
        'https://store.steampowered.com/search/?specials=1&page=4',
        'https://store.steampowered.com/search/?specials=1&page=5']

for url in urls:   
    my_url = requests.get(url) 
    html = my_url.content
    soup = BeautifulSoup(html,'html.parser')

    data = []
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S') 

    for container in soup.find_all('div', attrs={'class':'responsive_search_name_combined'}):
        title = container.find('span',attrs={'class':'title'}).text

        if container.find('span',attrs={'class':'win'}):
            win = '1'
        else:
            win = '0'

        if container.find('span',attrs={'class':'mac'}):
            mac = '1'
        else:
            mac = '0'

        if container.find('span',attrs={'class':'linux'}):
            linux = '1'
        else:
            linux = '0'

        data.append({
            'Title':title.encode('utf-8'),
            'Time':st,
            'Win':win,
            'Mac':mac,
            'Linux':linux})

with open('data.csv', 'w',encoding='UTF-8', newline='') as f:
    fields = ['Title','Win','Mac','Linux','Time']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)
testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
python url web-scraping beautifulsoup
2 Answers
2 votes

The problem is that you re-initialize `data` on every URL. Then you write it out after the final iteration, which means you will only ever have the data from the last URL. You need to accumulate the data across iterations so it doesn't get overwritten:

import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time

urls = ['https://store.steampowered.com/search/?specials=1&page=1',
        'https://store.steampowered.com/search/?specials=1&page=2',
        'https://store.steampowered.com/search/?specials=1&page=3',
        'https://store.steampowered.com/search/?specials=1&page=4',
        'https://store.steampowered.com/search/?specials=1&page=5']

results_df = pd.DataFrame() #<-- initialize a results dataframe to dump/store the data you collect after each iteration
for url in urls:   
    my_url = requests.get(url) 
    html = my_url.content
    soup = BeautifulSoup(html,'html.parser')

    data = []  #<-- your data list is "reset" after each iteration of your urls
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S') 

    for container in soup.find_all('div', attrs={'class':'responsive_search_name_combined'}):
        title = container.find('span',attrs={'class':'title'}).text

        if container.find('span',attrs={'class':'win'}):
            win = '1'
        else:
            win = '0'

        if container.find('span',attrs={'class':'mac'}):
            mac = '1'
        else:
            mac = '0'

        if container.find('span',attrs={'class':'linux'}):
            linux = '1'
        else:
            linux = '0'

        data.append({
            'Title':title,
            'Time':st,
            'Win':win,
            'Mac':mac,
            'Linux':linux})

        temp_df = pd.DataFrame(data) #<-- temporary storing the data in a dataframe
        results_df = results_df.append(temp_df).reset_index(drop=True) #<-- dumping that data into a results dataframe


results_df.to_csv('data.csv', index=False) #<-- writing the results dataframe to csv

testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
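A side note on the answer above: `DataFrame.append` has since been deprecated and was removed in pandas 2.0. On a newer pandas, the same accumulation pattern can be sketched by collecting one DataFrame per page and concatenating once at the end (the `parse_page` stub below is hypothetical, standing in for the BeautifulSoup parsing of a single results page):

```python
import pandas as pd

def parse_page(page):
    # hypothetical stub standing in for the BeautifulSoup parsing of one page
    return [{'Title': f'game-{page}', 'Time': '', 'Win': '1', 'Mac': '0', 'Linux': '0'}]

frames = []
for page in range(1, 6):
    rows = parse_page(page)            # scrape one page into a list of dicts
    frames.append(pd.DataFrame(rows))  # one DataFrame per page, built once

results_df = pd.concat(frames, ignore_index=True)  # single concat at the end
results_df.to_csv('data.csv', index=False)
```

Building each page's frame once and calling `pd.concat` a single time also avoids the quadratic cost of re-appending the growing frame on every row.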

Output:

print (results_df)
     Linux Mac ...                                      Title Win
0        0   0 ...            Tom Clancy's Rainbow Six® Siege   1
1        0   0 ...            Tom Clancy's Rainbow Six® Siege   1
2        1   1 ...                    Total War: WARHAMMER II   1
3        0   0 ...            Tom Clancy's Rainbow Six® Siege   1
4        1   1 ...                    Total War: WARHAMMER II   1
5        0   1 ...                                  Frostpunk   1
6        0   0 ...            Tom Clancy's Rainbow Six® Siege   1
7        1   1 ...                    Total War: WARHAMMER II   1
8        0   1 ...                                  Frostpunk   1
9        1   1 ...                         Two Point Hospital   1
10       0   0 ...            Tom Clancy's Rainbow Six® Siege   1
11       1   1 ...                    Total War: WARHAMMER II   1
12       0   1 ...                                  Frostpunk   1
13       1   1 ...                         Two Point Hospital   1
14       0   0 ...                        Black Desert Online   1
15       0   0 ...            Tom Clancy's Rainbow Six® Siege   1
16       1   1 ...                    Total War: WARHAMMER II   1
17       0   1 ...                                  Frostpunk   1
18       1   1 ...                         Two Point Hospital   1
19       0   0 ...                        Black Desert Online   1
20       1   1 ...                       Kerbal Space Program   1
21       0   0 ...            Tom Clancy's Rainbow Six® Siege   1
22       1   1 ...                    Total War: WARHAMMER II   1
23       0   1 ...                                  Frostpunk   1
24       1   1 ...                         Two Point Hospital   1
25       0   0 ...                        Black Desert Online   1
26       1   1 ...                       Kerbal Space Program   1
27       1   1 ...                          BioShock Infinite   1
28       0   0 ...            Tom Clancy's Rainbow Six® Siege   1
29       1   1 ...                    Total War: WARHAMMER II   1
   ...  .. ...                                        ...  ..
1595     0   0 ...            VEGAS Pro 14 Edit Steam Edition   1
1596     0   0 ...                                       ABZU   1
1597     0   0 ...                              Sacred 2 Gold   1
1598     0   0 ...                              Sakura Bundle   1
1599     1   1 ...                                   Distance   1
1600     0   0 ...               LEGO® Batman™: The Videogame   1
1601     0   0 ...                               Sonic Forces   1
1602     0   0 ...                  The Stronghold Collection   1
1603     0   0 ...                                 Miscreated   1
1604     0   0 ...                         Batman™: Arkham VR   1
1605     1   1 ...                          Shadowrun Returns   1
1606     0   0 ...               Upgrade to VEGAS Pro 16 Edit   1
1607     0   0 ...               Girl Hunter VS Zombie Bundle   1
1608     0   1 ...                Football Manager 2019 Touch   1
1609     0   1 ...   Total War: NAPOLEON - Definitive Edition   1
1610     1   1 ...                           SteamWorld Dig 2   1
1611     0   0 ...                Condemned: Criminal Origins   1
1612     0   0 ...                          Company of Heroes   1
1613     0   0 ...           LEGO® Batman™ 2: DC Super Heroes   1
1614     1   1 ...         Euro Truck Simulator 2 Map Booster   1
1615     0   0 ...                         Sonic Adventure DX   1
1616     0   0 ...                           Worms Armageddon   1
1617     1   1 ...                       Unforeseen Incidents   1
1618     0   0 ...  Warhammer 40,000: Space Marine Collection   1
1619     0   0 ...            VEGAS Pro 14 Edit Steam Edition   1
1620     0   0 ...                                       ABZU   1
1621     0   0 ...                              Sacred 2 Gold   1
1622     0   0 ...                              Sakura Bundle   1
1623     1   1 ...                                   Distance   1
1624     0   0 ...                           Worms Revolution   1

[1625 rows x 5 columns]

0 votes

So I was clearly blind to my own code, which can happen when you stare at it all day. All I actually had to do was move `data = []` above the for loop so it wouldn't be reset on every iteration.
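For the record, a minimal sketch of that fix, keeping the original `csv.DictWriter` approach. The `scrape_page` stub is hypothetical, standing in for the requests/BeautifulSoup work on one page; the point is that `data` is created once before the loop and every page extends it, so the single write at the end sees all five pages:

```python
import csv

def scrape_page(page):
    # hypothetical stub standing in for fetching and parsing one results page
    return [{'Title': f'game-{page}-{i}', 'Win': '1', 'Mac': '0',
             'Linux': '0', 'Time': ''} for i in range(2)]

data = []                          # initialised ONCE, before the URL loop
for page in range(1, 6):
    data.extend(scrape_page(page))  # rows from every page accumulate here

with open('data.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Title', 'Win', 'Mac', 'Linux', 'Time'])
    writer.writeheader()
    writer.writerows(data)          # one write, containing all pages
```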

© www.soinside.com 2019 - 2024. All rights reserved.