How to scrape data efficiently and clean it

Problem description (votes: 0, answers: 1)

I scraped data from a website, but I can't clean it. This is the code I used to scrape the data. Is this best practice?

import json

import requests
from bs4 import BeautifulSoup

all_countries_links = []
countries = []
all_data = []
page1 = requests.get("https://data.un.org/")


def main(page):
    # Collect the link and the name of every country listed on the home page.
    soup = BeautifulSoup(page.content, 'lxml')
    # attrs is a dict that maps the attribute name to the value to match.
    all_page = soup.find("div", {"class": "CountryList"}).find_all('a', href=True)
    for link in all_page:
        all_countries_links.append(link['href'])
        countries.append(link.text.strip())


def scrape_country(all_countries_links, countries):
    # Only the first two countries are fetched.
    for country in all_countries_links[:2]:
        page2 = requests.get(f"https://data.un.org/{country}")
        soup = BeautifulSoup(page2.content, 'lxml')
        all_page = soup.find('ul', {'class': 'pure-menu-list'})
        # Each child of the list becomes one flattened text entry.
        for table in all_page.contents:
            all_data.append(table.text.strip())


main(page1)
scrape_country(all_countries_links, countries)
file_path = "data.json"
with open(file_path, 'w') as f:
    json.dump(all_data, f, indent=4)
print(f"Data saved to {file_path}")

Here is a small sample of the collected data

[
    "",
    "General Information\n\nRegion\u00a0\n\u00a0\nSouthern Asia\nPopulation\u00a0(000, 2021)\n\u00a0\n39 835a\nPop. density\u00a0(per km2, 2021)\n\u00a0\n61a\nCapital city\u00a0\n\u00a0\nKabul\nCapital city pop.\u00a0(000, 2021)\n\u00a0\n4 114.0b\nUN membership date\u00a0\n\u00a0\n19-Nov-46\nSurface area\u00a0(km2)\n\u00a0\n652 864b\nSex ratio\u00a0(m per 100 f)\n\u00a0\n105.3a\nNational currency\u00a0\n\u00a0\nAfghani (AFN)\nExchange rate\u00a0(per US$)\n\u00a0\n77.1c",   
]

I tried to separate the data with this code

cleaned_data =[]

# for line in cleaned_data:
#     print(line.split('\n'))
# new_data = [line for line in all_data.split()]

for line in all_data[:1]:
    for line2 in line.split():
        # split() never yields leading spaces, so compare against bare words.
        if line2 not in ["General", "Information", "Economic", "indicators", "Social"]:
            cleaned_data.append(line2)

But I would like to find a better way
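
For the flattened string in the sample above, one possible approach is to split on newlines rather than on every space, so that multi-word labels and values stay together. This is a minimal sketch, not a definitive solution: `parse_block` is a hypothetical helper name, and it assumes that each block starts with a title line and that every label is followed by exactly one non-empty value, as in the sample. A missing value would shift the pairing.

```python
def parse_block(text):
    """Split one flattened block into (title, {label: value}).

    Assumes the layout seen in the sample: a title line, then
    alternating label and value lines separated by filler lines
    that contain only a non-breaking space.
    """
    # Replace non-breaking spaces, trim each line, and drop empty lines.
    lines = [ln.replace("\u00a0", " ").strip() for ln in text.split("\n")]
    lines = [ln for ln in lines if ln]
    if not lines:
        return "", {}
    title, rest = lines[0], lines[1:]
    # Pair every label (even positions) with the value that follows it.
    return title, dict(zip(rest[0::2], rest[1::2]))


sample = (
    "General Information\n\nRegion\u00a0\n\u00a0\nSouthern Asia\n"
    "Population\u00a0(000, 2021)\n\u00a0\n39 835a\n"
    "Capital city\u00a0\n\u00a0\nKabul"
)
title, fields = parse_block(sample)
print(title)
print(fields)
```

Splitting on newlines keeps "Southern Asia" and "39 835a" as single values, which `line.split()` would break apart.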

python json database beautifulsoup data-cleaning
1 answer
0 votes

For this kind of task I recommend the pandas.read_html() function:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

country_url = "https://data.un.org/en/iso/af.html"

soup = BeautifulSoup(requests.get(country_url).content, "html.parser")

for table in soup.select("details table"):
    summary = table.find_previous("summary").text
    df = pd.read_html(StringIO(str(table)))[0]
    df["table_name"] = summary
    print(df)
    print("-" * 80)
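
If the per-table frames are needed as one dataset rather than printed one at a time, they could be stacked and reshaped. The sketch below is one way to do that under stated assumptions: `to_long` is a hypothetical helper, the frames are hand-built stand-ins for what the loop above produces, and every frame is assumed to share the same year columns plus `Unnamed: 0` and `table_name`. A table with a different layout would need separate handling.

```python
import pandas as pd


def to_long(frames):
    """Stack the frames and return one row per (indicator, year)."""
    combined = pd.concat(frames, ignore_index=True)
    # "Unnamed: 0" is the name pandas gives to a header cell with no text.
    combined = combined.rename(columns={"Unnamed: 0": "indicator"})
    return combined.melt(
        id_vars=["indicator", "table_name"],
        var_name="year",
        value_name="value",
    )


# Hand-built stand-ins for two of the scraped tables (values from the output).
economic = pd.DataFrame({
    "Unnamed: 0": ["GDP per capita (current US$)"],
    "2010": ["503.6"], "2015": ["543.8"], "2021": ["469.9b"],
    "table_name": ["Economic indicators"],
})
social = pd.DataFrame({
    "Unnamed: 0": ["Urban population (% of total population)"],
    "2010": ["23.7"], "2015": ["24.8"], "2021": ["25.8b"],
    "table_name": ["Social indicators"],
})

long_df = to_long([economic, social])
print(long_df)
```

The long layout has one value per row, which tends to make later filtering by year or indicator simpler than the wide layout.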

Prints:


...

--------------------------------------------------------------------------------
                                               Unnamed: 0         2010         2015          2021           table_name
0       GDP: Gross domestic product (million current US$)       14 699       18 713       17 877b  Economic indicators
1          GDP growth rate (annual %, const. 2015 prices)          5.2         -1.4            4b  Economic indicators
2                            GDP per capita (current US$)        503.6        543.8        469.9b  Economic indicators
3           Economy: Agriculture (% of Gross Value Added)         33.2         27.3       26.9d,b  Economic indicators
4              Economy: Industry (% of Gross Value Added)           13         10.8     12.8e,f,b  Economic indicators
5         Economy: Services and other activity (% of GVA)         53.8         61.9       60.4g,b  Economic indicators
6              Employment in agricultureh (% of employed)         54.7         47.1         42.4c  Economic indicators
7                 Employment in industryh (% of employed)         14.4           17         18.3c  Economic indicators
8                    Employment in servicesh (% employed)         30.9         35.8         39.4c  Economic indicators
9                       Unemploymenth (% of labour force)         11.5         11.4         11.2c  Economic indicators
10  Labour force participation rateh (female/male pop. %)  14.9 / 78.4  18.8 / 76.2  21.8 / 74.6c  Economic indicators
11                   CPI: Consumer Price Index (2010=100)          100          133          150b  Economic indicators
12          Agricultural production index (2014-2016=100)           93           96          111b  Economic indicators
13     International trade: exports (million current US$)          388          571      1 022h,c  Economic indicators
14     International trade: imports (million current US$)        5 154        7 723      9 683h,c  Economic indicators
15     International trade: balance (million current US$)      - 4 766      - 7 151    - 8 661h,c  Economic indicators
16     Balance of payments, current account (million US$)         -578      - 4 193      - 3 137c  Economic indicators
--------------------------------------------------------------------------------
                                                          Unnamed: 0          2010          2015           2021         table_name
0                         Population growth ratei (average annual %)           2.6           3.3           2.5c  Social indicators
1                           Urban population (% of total population)          23.7          24.8          25.8b  Social indicators
2                   Urban population growth ratei (average annual %)           3.7             4            ...  Social indicators
3                     Fertility rate, totali (live births per woman)           6.5           5.4           4.6c  Social indicators
4                   Life expectancy at birthi (females/males, years)   61.0 / 58.3   63.8 / 60.9   65.8 / 62.8c  Social indicators
5                Population age distribution (0-14/60+ years old, %)    48.2 / 3.9    44.9 / 4.0    41.2 / 4.3a  Social indicators
6                 International migrant stockj (000/% of total pop.)   102.3 / 0.4   339.4 / 1.0   144.1 / 0.4c  Social indicators
7                      Refugees and others of concern to UNHCR (000)      1 200.0k       1 421.4       2 802.9c  Social indicators
8                     Infant mortality ratei (per 1 000 live births)          72.2          60.1          51.7c  Social indicators
9                             Health: Current expenditure (% of GDP)           8.6          10.1           9.4l  Social indicators
10                               Health: Physicians (per 1 000 pop.)           0.2           0.3           0.3m  Social indicators
11                      Education: Government expenditure (% of GDP)           3.5           3.3         4.1h,n  Social indicators
12          Education: Primary gross enrol. ratio (f/m per 100 pop.)  80.6 / 118.6  83.5 / 122.7  82.9 / 124.2l  Social indicators
13        Education: Secondary gross enrol. ratio (f/m per 100 pop.)   33.3 / 66.9   36.8 / 65.9   40.0 / 70.1l  Social indicators
14  Education: Upper secondary gross enrol. ratio (f/m per 100 pop.)   17.8 / 42.7   27.1 / 52.6   28.5 / 52.4l  Social indicators
15                      Intentional homicide rate (per 100 000 pop.)           3.4           9.8           6.7l  Social indicators
16                   Seats held by women in national parliaments (%)          27.3          27.7            27o  Social indicators
--------------------------------------------------------------------------------

...
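
The values in the printed tables still carry footnote letters (for example `17 877b` or `12.8e,f,b`) and spaces as thousands separators, so they are strings rather than numbers. The following sketch shows one way they might be converted. `parse_number` and `parse_values` are hypothetical helpers, and the pattern assumes that footnote markers are single lowercase letters separated by commas at the end of a value, as in the output above.

```python
import re

# Trailing footnote markers such as "b" or "e,f,b" (assumed format).
FOOTNOTE = re.compile(r"[a-z](?:,[a-z])*$")


def parse_number(raw):
    """Return the value as a float, or None when it is not numeric."""
    text = FOOTNOTE.sub("", raw.strip())
    # Remove spaces used as thousands separators or after a minus sign.
    text = text.replace("\u00a0", "").replace(" ", "")
    try:
        return float(text)
    except ValueError:
        return None


def parse_values(raw):
    """Split cells such as "21.8 / 74.6c" and parse each part."""
    return [parse_number(part) for part in raw.split("/")]


for cell in ["17 877b", "12.8e,f,b", "- 4 766", "...", "21.8 / 74.6c"]:
    print(repr(cell), "->", parse_values(cell))
```

A helper like this could be applied to a column with `Series.map`. It does not touch the row labels, where footnote letters are fused to words (for example `Unemploymenth`) and cannot be separated reliably by a pattern alone.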