Scraping data from data.census.gov county profiles


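I am trying to write a program using bs4 (or another approach) that can scrape data from the headline page of a specific county, such as this one: https://data.census.gov/profile/Grant_County,_New_Mexico?g=050XX00US35017#populations-and-people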

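I would like to retrieve the nine statistics shown on the headline page without having to go through their respective raw CSVs and recompute the values. I have only ever scraped data from simple HTML Wikipedia pages, and I am not sure how to do this when the numbers do not appear to be directly visible in the page's HTML (perhaps I need some Java knowledge here too...?). Thanks for any hints or nudges in the right direction!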

import requests
import pandas as pd
from bs4 import BeautifulSoup
python selenium-webdriver web-scraping beautifulsoup census
1 Answer

Try something like this:

import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
}

params = {
    # Grant County, New Mexico
    'g': '050XX00US35017',
    # Los Angeles County, California
    # 'g': '050XX00US06037',
}

response = requests.get(
    'https://data.census.gov/api/profile/content/highlights',
    params=params,
    headers=headers
)

print(json.dumps(response.json(), indent=2))

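The county is specified via the g parameter in the params dictionary. You will find this identifier in the URL of the county's profile page, for example g=050XX00US35017 in the Grant County URL from the question.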
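
As a side note, the two identifiers used in the code above appear to follow the Census GEOID pattern for counties: the prefix 050XX00US, then the two-digit state FIPS code, then the three-digit county FIPS code. So 35017 is New Mexico (35) plus Grant County (017). Below is a minimal sketch for building the value, assuming that pattern holds for every county; county_geoid is a hypothetical helper name, not part of any Census library.

```python
def county_geoid(state_fips, county_fips):
    """Build the 'g' value for a county from its FIPS codes.

    Assumption: county identifiers are '050XX00US' followed by the
    2-digit state FIPS code and the 3-digit county FIPS code, as in
    the two examples shown in the answer.
    """
    state = str(state_fips).zfill(2)
    county = str(county_fips).zfill(3)
    if len(state) != 2 or len(county) != 3 or not (state + county).isdigit():
        raise ValueError(f"invalid FIPS codes: {state_fips!r}, {county_fips!r}")
    return f"050XX00US{state}{county}"


print(county_geoid(35, 17))    # Grant County, New Mexico -> 050XX00US35017
print(county_geoid("06", 37))  # Los Angeles County, California -> 050XX00US06037
```

The result can be passed as the 'g' entry of the params dictionary.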

This is the data returned for Grant County:

{
  "selectedProfile": {
    "label": "Grant County, New Mexico",
    "params": {
      "g": "050XX00US35017"
    }
  },
  "highlights": [
    {
      "format": "number",
      "value": "28185",
      "topic": "Populations and People",
      "label": "Total Population",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.P1"
    },
    {
      "format": "dollar",
      "value": "44895",
      "topic": "Income and Poverty",
      "label": "Median Household Income",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1901"
    },
    {
      "format": "percent",
      "value": "26.4",
      "topic": "Education",
      "label": "Bachelor's Degree or Higher",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1501"
    },
    {
      "format": "percent",
      "value": "41.9",
      "topic": "Employment",
      "label": "Employment Rate",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP03"
    },
    {
      "format": "number",
      "value": "14584",
      "topic": "Housing",
      "label": "Total Housing Units",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.H1"
    },
    {
      "format": "percent",
      "value": "5.1",
      "topic": "Health",
      "label": "Without Health Care Coverage",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S2701"
    },
    {
      "format": "number",
      "value": "534",
      "topic": "Business and Economy",
      "label": "Total Employer Establishments",
      "sourceLink": "https://www.census.gov/programs-surveys/cbp.html",
      "source": "2021 Economic Surveys Business Patterns",
      "tableId": "CBP2021.CB2100CBP"
    },
    {
      "format": "number",
      "value": "11292",
      "topic": "Families and Living Arrangements",
      "label": "Total Households",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP02"
    },
    {
      "format": "number",
      "value": "13466",
      "topic": "Race and Ethnicity",
      "label": "Hispanic or Latino (of any race)",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALDHC2020.P9"
    }
  ]
}
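
Since every entry in "highlights" carries a "label", a "value" string, and a "format" hint, the response can be reduced to a small label-to-text mapping. The following is a rough sketch, assuming the three format values seen above ("number", "dollar", "percent") and numeric value strings; summarize_highlights is a hypothetical helper name, and the sample is a trimmed copy of the output above.

```python
def summarize_highlights(payload):
    """Map each highlight label to a display string based on its 'format'."""
    summary = {}
    for item in payload.get("highlights", []):
        raw = item["value"]
        kind = item.get("format")
        if kind == "dollar":
            text = f"${float(raw):,.0f}"
        elif kind == "percent":
            text = f"{raw}%"
        elif kind == "number":
            text = f"{float(raw):,.0f}"
        else:
            # Unknown format: keep the raw string untouched.
            text = raw
        summary[item["label"]] = text
    return summary


# Trimmed copy of the Grant County response shown above.
sample = {
    "highlights": [
        {"format": "number", "value": "28185", "label": "Total Population"},
        {"format": "dollar", "value": "44895", "label": "Median Household Income"},
        {"format": "percent", "value": "26.4", "label": "Bachelor's Degree or Higher"},
    ]
}

for label, text in summarize_highlights(sample).items():
    print(f"{label}: {text}")
```

With the live request, summarize_highlights(response.json()) would be the equivalent call.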

If you want to get the area from the descriptor paragraph (see the comments below), you can add the following:

params = {
    'g': '050XX00US35017',
    'includeHighlights': 'false',
}

response = requests.get(
    "https://data.census.gov/api/profile/metadata",
    params=params,
    headers=headers
)

print(response.json()["header"]["description"])

This returns the full paragraph text; you will need to use string manipulation to extract the area.
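
One possible way to do that string manipulation is a regular expression. The sketch below assumes the description states the area as a number followed by "square miles"; the exact wording returned by the API is not shown here, so the sample sentence is made up and the pattern may need adjusting. extract_square_miles is a hypothetical helper name.

```python
import re


def extract_square_miles(description):
    """Return the first '<number> square miles' figure as a float, or None."""
    match = re.search(
        r"(\d[\d,]*(?:\.\d+)?)\s+square\s+miles",
        description,
        re.IGNORECASE,
    )
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Hypothetical text for illustration only, not actual API output.
sample = "Example County has 1,234.5 square miles of land area."
print(extract_square_miles(sample))
```

Returning None rather than raising keeps a loop over many counties going when a description lacks an area figure.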
