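I'm trying to write a program, using bs4 or some other method, that can scrape data from the header page for a specific county, like this one: https://data.census.gov/profile/Grant_County,_New_Mexico?g=050XX00US35017#populations-and-people
I'd like to retrieve the nine statistics shown on the header page without having to go through their respective raw CSVs and recalculate the values. I've only ever scraped data from simple HTML Wikipedia pages, and I'm not sure how to do this when the numbers don't appear to be immediately visible in the page's HTML (maybe I need some Java knowledge here too...?). Thanks for any tips or nudges in the right direction!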
import requests
import pandas as pd
from bs4 import BeautifulSoup
Try something like this:
import requests
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
}
params = {
    # Grant County, New Mexico
    'g': '050XX00US35017',
    # Los Angeles County, California
    # 'g': '050XX00US06037',
}
response = requests.get(
    'https://data.census.gov/api/profile/content/highlights',
    params=params,
    headers=headers
)
print(json.dumps(response.json(), indent=2))
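The county is specified via the 'g' parameter in the params dictionary. You'll find this identifier in the URL of the profile page. For example, the URL in the question contains g=050XX00US35017.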
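If you would rather not copy the identifier by hand, it can be pulled out of a profile URL with the standard library. This is only a minimal sketch: the helper name geo_id_from_profile_url is made up for illustration, and it assumes the identifier is always carried in the g query parameter, as it is in the URL from the question.

```python
from urllib.parse import urlsplit, parse_qs


def geo_id_from_profile_url(url):
    """Return the value of the 'g' query parameter, or None if it is absent.

    Hypothetical helper: assumes the profile URL carries the geography
    identifier in its query string, as in the URL from the question.
    """
    query = parse_qs(urlsplit(url).query)
    values = query.get("g")
    return values[0] if values else None


url = "https://data.census.gov/profile/Grant_County,_New_Mexico?g=050XX00US35017"
params = {"g": geo_id_from_profile_url(url)}
print(params)
```

The resulting params dictionary can be passed to requests.get exactly as in the code above.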
This is what the response looks like for Grant County:
{
  "selectedProfile": {
    "label": "Grant County, New Mexico",
    "params": {
      "g": "050XX00US35017"
    }
  },
  "highlights": [
    {
      "format": "number",
      "value": "28185",
      "topic": "Populations and People",
      "label": "Total Population",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.P1"
    },
    {
      "format": "dollar",
      "value": "44895",
      "topic": "Income and Poverty",
      "label": "Median Household Income",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1901"
    },
    {
      "format": "percent",
      "value": "26.4",
      "topic": "Education",
      "label": "Bachelor's Degree or Higher",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1501"
    },
    {
      "format": "percent",
      "value": "41.9",
      "topic": "Employment",
      "label": "Employment Rate",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP03"
    },
    {
      "format": "number",
      "value": "14584",
      "topic": "Housing",
      "label": "Total Housing Units",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.H1"
    },
    {
      "format": "percent",
      "value": "5.1",
      "topic": "Health",
      "label": "Without Health Care Coverage",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S2701"
    },
    {
      "format": "number",
      "value": "534",
      "topic": "Business and Economy",
      "label": "Total Employer Establishments",
      "sourceLink": "https://www.census.gov/programs-surveys/cbp.html",
      "source": "2021 Economic Surveys Business Patterns",
      "tableId": "CBP2021.CB2100CBP"
    },
    {
      "format": "number",
      "value": "11292",
      "topic": "Families and Living Arrangements",
      "label": "Total Households",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP02"
    },
    {
      "format": "number",
      "value": "13466",
      "topic": "Race and Ethnicity",
      "label": "Hispanic or Latino (of any race)",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALDHC2020.P9"
    }
  ]
}
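To get from that JSON to the nine statistics themselves, something along these lines might work. It is a sketch rather than a definitive implementation: format_value and highlights_by_label are names invented here, the sample payload is a trimmed copy of the output above, and it assumes the format field only ever takes the three values seen there (number, dollar, percent). Anything else is passed through unchanged.

```python
def format_value(item):
    """Render one highlight's value according to its 'format' field.

    Assumption: only 'number', 'dollar' and 'percent' occur, as in the
    sample output; any other format is returned as the raw string.
    """
    raw = item.get("value")
    if raw is None:
        return None
    kind = item.get("format")
    if kind == "percent":
        return f"{raw}%"
    if kind == "dollar":
        return f"${float(raw):,.0f}"
    if kind == "number":
        return f"{float(raw):,.0f}"
    return raw


def highlights_by_label(payload):
    """Map each highlight's label to its formatted value."""
    return {item["label"]: format_value(item) for item in payload["highlights"]}


# Trimmed copy of the response shown above; in practice use response.json().
sample = {
    "highlights": [
        {"format": "number", "value": "28185", "label": "Total Population"},
        {"format": "dollar", "value": "44895", "label": "Median Household Income"},
        {"format": "percent", "value": "26.4", "label": "Bachelor's Degree or Higher"},
    ]
}

stats = highlights_by_label(sample)
for label, value in stats.items():
    print(f"{label}: {value}")
```

Since pandas is already imported in the question, pd.DataFrame(response.json()["highlights"]) would also give a table with one row per statistic.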
If you want to get the area from the descriptor paragraph (see the comments below), you can add the following:
params = {
    'g': '050XX00US35017',
    'includeHighlights': 'false',
}
response = requests.get(
    "https://data.census.gov/api/profile/metadata",
    params=params,
    headers=headers
)
print(response.json()["header"]["description"])
This returns the full paragraph text, so you'll need to use string manipulation to extract the area.
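As a rough sketch of that string manipulation, a regular expression can pick out the number. The exact wording of the paragraph isn't shown here, so this assumes the area appears as a number directly followed by the words "square miles". The sample sentence below is made up for illustration and is not real API output; extract_square_miles is likewise a hypothetical helper name.

```python
import re

# Assumption: the description contains something like "1,234.5 square miles".
AREA_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s+square miles", re.IGNORECASE)


def extract_square_miles(description):
    """Return the first '<number> square miles' figure as a float, or None."""
    match = AREA_PATTERN.search(description)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Made-up sentence for illustration only; in practice pass
# response.json()["header"]["description"] instead.
sample_description = "Example County has 1,234.5 square miles of land area."
print(extract_square_miles(sample_description))
```

If the real paragraph is phrased differently, the pattern will need adjusting. The function returns None rather than raising when nothing matches, so a mismatch is easy to spot.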