How do I get JSON-formatted data from a website into a DataFrame? That is, extract the values that follow the identifiers?


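I asked this question about how to extract data from a website into a Pandas (Python) DataFrame. It received a very helpful answer that worked for the data snippet I pasted directly into the code. The answer helped me understand that the data on the website is in JSON format. I thought I had things figured out, but I still cannot extract the data directly from the website. I would be grateful if someone could help me find where I am going wrong.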

What I need is only the property names and values in a Pandas DataFrame. For example:

CH4   Methane                  
    Property1       5.00000                                    
    Property2       20.00000                     
    Property3       500.66500                              
    Property4       100.00000                                           
    ...

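I cannot post the website link directly, but I hope that posting a more detailed example will help.

First, here is the error I got when I used this suggestion: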

Traceback (most recent call last):
  File "/Users/me , in <module>

    data = json.load(StringIO(rawstr))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So I searched online for how to import JSON-formatted data from a website and tried this:

import json
import pandas as pd
from io import StringIO
import requests     

url = "https://testurl.com"
response = requests.request("GET", url)

data = response.json()
print(type(data))  # <class 'dict'>
data = json.dumps(data) # Convert Set to JSON string.
print(data) # Works up to here to import raw data from website.

cdata = []
for v in data["blob"]["rawLines"]:
    vals = [x for x in v.strip().split(' ') if x != '']
    cdata.append([vals[0], " ".join(vals[1:])])

cdata = []
for v in data:
    vals = [x for x in v.strip().split(' ') if x != '']
    cdata.append([vals[0], " ".join(vals[1:])])

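As the comment in the code above shows (# Works up to here to import raw data from website.), I am able to import the raw (ugly) data, but I just cannot work out how to extract only the values I need. I bet my indexing is wrong. Here is what I get (below). It is shown as one very long string because I wanted to show exactly what I see; however, if line breaks would be better, I am happy to edit it: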

{"payload": {"allShortcutsEnabled": false, "fileTree": {"": {"items": [{"name": "thing", "path": "thing", "contentType": "directory"}, {"name": ".repurlignore", "path": ".repurlignore", "contentType": "file"}, {"name": "README.md", "path": "README.md", "contentType": "file"}, {"name": "thing2", "path": "thing2", "contentType": "file"}, {"name": "thing3", "path": "thing3", "contentType": "file"}, {"name": "thing4", "path": "thing4", "contentType": "file"}, {"name": "thing5", "path": "thing5", "contentType": "file"}, {"name": "thing6", "path": "thing6", "contentType": "file"}, {"name": "thing7", "path": "thing7", "contentType": "file"}, {"name": "thing8", "path": "thing8", "contentType": "file"}, {"name": "thing9", "path": "thing9", "contentType": "file"}, {"name": "thing10", "path": "thing10", "contentType": "file"}, {"name": "thing11", "path": "thing11", "contentType": "file"}], "totalCount": 500}}, "fileTreeProcessingTime": 5.262188, "foldersToFetch": [], "reducedMotionEnabled": null, "repo": {"id": 1234567, "defaultBranch": "main", "name": "repository", "ownerLogin": "contributor", "currentUserCanPush": false, "isFork": false, "isEmpty": false, "createdAt": "2023-10-31", "ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1”, "public": true, "private": false, "isOrgOwned": false}, "symbolsExpanded": false, "treeExpanded": true, "refInfo": {"name": "main", "listCacheKey": "v0:13579”, "canEdit": false, "refType": "branch", "currentOid": “identifier”2}, "path": "thing2", "currentUser": null, "blob": {"rawLines": ["        C_1H_4   Methane                  ", "            5.00000        Property1                             ", "             20.00000        Property2                     ", "           500.66500        Property3                              ", "           100.00000        Property4                                           ", "         -4453.98887        Property5                                      ", "           100.48200        
Property6                                   ", "            59.75258        Property7                                         ", "             5.33645        Property8                                         ", "             0.00000        Property9         ", "           645.07777        Property10                                       ", "             0.00000        Property11                           ", "             0.00000        Property12                           ", "             0.00000        Property13                             ", "             0.00000        Property14                             ", "             0.00000        Property15                             ", "             0.00000        Property16                             ", "             0.00000        Property17                   ", "             0.00000        Property18                            ", "             0.00000        Property19                   ", "             0.00000        Property20                             ", "             0.00000        Property21                   ", "             0.00000        Property22                             ", "             0.00000        Property23                   ", "             0.00000        Property24                    ", "             0.00000        Property25                    ", "             0.57876        Property26                                           ", "             4.00000        Property27                                               ", "             0.00000        Property28                    ", "             0.00000        Property29               ", "             0.00000        Property30                  ", "             0.00000        Property31            ", "             0.00000        Property32                  ", "             1.00000        Property33                         ", "             0.00000        Property34                       ", "            26.00000        Property35                  
           ", "             1.44571        Property36                               ", "             1.08756        Property37                            ", "             0.00000        Property38                          ", "             0.00000        Property39                        ", "             0.00000        Property40                        ", "             6.00000        Property41                       ", "             9.00000        Property42                                         ", "             0.00000        Property43                                         "], "stylingDirectives": [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []], "csv": null, "csvError": null, "dependabotInfo": {"showConfigurationBanner": false, "configFilePath": null, "networkDependabotPath": "/contributor/repository/network/updates", "dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice", "configurationNoticeDismissed": null, "repoAlertsPath": "/contributor/repository/security/dependabot", "repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis", "repoOwnerIsOrg": false, "currentUserCanAdminRepo": false}, "displayName": "thing2", "displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true", "headerInfo": {"blobSize": "3.37 KB", "deleteInfo": {"deleteTooltip": "You must be signed in to make or propose changes"}, "editInfo": {"editTooltip": “XXX”}, "ghDesktopPath": "https://desktop.repurl.com", "repurlLfsPath": null, "onBranch": true, "shortPath": “5678”, "siteNavLoginPath": "/login?return_to=identifier”, "isCSV": false, "isRichtext": false, "toc": null, "lineInfo": {"truncatedLoc": “33”, "truncatedSloc": “33”}, "mode": "executable file"}, "image": false, "isCodeownersFile": null, "isPlain": false, "isValidLegacyIssueTemplate": false, "issueTemplateHelpUrl": 
"https://docs.repurl.com/articles/about-issue", "issueTemplate": null, "discussionTemplate": null, "language": null, "languageID": null, "large": false, "loggedIn": false, "newDiscussionPath": "/contributor/repository/issues/new", "newIssuePath": "/contributor/repository/issues/new", "planSupportInfo": {"repoOption1”: null, "repoOption2”: null, "requestFullPath": "/contributor/repository/blob/main/thing2", "repoOption4”: null, "repoOption5”: null, "repoOption6”: null, "repoOption7”: null}, "repoOption8”: {"repoOption9”: "/settings/dismiss-notice/repoOption10”, "repoOption11”: "/settings/dismiss", "releasePath": "/contributor/repository/releases/new=true", "repoOption11: false, "repoOption12”: false}, "rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2", "repoOption13”: false, "richText": null, "renderedFileInfo": null, "shortPath": null, "tabSize": 8, "topBannersInfo": {"overridingGlobalFundingFile": false, "universalPath": null, "repoOwner": "contributor", "repoName": "repository", "repoOption14”: false, "citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about", "repoOption15”: false, "repoOption16”: null}, "truncated": false, "viewable": true, "workflowRedirectUrl": null, "symbols": {"timedOut": false, "notAnalyzed": true, "symbols": []}}, "collabInfo": null, "collabMod": false, “wtsdf_signifier”: {"/contributor/repository/branches": {"post": “identifier”}, "/repos/preferences": {"post": “identifier”}}}, "title": "repository/thing2 at main \u0000 contributor/repository"}
python json python-3.x pandas file-io
1 Answer

Invalid JSON

First, your JSON sample is invalid. There are plenty of JSON checkers and formatters online; remember to run your sample through one next time.
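The same check can also be done locally with the standard library. The following is a minimal sketch, not part of the original answer: the helper name check_json and the two sample fragments are made up for illustration, with the bad one containing a curly closing quote like those in the question's sample.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are the 1-based position of the first problem found
        return f"{err.msg} at line {err.lineno}, column {err.colno}"
    return None


# Hypothetical fragments: the first ends a string with a curly quote (U+201D)
bad = '{"listCacheKey": "v0:13579\u201d, "canEdit": false}'
good = '{"listCacheKey": "v0:13579", "canEdit": false}'

print(check_json(bad))   # prints an error description
print(check_json(good))  # prints None
```

The parser stops at the first problem, so a sample with several errors has to be checked again after each fix.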

I spent some time cleaning it up; here is the cleaned version:

{
    "payload": {
        "allShortcutsEnabled": false,
        "fileTree": {
            "": {
                "items": [
                    {
                        "name": "thing",
                        "path": "thing",
                        "contentType": "directory"
                    },
                    {
                        "name": ".repurlignore",
                        "path": ".repurlignore",
                        "contentType": "file"
                    },
                    {
                        "name": "README.md",
                        "path": "README.md",
                        "contentType": "file"
                    },
                    {
                        "name": "thing2",
                        "path": "thing2",
                        "contentType": "file"
                    },
                    {
                        "name": "thing3",
                        "path": "thing3",
                        "contentType": "file"
                    },
                    {
                        "name": "thing4",
                        "path": "thing4",
                        "contentType": "file"
                    },
                    {
                        "name": "thing5",
                        "path": "thing5",
                        "contentType": "file"
                    },
                    {
                        "name": "thing6",
                        "path": "thing6",
                        "contentType": "file"
                    },
                    {
                        "name": "thing7",
                        "path": "thing7",
                        "contentType": "file"
                    },
                    {
                        "name": "thing8",
                        "path": "thing8",
                        "contentType": "file"
                    },
                    {
                        "name": "thing9",
                        "path": "thing9",
                        "contentType": "file"
                    },
                    {
                        "name": "thing10",
                        "path": "thing10",
                        "contentType": "file"
                    },
                    {
                        "name": "thing11",
                        "path": "thing11",
                        "contentType": "file"
                    }
                ],
                "totalCount": 500
            }
        },
        "fileTreeProcessingTime": 5.262188,
        "foldersToFetch": [],
        "reducedMotionEnabled": null,
        "repo": {
            "id": 1234567,
            "defaultBranch": "main",
            "name": "repository",
            "ownerLogin": "contributor",
            "currentUserCanPush": false,
            "isFork": false,
            "isEmpty": false,
            "createdAt": "2023-10-31",
            "ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1",
            "public": true,
            "private": false,
            "isOrgOwned": false
        },
        "symbolsExpanded": false,
        "treeExpanded": true,
        "refInfo": {
            "name": "main",
            "listCacheKey": "v0:13579",
            "canEdit": false,
            "refType": "branch",
            "currentOid": "identifier"
        },
        "path": "thing2",
        "currentUser": null,
        "blob": {
            "rawLines": [
                "        C_1H_4   Methane                  ",
                "            5.00000        Property1                             ",
                "             20.00000        Property2                     ",
                "           500.66500        Property3                              ",
                "           100.00000        Property4                                           ",
                "         -4453.98887        Property5                                      ",
                "           100.48200        Property6                                   ",
                "            59.75258        Property7                                         ",
                "             5.33645        Property8                                         ",
                "             0.00000        Property9         ",
                "           645.07777        Property10                                       ",
                "             0.00000        Property11                           ",
                "             0.00000        Property12                           ",
                "             0.00000        Property13                             ",
                "             0.00000        Property14                             ",
                "             0.00000        Property15                             ",
                "             0.00000        Property16                             ",
                "             0.00000        Property17                   ",
                "             0.00000        Property18                            ",
                "             0.00000        Property19                   ",
                "             0.00000        Property20                             ",
                "             0.00000        Property21                   ",
                "             0.00000        Property22                             ",
                "             0.00000        Property23                   ",
                "             0.00000        Property24                    ",
                "             0.00000        Property25                    ",
                "             0.57876        Property26                                           ",
                "             4.00000        Property27                                               ",
                "             0.00000        Property28                    ",
                "             0.00000        Property29               ",
                "             0.00000        Property30                  ",
                "             0.00000        Property31            ",
                "             0.00000        Property32                  ",
                "             1.00000        Property33                         ",
                "             0.00000        Property34                       ",
                "            26.00000        Property35                             ",
                "             1.44571        Property36                               ",
                "             1.08756        Property37                            ",
                "             0.00000        Property38                          ",
                "             0.00000        Property39                        ",
                "             0.00000        Property40                        ",
                "             6.00000        Property41                       ",
                "             9.00000        Property42                                         ",
                "             0.00000        Property43                                         "
            ],
            "stylingDirectives": [
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                []
            ],
            "csv": null,
            "csvError": null,
            "dependabotInfo": {
                "showConfigurationBanner": false,
                "configFilePath": null,
                "networkDependabotPath": "/contributor/repository/network/updates",
                "dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice",
                "configurationNoticeDismissed": null,
                "repoAlertsPath": "/contributor/repository/security/dependabot",
                "repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis",
                "repoOwnerIsOrg": false,
                "currentUserCanAdminRepo": false
            },
            "displayName": "thing2",
            "displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true",
            "headerInfo": {
                "blobSize": "3.37 KB",
                "deleteInfo": {
                    "deleteTooltip": "You must be signed in to make or propose changes"
                },
                "editInfo": {
                    "editTooltip": "XXX"
                },
                "ghDesktopPath": "https://desktop.repurl.com",
                "repurlLfsPath": null,
                "onBranch": true,
                "shortPath": "5678",
                "siteNavLoginPath": "/login?return_to=identifier",
                "isCSV": false,
                "isRichtext": false,
                "toc": null,
                "lineInfo": {
                    "truncatedLoc": "33",
                    "truncatedSloc": "33"
                },
                "mode": "executable file"
            },
            "image": false,
            "isCodeownersFile": null,
            "isPlain": false,
            "isValidLegacyIssueTemplate": false,
            "issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue",
            "issueTemplate": null,
            "discussionTemplate": null,
            "language": null,
            "languageID": null,
            "large": false,
            "loggedIn": false,
            "newDiscussionPath": "/contributor/repository/issues/new",
            "newIssuePath": "/contributor/repository/issues/new",
            "planSupportInfo": {
                "repoOption1": null,
                "repoOption2": null,
                "requestFullPath": "/contributor/repository/blob/main/thing2",
                "repoOption4": null,
                "repoOption5": null,
                "repoOption6": null,
                "repoOption7": null
            },
            "repoOption8": {
                "repoOption9": "/settings/dismiss-notice/repoOption10",
                "releasePath": "/contributor/repository/releases/new=true",
                "repoOption11": false,
                "repoOption12": false
            },
            "rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2",
            "repoOption13": false,
            "richText": null,
            "renderedFileInfo": null,
            "shortPath": null,
            "tabSize": 8,
            "topBannersInfo": {
                "overridingGlobalFundingFile": false,
                "universalPath": null,
                "repoOwner": "contributor",
                "repoName": "repository",
                "repoOption14": false,
                "citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about",
                "repoOption15": false,
                "repoOption16": null
            },
            "truncated": false,
            "viewable": true,
            "workflowRedirectUrl": null,
            "symbols": {
                "timedOut": false,
                "notAnalyzed": true,
                "symbols": []
            }
        },
        "collabInfo": null,
        "collabMod": false,
        "wtsdf_signifier": {
            "/contributor/repository/branches": {
                "post": "identifier"
            },
            "/repos/preferences": {
                "post": "identifier"
            }
        }
    },
    "title": "repository/thing2 at main \\u0000 contributor/repository"
}

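Errors in the original sample:

  • Invalid characters: the curly quotes “ and ” → replaced with "
  • "currentOid": “identifier”2 → "currentOid": "identifier"
  • "repoOption11 → missing closing "
  • Error: Duplicate key 'repoOption11' → removed the first occurrence, the one with a non-boolean value, judging by the other repoOptionXX keys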
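Two of these problems can be caught in code. The sketch below is an illustration rather than the method used for the cleanup above: it maps curly double quotes to straight ones before parsing and uses the object_pairs_hook parameter of json.loads to refuse duplicate keys (by default Python's json module silently keeps the last value). The names load_strict and reject_duplicates and the sample strings are assumptions. Blindly replacing quotes would also alter curly quotes that legitimately appear inside string values, so this only suits data like the sample, which has none.

```python
import json

# Map typographic double quotes to the ASCII quote that JSON requires
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def reject_duplicates(pairs):
    """object_pairs_hook that raises on a repeated key within one object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"Duplicate key {key!r}")
        seen[key] = value
    return seen


def load_strict(text):
    """Parse text after normalising quotes, refusing duplicate keys."""
    return json.loads(text.translate(CURLY_TO_STRAIGHT),
                      object_pairs_hook=reject_duplicates)


# Hypothetical fragments modelled on the question's sample
sample = '{"shortPath": \u201c5678\u201d, "isCSV": false}'
print(load_strict(sample))

duplicated = '{"repoOption11": "/settings/dismiss", "repoOption11": false}'
try:
    load_strict(duplicated)
except ValueError as err:
    print(err)
```

The stray 2 after the currentOid value and the missing closing quote still need a manual fix, since there is no safe automatic rule for them.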

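How to load it?

After saving the cleaned JSON to a file:

  • Read the JSON file
  • Get the rawLines part
  • Strip all useless whitespace from the list of strings
  • Split each entry in the list into a key/value pair of a dictionary
  • Load the dictionary with pandas

Here is the code I wrote to do this; it could be optimised and simplified a lot: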

import json
import pandas as pd

with open("yourJson.json", "r") as f:
    data = json.load(f)

# Get what we want to extract from the json
to_extract = data["payload"]["blob"]["rawLines"]

# Remove useless whitespace
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]

# Transform the list of string to a dict
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}

# Load the dict with pandas
df = pd.DataFrame(as_dict.items(), columns=['Value', 'Property'])

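With this code I managed to get the pandas DataFrame below, which you should be able to work from. Note that the dictionary is keyed by the value, so lines that share a value (such as the many 0.00000 entries) collapse into a single row holding the last property seen; that is why only 19 of the 44 lines appear.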

          Value    Property
0        C_1H_4     Methane
1       5.00000   Property1
2      20.00000   Property2
3     500.66500   Property3
4     100.00000   Property4
5   -4453.98887   Property5
6     100.48200   Property6
7      59.75258   Property7
8       5.33645   Property8
9       0.00000  Property43
10    645.07777  Property10
11      0.57876  Property26
12      4.00000  Property27
13      1.00000  Property33
14     26.00000  Property35
15      1.44571  Property36
16      1.08756  Property37
17      6.00000  Property41
18      9.00000  Property42
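To keep one row per line regardless of repeated values, a list of (property, value) tuples can be used in place of the dictionary. This is a minimal sketch under a few assumptions: the function name raw_lines_to_frame is made up, the first line is treated as a header, every other line is assumed to be a number followed by a property name, and sample_lines is a shortened stand-in for data["payload"]["blob"]["rawLines"].

```python
import pandas as pd


def raw_lines_to_frame(raw_lines):
    """Turn 'value  PropertyN' lines into a (title, DataFrame) pair."""
    # The first line is the header, e.g. "C_1H_4   Methane"
    title = " ".join(raw_lines[0].split())
    rows = []
    for line in raw_lines[1:]:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # parts[0] is the number; the rest is the property name
        rows.append((" ".join(parts[1:]), float(parts[0])))
    return title, pd.DataFrame(rows, columns=["Property", "Value"])


# Shortened stand-in for data["payload"]["blob"]["rawLines"]
sample_lines = [
    "        C_1H_4   Methane                  ",
    "            5.00000        Property1                ",
    "             0.00000        Property9         ",
    "             0.00000        Property11              ",
]
title, df = raw_lines_to_frame(sample_lines)
print(title)
print(df)
```

If the live response has the same shape as the sample in the question, the list could presumably be taken straight from response.json()["payload"]["blob"]["rawLines"] in the question's code, skipping the json.dumps step, which turns the dict back into a string that cannot be indexed by key.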