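I asked this question about how to extract data from a website into a Pandas (Python) DataFrame. There was a very helpful answer that worked for the data snippet I pasted directly into my code. The answer helped me understand that the data on the website is in JSON format. I thought I had figured things out, but I still cannot extract the data directly from the website. I would be grateful if someone could help me find where I went wrong.
What I need is only the property names and values in a Pandas DataFrame. For example: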
CH4 Methane
Property1 5.00000
Property2 20.00000
Property3 500.66500
Property4 100.00000
...
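I cannot post the website link directly, but I hope that posting a more detailed example helps.
First, here is the error I got when using the suggestion from that answer: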
Traceback (most recent call last):
File "/Users/me , in <module>
data = json.load(StringIO(rawstr))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I searched online for how to import JSON-formatted data from a website and tried this:
import json
import pandas as pd
from io import StringIO
import requests
url = "https://testurl.com"
response = requests.request("GET", url)
data = response.json()
print(type(data)) # <class 'dict'>
data = json.dumps(data) # Convert Set to JSON string.
print(data) # Works up to here to import raw data from website.
cdata = []
for v in data["blob"]["rawLines"]:
    vals = [x for x in v.strip().split(' ') if x != '']
    cdata.append([vals[0], " ".join(vals[1:])])
cdata = []
for v in data:
    vals = [x for x in v.strip().split(' ') if x != '']
    cdata.append([vals[0], " ".join(vals[1:])])
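As the comment in the code above indicates (# Works up to here to import raw data from website.), I am able to import the raw (ugly) data, but I just cannot figure out how to extract only the values I need. I bet something is wrong with my indexing. Here is what I get (below). It appears as one very long string because I wanted to show exactly what I see; however, I am happy to edit it if line breaks would be better: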
{"payload": {"allShortcutsEnabled": false, "fileTree": {"": {"items": [{"name": "thing", "path": "thing", "contentType": "directory"}, {"name": ".repurlignore", "path": ".repurlignore", "contentType": "file"}, {"name": "README.md", "path": "README.md", "contentType": "file"}, {"name": "thing2", "path": "thing2", "contentType": "file"}, {"name": "thing3", "path": "thing3", "contentType": "file"}, {"name": "thing4", "path": "thing4", "contentType": "file"}, {"name": "thing5", "path": "thing5", "contentType": "file"}, {"name": "thing6", "path": "thing6", "contentType": "file"}, {"name": "thing7", "path": "thing7", "contentType": "file"}, {"name": "thing8", "path": "thing8", "contentType": "file"}, {"name": "thing9", "path": "thing9", "contentType": "file"}, {"name": "thing10", "path": "thing10", "contentType": "file"}, {"name": "thing11", "path": "thing11", "contentType": "file"}], "totalCount": 500}}, "fileTreeProcessingTime": 5.262188, "foldersToFetch": [], "reducedMotionEnabled": null, "repo": {"id": 1234567, "defaultBranch": "main", "name": "repository", "ownerLogin": "contributor", "currentUserCanPush": false, "isFork": false, "isEmpty": false, "createdAt": "2023-10-31", "ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1”, "public": true, "private": false, "isOrgOwned": false}, "symbolsExpanded": false, "treeExpanded": true, "refInfo": {"name": "main", "listCacheKey": "v0:13579”, "canEdit": false, "refType": "branch", "currentOid": “identifier”2}, "path": "thing2", "currentUser": null, "blob": {"rawLines": [" C_1H_4 Methane ", " 5.00000 Property1 ", " 20.00000 Property2 ", " 500.66500 Property3 ", " 100.00000 Property4 ", " -4453.98887 Property5 ", " 100.48200 Property6 ", " 59.75258 Property7 ", " 5.33645 Property8 ", " 0.00000 Property9 ", " 645.07777 Property10 ", " 0.00000 Property11 ", " 0.00000 Property12 ", " 0.00000 Property13 ", " 0.00000 Property14 ", " 0.00000 Property15 ", " 0.00000 Property16 ", " 0.00000 Property17 ", " 0.00000 
Property18 ", " 0.00000 Property19 ", " 0.00000 Property20 ", " 0.00000 Property21 ", " 0.00000 Property22 ", " 0.00000 Property23 ", " 0.00000 Property24 ", " 0.00000 Property25 ", " 0.57876 Property26 ", " 4.00000 Property27 ", " 0.00000 Property28 ", " 0.00000 Property29 ", " 0.00000 Property30 ", " 0.00000 Property31 ", " 0.00000 Property32 ", " 1.00000 Property33 ", " 0.00000 Property34 ", " 26.00000 Property35 ", " 1.44571 Property36 ", " 1.08756 Property37 ", " 0.00000 Property38 ", " 0.00000 Property39 ", " 0.00000 Property40 ", " 6.00000 Property41 ", " 9.00000 Property42 ", " 0.00000 Property43 "], "stylingDirectives": [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []], "csv": null, "csvError": null, "dependabotInfo": {"showConfigurationBanner": false, "configFilePath": null, "networkDependabotPath": "/contributor/repository/network/updates", "dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice", "configurationNoticeDismissed": null, "repoAlertsPath": "/contributor/repository/security/dependabot", "repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis", "repoOwnerIsOrg": false, "currentUserCanAdminRepo": false}, "displayName": "thing2", "displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true", "headerInfo": {"blobSize": "3.37 KB", "deleteInfo": {"deleteTooltip": "You must be signed in to make or propose changes"}, "editInfo": {"editTooltip": “XXX”}, "ghDesktopPath": "https://desktop.repurl.com", "repurlLfsPath": null, "onBranch": true, "shortPath": “5678”, "siteNavLoginPath": "/login?return_to=identifier”, "isCSV": false, "isRichtext": false, "toc": null, "lineInfo": {"truncatedLoc": “33”, "truncatedSloc": “33”}, "mode": "executable file"}, "image": false, "isCodeownersFile": null, "isPlain": false, "isValidLegacyIssueTemplate": false, 
"issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue", "issueTemplate": null, "discussionTemplate": null, "language": null, "languageID": null, "large": false, "loggedIn": false, "newDiscussionPath": "/contributor/repository/issues/new", "newIssuePath": "/contributor/repository/issues/new", "planSupportInfo": {"repoOption1”: null, "repoOption2”: null, "requestFullPath": "/contributor/repository/blob/main/thing2", "repoOption4”: null, "repoOption5”: null, "repoOption6”: null, "repoOption7”: null}, "repoOption8”: {"repoOption9”: "/settings/dismiss-notice/repoOption10”, "repoOption11”: "/settings/dismiss", "releasePath": "/contributor/repository/releases/new=true", "repoOption11: false, "repoOption12”: false}, "rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2", "repoOption13”: false, "richText": null, "renderedFileInfo": null, "shortPath": null, "tabSize": 8, "topBannersInfo": {"overridingGlobalFundingFile": false, "universalPath": null, "repoOwner": "contributor", "repoName": "repository", "repoOption14”: false, "citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about", "repoOption15”: false, "repoOption16”: null}, "truncated": false, "viewable": true, "workflowRedirectUrl": null, "symbols": {"timedOut": false, "notAnalyzed": true, "symbols": []}}, "collabInfo": null, "collabMod": false, “wtsdf_signifier”: {"/contributor/repository/branches": {"post": “identifier”}, "/repos/preferences": {"post": “identifier”}}}, "title": "repository/thing2 at main \u0000 contributor/repository"}
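First of all, your JSON sample is invalid. There are plenty of JSON checkers/formatters online; don't forget to use one next time.
I spent some time cleaning it up; here is the cleaned version: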
{
"payload": {
"allShortcutsEnabled": false,
"fileTree": {
"": {
"items": [
{
"name": "thing",
"path": "thing",
"contentType": "directory"
},
{
"name": ".repurlignore",
"path": ".repurlignore",
"contentType": "file"
},
{
"name": "README.md",
"path": "README.md",
"contentType": "file"
},
{
"name": "thing2",
"path": "thing2",
"contentType": "file"
},
{
"name": "thing3",
"path": "thing3",
"contentType": "file"
},
{
"name": "thing4",
"path": "thing4",
"contentType": "file"
},
{
"name": "thing5",
"path": "thing5",
"contentType": "file"
},
{
"name": "thing6",
"path": "thing6",
"contentType": "file"
},
{
"name": "thing7",
"path": "thing7",
"contentType": "file"
},
{
"name": "thing8",
"path": "thing8",
"contentType": "file"
},
{
"name": "thing9",
"path": "thing9",
"contentType": "file"
},
{
"name": "thing10",
"path": "thing10",
"contentType": "file"
},
{
"name": "thing11",
"path": "thing11",
"contentType": "file"
}
],
"totalCount": 500
}
},
"fileTreeProcessingTime": 5.262188,
"foldersToFetch": [],
"reducedMotionEnabled": null,
"repo": {
"id": 1234567,
"defaultBranch": "main",
"name": "repository",
"ownerLogin": "contributor",
"currentUserCanPush": false,
"isFork": false,
"isEmpty": false,
"createdAt": "2023-10-31",
"ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1",
"public": true,
"private": false,
"isOrgOwned": false
},
"symbolsExpanded": false,
"treeExpanded": true,
"refInfo": {
"name": "main",
"listCacheKey": "v0:13579",
"canEdit": false,
"refType": "branch",
"currentOid": "identifier"
},
"path": "thing2",
"currentUser": null,
"blob": {
"rawLines": [
" C_1H_4 Methane ",
" 5.00000 Property1 ",
" 20.00000 Property2 ",
" 500.66500 Property3 ",
" 100.00000 Property4 ",
" -4453.98887 Property5 ",
" 100.48200 Property6 ",
" 59.75258 Property7 ",
" 5.33645 Property8 ",
" 0.00000 Property9 ",
" 645.07777 Property10 ",
" 0.00000 Property11 ",
" 0.00000 Property12 ",
" 0.00000 Property13 ",
" 0.00000 Property14 ",
" 0.00000 Property15 ",
" 0.00000 Property16 ",
" 0.00000 Property17 ",
" 0.00000 Property18 ",
" 0.00000 Property19 ",
" 0.00000 Property20 ",
" 0.00000 Property21 ",
" 0.00000 Property22 ",
" 0.00000 Property23 ",
" 0.00000 Property24 ",
" 0.00000 Property25 ",
" 0.57876 Property26 ",
" 4.00000 Property27 ",
" 0.00000 Property28 ",
" 0.00000 Property29 ",
" 0.00000 Property30 ",
" 0.00000 Property31 ",
" 0.00000 Property32 ",
" 1.00000 Property33 ",
" 0.00000 Property34 ",
" 26.00000 Property35 ",
" 1.44571 Property36 ",
" 1.08756 Property37 ",
" 0.00000 Property38 ",
" 0.00000 Property39 ",
" 0.00000 Property40 ",
" 6.00000 Property41 ",
" 9.00000 Property42 ",
" 0.00000 Property43 "
],
"stylingDirectives": [
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[],
[]
],
"csv": null,
"csvError": null,
"dependabotInfo": {
"showConfigurationBanner": false,
"configFilePath": null,
"networkDependabotPath": "/contributor/repository/network/updates",
"dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice",
"configurationNoticeDismissed": null,
"repoAlertsPath": "/contributor/repository/security/dependabot",
"repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis",
"repoOwnerIsOrg": false,
"currentUserCanAdminRepo": false
},
"displayName": "thing2",
"displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true",
"headerInfo": {
"blobSize": "3.37 KB",
"deleteInfo": {
"deleteTooltip": "You must be signed in to make or propose changes"
},
"editInfo": {
"editTooltip": "XXX"
},
"ghDesktopPath": "https://desktop.repurl.com",
"repurlLfsPath": null,
"onBranch": true,
"shortPath": "5678",
"siteNavLoginPath": "/login?return_to=identifier",
"isCSV": false,
"isRichtext": false,
"toc": null,
"lineInfo": {
"truncatedLoc": "33",
"truncatedSloc": "33"
},
"mode": "executable file"
},
"image": false,
"isCodeownersFile": null,
"isPlain": false,
"isValidLegacyIssueTemplate": false,
"issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue",
"issueTemplate": null,
"discussionTemplate": null,
"language": null,
"languageID": null,
"large": false,
"loggedIn": false,
"newDiscussionPath": "/contributor/repository/issues/new",
"newIssuePath": "/contributor/repository/issues/new",
"planSupportInfo": {
"repoOption1": null,
"repoOption2": null,
"requestFullPath": "/contributor/repository/blob/main/thing2",
"repoOption4": null,
"repoOption5": null,
"repoOption6": null,
"repoOption7": null
},
"repoOption8": {
"repoOption9": "/settings/dismiss-notice/repoOption10",
"releasePath": "/contributor/repository/releases/new=true",
"repoOption11": false,
"repoOption12": false
},
"rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2",
"repoOption13": false,
"richText": null,
"renderedFileInfo": null,
"shortPath": null,
"tabSize": 8,
"topBannersInfo": {
"overridingGlobalFundingFile": false,
"universalPath": null,
"repoOwner": "contributor",
"repoName": "repository",
"repoOption14": false,
"citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about",
"repoOption15": false,
"repoOption16": null
},
"truncated": false,
"viewable": true,
"workflowRedirectUrl": null,
"symbols": {
"timedOut": false,
"notAnalyzed": true,
"symbols": []
}
},
"collabInfo": null,
"collabMod": false,
"wtsdf_signifier": {
"/contributor/repository/branches": {
"post": "identifier"
},
"/repos/preferences": {
"post": "identifier"
}
}
},
"title": "repository/thing2 at main \\u0000 contributor/repository"
}
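A checker does not have to be an online tool. As a rough sketch, Python's own json module reports where a sample stops parsing, and an object_pairs_hook can flag repeated keys, which json.loads otherwise accepts silently while keeping only the last value. The helper names check_json and reject_duplicate_keys are made up for this illustration, and the miniature samples only imitate fragments of the posted data:

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"Duplicate key {key!r}")
        seen[key] = value
    return seen


def check_json(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text, object_pairs_hook=reject_duplicate_keys)
    except json.JSONDecodeError as err:
        # JSONDecodeError subclasses ValueError, so it must be caught first
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    except ValueError as err:
        return str(err)
    return None


# Hypothetical miniature samples: a curly closing quote and a repeated key
curly = '{"listCacheKey": "v0:13579\u201d, "canEdit": false}'
duplicate = '{"repoOption11": "/settings/dismiss", "repoOption11": false}'

print(check_json(curly))       # reports a position on line 1
print(check_json(duplicate))   # Duplicate key 'repoOption11'
print(check_json(""))          # line 1, column 1: Expecting value
print(check_json('{"canEdit": false}'))  # None
```

The empty-string case produces the same "Expecting value: line 1 column 1" message as the traceback in the question, which suggests the string handed to json.load there was empty or was not JSON at all.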
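Errors in the original sample:
“ and ” → replaced with "
Error: Duplicate key 'repoOption11' → removed the first one, which held a non-boolean value, judging by the other repoOptionXX keys
After saving the cleaned JSON to a file, the rawLines section can be extracted. Here is the code I wrote to do this; it could be heavily optimized and simplified: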
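import json
import pandas as pd

# Load the cleaned JSON file; the with-block closes it automatically
with open("yourJson.json", "r") as f:
    data = json.load(f)

# Get what we want to extract from the json
to_extract = data["payload"]["blob"]["rawLines"]

# Split on any run of whitespace, which also drops the padding
rows = [line.split() for line in to_extract]

# Build the frame from a list of [value, property] rows; a dict keyed by
# value would keep only one row per distinct value (e.g. a single 0.00000)
df = pd.DataFrame(rows, columns=['Value', 'Property'])
print(df)
With this code I got the following pandas DataFrame, which you can certainly work from.
Value Property
0 C_1H_4 Methane
1 5.00000 Property1
2 20.00000 Property2
3 500.66500 Property3
4 100.00000 Property4
5 -4453.98887 Property5
6 100.48200 Property6
7 59.75258 Property7
8 5.33645 Property8
9 0.00000 Property9
10 645.07777 Property10
11 0.00000 Property11
12 0.00000 Property12
13 0.00000 Property13
14 0.00000 Property14
15 0.00000 Property15
16 0.00000 Property16
17 0.00000 Property17
18 0.00000 Property18
19 0.00000 Property19
20 0.00000 Property20
21 0.00000 Property21
22 0.00000 Property22
23 0.00000 Property23
24 0.00000 Property24
25 0.00000 Property25
26 0.57876 Property26
27 4.00000 Property27
28 0.00000 Property28
29 0.00000 Property29
30 0.00000 Property30
31 0.00000 Property31
32 0.00000 Property32
33 1.00000 Property33
34 0.00000 Property34
35 26.00000 Property35
36 1.44571 Property36
37 1.08756 Property37
38 0.00000 Property38
39 0.00000 Property39
40 0.00000 Property40
41 6.00000 Property41
42 9.00000 Property42
43 0.00000 Property43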
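As for reading straight from the website: the attempt in the question goes wrong in two places. response.json() already returns a dict, so calling json.dumps on it turns it back into a string that cannot be indexed by key, and the lookup skips the top-level "payload" key. A minimal sketch of the direct route follows, assuming the response has the same shape as the posted sample. The function names are made up for illustration, the sample dict stands in for the real response, and each row is kept as a (name, value) pair in a list so that repeated values such as 0.00000 are not lost:

```python
def rows_from_raw_lines(raw_lines):
    """Split each 'value  name' line into a (name, value) pair."""
    rows = []
    for line in raw_lines:
        parts = line.split()  # splits on any run of whitespace
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        rows.append((" ".join(parts[1:]), parts[0]))
    return rows


def frame_from_payload(data):
    """Build a DataFrame from the decoded JSON (a dict, not a string)."""
    import pandas as pd  # imported here so the parsing helper needs only the stdlib

    raw_lines = data["payload"]["blob"]["rawLines"]
    return pd.DataFrame(rows_from_raw_lines(raw_lines),
                        columns=["Property", "Value"])


# Stand-in for response.json(); with requests it would be roughly:
#   data = requests.get(url, timeout=10).json()
sample = {"payload": {"blob": {"rawLines": [
    "   C_1H_4       Methane   ",
    "   5.00000      Property1 ",
    "   0.00000      Property9 ",
    "   0.00000      Property11 ",
]}}}

rows = rows_from_raw_lines(sample["payload"]["blob"]["rawLines"])
print(rows)
```

Calling frame_from_payload(sample) gives a two-column frame with the property name first, as asked for in the question. The values stay strings here; pd.to_numeric on the Value column (after setting aside the first row, which holds the formula) should convert them to floats if numbers are needed.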