I'm working on an image classification project using the Snapshot Serengeti dataset. The dataset ships with a very large JSON file (5GB+) containing several top-level keys. For training I specifically need the values held in the "images": [{...}, {...}, ...] array. The file is too large for me to open and read directly or store into a dictionary.
The image entries in the file are formatted like this:
{
    "id": "S1/B04/B04_R1/S1_B04_R1_PICT0003",
    "file_name": "S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG",
    "frame_num": 1,
    "seq_id": "SER_S1#B04#1#3",
    "width": 2048,
    "height": 1536,
    "corrupt": false,
    "location": "B04",
    "seq_num_frames": 1,
    "datetime": "2010-07-20 06:14:06"
},
I tried looping over the file in 100MB chunks, but the file also has formatting issues (single quotes, NaN values) that would need to be fixed first, otherwise an error is thrown. The code I tried is below:
import json

with open(labels_json) as f:
    for chunk in iter(lambda: f.read(100 * 1024 * 1024), ""):
        data = json.loads(chunk)
Since the images are organized into 11 seasons, I tried writing the data out into 11 separate files that could then be loaded individually, using the script below, but my cloud storage got eaten up before even one season was stored. I'm not familiar with data storage problems like this, so there must be something in my script that makes the file writing inefficient. Any help is greatly appreciated.
import json

labels_json = annotations_directory + "SS_Labels.json"

get_filename = lambda n: f"SS_labels_S{i}.json"

# Define the 11 output files
seasons = {}
started = {}
for i in range(1, 12):
    filename = get_filename(i)
    seasons[i] = open(filename, "w")
    seasons[i].write('[')
    started[i] = False

def seperate_seasons(dir):
    line_num = 0
    decoder = json.JSONDecoder()
    with open(dir, 'r') as labels:
        begin_writing = False
        buffer = []
        id = 1

        for line in labels:
            if not begin_writing:  # Begin writing for the line after "images"
                if 'images' in line:
                    begin_writing = True
            else:
                line.replace('NaN', 'null')  # clean NaN values
                line.replace("'", '"')       # clean incorrect key values
                buffer.append(line.strip())  # add line to buffer

                getID = lambda l: int(line.split('"')[3].split('/')[0][1])
                if '"id"' in line or "'id'" in line:
                    previous_id = id
                    id = getID(line)  # get id of object

                if line.strip() == '},' or line.strip() == '}':  # when the object has finished, write it to the appropriate image folder
                    label = ','.join(buffer)
                    if label[-1] != ',':
                        label += ','

                    if started[id] == False:
                        print(f'Beginning Season {id}')
                        started[id] = True

                        if id != 1:
                            seasons[previous_id].write(']')
                            seasons[previous_id].close()
                            del seasons[previous_id]

                    seasons[id].write(label)  # add label entry to file

seperate_seasons(labels_json)

# Close all remaining label files
for season in seasons.values():
    season.write(']')
    season.close()
If you don't have the RAM to load the file into memory (and I wouldn't blame you if you don't), you can split the data into more manageable files with the help of a couple of extra pip-installable libraries: json-stream for streaming the JSON, orjson for a faster JSON encoder, and tqdm for a progress bar.
The input is the original single JSON file; the output folder out/ will end up containing the info and category data from the JSON, plus JSONL (aka JSON Lines, aka ND-JSON) files (i.e. one JSON object per line) à la
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0001","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0001.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#1","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-18 16:26:14"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0002","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0002.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#2","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-18 16:26:30"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0003","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#3","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-20 06:14:06"}
{"id":"S1/B04/B04_R1/S1_B04_R1_PICT0004","file_name":"S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG","frame_num":1,"seq_id":"SER_S1#B04#1#4","width":2048,"height":1536,"corrupt":false,"location":"B04","seq_num_frames":1,"datetime":"2010-07-22 08:56:06"}
JSONL files are easy to work with in many tools, and they can be parsed in Python with a simple for loop. If you like, you can replace open with gzip.open to compress the JSONL files as you go.
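For instance, reading one of the resulting files back is just a loop over its lines (a minimal sketch; "out/S1_B04.jsonl" is an illustrative name, since the actual filenames depend on the id prefixes in your data):

import json

# Each line of a JSONL file is a complete JSON document.
# If the files were written with gzip.open, swap open for gzip.open here too.
with open("out/S1_B04.jsonl") as f:
    for line in f:
        image = json.loads(line)
        print(image["file_name"], image["datetime"])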
The json_stream API is a bit finicky, but here you go - works on my machine (json-stream==2.3.0).
On my laptop, tqdm reports 29594 images being processed per second.
import os

import json_stream
import orjson
import tqdm


def read_images(value):
    jsonl_files = {}
    with tqdm.tqdm(value, unit="image") as pbar:
        for image in pbar:
            image = dict(image)
            # Group output files by the "<season>/<site>" prefix of the image id
            prefix = "/".join(image["id"].split("/")[:2])
            filename = f"out/{prefix.replace('/', '_')}.jsonl"
            if filename not in jsonl_files:
                # Cap the number of open handles; close the least recently
                # opened file first (dicts preserve insertion order)
                if len(jsonl_files) >= 50:
                    jsonl_files.pop(next(iter(jsonl_files))).close()
                jsonl_files[filename] = open(filename, "ab")
                pbar.set_description(f"Writing {filename}")
            jsonl_files[filename].write(orjson.dumps(image))
            jsonl_files[filename].write(b"\n")
    # Close whatever handles are still open
    for file in jsonl_files.values():
        file.close()


def main():
    os.makedirs("out", exist_ok=True)
    with open("/Users/akx/Downloads/SnapshotSerengeti_S1-11_v2.1.json", "rb") as f:
        data = json_stream.load(f)
        for key, value in data.items():
            if key == "info":
                value = dict(value.persistent().items())
                with open("out/info.json", "wb") as info_f:
                    info_f.write(orjson.dumps(value))
            elif key == "categories":
                value = [dict(d) for d in value.persistent()]
                with open("out/categories.json", "wb") as categories_f:
                    categories_f.write(orjson.dumps(value))
            elif key == "images":
                read_images(value.persistent())


if __name__ == "__main__":
    main()
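Once the split is done, your training code can stream one season at a time without ever touching the original 5GB file. A sketch under the naming assumption above (every output file for season n starts with an S<n>_ prefix taken from the image id; iter_season is a hypothetical helper):

import glob
import json

def iter_season(season):
    # Collect every site file for the given season, e.g. out/S1_B04.jsonl.
    # "S1_*" cannot match season 10/11 files, whose names start with "S10_"/"S11_".
    for path in sorted(glob.glob(f"out/S{season}_*.jsonl")):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

# For example, count the non-corrupt images in season 1:
print(sum(1 for image in iter_season(1) if not image["corrupt"]))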