Consider a file btc_transactions.json containing the following data:
{"txid":"00a5b60bf38d0605a7ed65c557722e42c1637e1dade80e37b7fc73cea3b67d9b","consensus_time":"2013-07-14T07:07:24.000000000Z","tx_position":"1058687963627546","n_balance_updates":"4","amount":"0.7","block_hash":"000000000000004b880aa75d36e2586b8570b8e15dc192f33e5e834bf51fd06a","height":"246495","miner_time":"2013-07-14T07:43:12.000000000Z","min_chain_sequence_number":"1058687963627730","max_chain_sequence_number":"1058687963627733","version":"1","physical_size":"227","consensus_size":"908","fee":"0.0001"}
{"txid":"991cbe2565efb9712d55db1683548625403b325ea9a9dfed906fce1fb9bd1941","consensus_time":"2013-07-14T07:07:24.000000000Z","tx_position":"1058687963627547","n_balance_updates":"4","amount":"0.66434783","block_hash":"000000000000004b880aa75d36e2586b8570b8e15dc192f33e5e834bf51fd06a","height":"246495","miner_time":"2013-07-14T07:43:12.000000000Z","min_chain_sequence_number":"1058687963627734","max_chain_sequence_number":"1058687963627737","version":"1","physical_size":"258","consensus_size":"1032","fee":"0.0001"}
{"txid":"49044a76c558afad2b8f3bb895c958922e08bdf47931a60110ff9{"txid":"5038df9cc96824f7b990f4cc9405a5babe4a742e79b0c325e0487033b78d1e85","consensus_time":"2013-07-16T06:43:24.000000000Z","tx_position":"1060118187737455","n_balance_updates":"4","amount":"0.64502946","block_hash":"000000000000004e79a680599d4e3a2f3fc244d7ef7238666fa52418008d53a0","height":"246828","miner_time":"2013-07-16T07:33:26.000000000Z","min_chain_sequence_number":"1060118187739070","max_chain_sequence_number":"1060118187739073","version":"1","physical_size":"227","consensus_size":"908","fee":"0.0001"}
{"txid":"c785ea0649bf6df832afa9e58b5719967b8d30eef4878fc0bb6302e8b8fc0e31","consensus_time":"2013-07-14T07:12:41.000000000Z","tx_position":"1058692258594819","n_balance_updates":"4","amount":"400","block_hash":"0000000000000022847171b0b5a2e0c0ddf2444ef2dd6925c8a85891f1199e08","height":"246496","miner_time":"2013-07-14T08:09:32.000000000Z","min_chain_sequence_number":"1058692258594827","max_chain_sequence_number":"1058692258594830","version":"1","physical_size":"225","consensus_size":"900","fee":"0"}
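The second-to-last line is not a valid JSON document, while the other 3 lines are valid (I would like to use dask to process many more lines and files).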
Loading this data with pd.read_json(..., lines=True) fails with ValueError: Unexpected character found when decoding object value.
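Is there a way to skip such lines, similar to the behavior of the on_bad_lines parameter of pandas.read_csv?
I believe this enhancement request had a similar aim, but it does not appear to have been implemented. I am aware that this question starts from the same problem and could also benefit from a potential solution.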
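Importantly, the faulty line does not cause an encoding-related error. For example, the option encoding_errors='ignore' does not solve the problem, and neither does switching to the pyarrow engine. The latter does, however, produce a more precise error message:
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 2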
One possible solution is to preprocess the JSON file with try...except to drop the invalid lines, and then load the remaining valid lines into pandas:
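import json
from io import StringIO

import pandas as pd

def get_valid_json_lines(file_path):
    # Yield only the lines that parse as JSON; silently skip the rest.
    with open(file_path, "r") as f:
        for line in f:
            try:
                json.loads(line)
                yield line
            except json.JSONDecodeError:
                continue

file_path = "btc_transactions.json"
valid_lines = "".join(get_valid_json_lines(file_path))
# Wrap the text in StringIO: passing a literal JSON string to read_json is deprecated in recent pandas.
df = pd.read_json(StringIO(valid_lines), lines=True)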
txid consensus_time tx_position ... physical_size consensus_size fee
0 00a5b60bf38d0605a7ed65c557722e42c1637e1dade80e... 2013-07-14 07:07:24+00:00 1058687963627546 ... 227 908 0.0001
1 991cbe2565efb9712d55db1683548625403b325ea9a9df... 2013-07-14 07:07:24+00:00 1058687963627547 ... 258 1032 0.0001
2 c785ea0649bf6df832afa9e58b5719967b8d30eef4878f... 2013-07-14 07:12:41+00:00 1058692258594819 ... 225 900 0.0000
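The preprocessing above drops bad lines silently. As a minimal sketch of something closer to on_bad_lines='warn', the variant below also records which line numbers were skipped. The function name split_json_lines and the sample records are hypothetical, made up for illustration. It uses only the standard library and parses each line once, rather than once in json.loads and again in pd.read_json.

```python
import json
from typing import Iterable, List, Tuple


def split_json_lines(lines: Iterable[str]) -> Tuple[List[dict], List[int]]:
    """Parse JSON Lines input and return (records, bad_line_numbers)."""
    records: List[dict] = []
    bad_line_numbers: List[int] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are neither records nor errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad_line_numbers.append(line_number)
            continue
        if isinstance(record, dict):
            records.append(record)
        else:
            # Valid JSON, but not an object (e.g. a bare number): treat as bad.
            bad_line_numbers.append(line_number)
    return records, bad_line_numbers


# Hypothetical sample: line 2 mimics a record truncated mid-value and
# fused with the start of the next record, as in the question.
sample = [
    '{"txid": "a", "fee": "0.0001"}\n',
    '{"txid": "ab{"txid": "c", "fee": "0"}\n',
    '{"txid": "d", "fee": "0"}\n',
]
records, bad = split_json_lines(sample)
print(len(records), bad)
```

The resulting list of dicts can be passed to pd.DataFrame(records). Note that, unlike pd.read_json, this does not infer dtypes, so the string-encoded numbers and timestamps would need converting afterwards. For the dask case, the same per-line parsing could presumably be mapped over a bag created with dask.bag.read_text, but that is untested here.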