Skip lines in a JSON document that cause "ValueError: Unexpected character found when decoding object value" in pd.read_json(..., lines=True)


Consider a document, btc_transactions.json, containing the following data:


{"txid":"00a5b60bf38d0605a7ed65c557722e42c1637e1dade80e37b7fc73cea3b67d9b","consensus_time":"2013-07-14T07:07:24.000000000Z","tx_position":"1058687963627546","n_balance_updates":"4","amount":"0.7","block_hash":"000000000000004b880aa75d36e2586b8570b8e15dc192f33e5e834bf51fd06a","height":"246495","miner_time":"2013-07-14T07:43:12.000000000Z","min_chain_sequence_number":"1058687963627730","max_chain_sequence_number":"1058687963627733","version":"1","physical_size":"227","consensus_size":"908","fee":"0.0001"}
{"txid":"991cbe2565efb9712d55db1683548625403b325ea9a9dfed906fce1fb9bd1941","consensus_time":"2013-07-14T07:07:24.000000000Z","tx_position":"1058687963627547","n_balance_updates":"4","amount":"0.66434783","block_hash":"000000000000004b880aa75d36e2586b8570b8e15dc192f33e5e834bf51fd06a","height":"246495","miner_time":"2013-07-14T07:43:12.000000000Z","min_chain_sequence_number":"1058687963627734","max_chain_sequence_number":"1058687963627737","version":"1","physical_size":"258","consensus_size":"1032","fee":"0.0001"}
{"txid":"49044a76c558afad2b8f3bb895c958922e08bdf47931a60110ff9{"txid":"5038df9cc96824f7b990f4cc9405a5babe4a742e79b0c325e0487033b78d1e85","consensus_time":"2013-07-16T06:43:24.000000000Z","tx_position":"1060118187737455","n_balance_updates":"4","amount":"0.64502946","block_hash":"000000000000004e79a680599d4e3a2f3fc244d7ef7238666fa52418008d53a0","height":"246828","miner_time":"2013-07-16T07:33:26.000000000Z","min_chain_sequence_number":"1060118187739070","max_chain_sequence_number":"1060118187739073","version":"1","physical_size":"227","consensus_size":"908","fee":"0.0001"}
{"txid":"c785ea0649bf6df832afa9e58b5719967b8d30eef4878fc0bb6302e8b8fc0e31","consensus_time":"2013-07-14T07:12:41.000000000Z","tx_position":"1058692258594819","n_balance_updates":"4","amount":"400","block_hash":"0000000000000022847171b0b5a2e0c0ddf2444ef2dd6925c8a85891f1199e08","height":"246496","miner_time":"2013-07-14T08:09:32.000000000Z","min_chain_sequence_number":"1058692258594827","max_chain_sequence_number":"1058692258594830","version":"1","physical_size":"225","consensus_size":"900","fee":"0"}

The second-to-last line is not valid JSON, while the other three lines are valid (I want to use dask to process many more lines and files).

Loading this data with pd.read_json(..., lines=True) fails with ValueError: Unexpected character found when decoding object value. Is there a way to skip such lines, similar to the behavior of the on_bad_lines parameter of pandas.read_csv?

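I believe this enhancement request has a similar aim, but it does not appear to have been implemented. I am aware that this question starts from the same point and could also benefit from a potential solution.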

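Importantly, the faulty line does not cause an encoding-related error. For example, the option encoding_errors='ignore' does not solve the problem, and neither does switching to the pyarrow engine. The latter, however, produces a more precise error message:

pyarrow.lib.ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 2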
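
To pinpoint which lines are affected without relying on pandas or pyarrow, a small check with the standard library json module can report the line number and parser message for each invalid line. This is only a minimal sketch: find_invalid_lines is a hypothetical helper name, and the sample records are shortened stand-ins for the data above.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, error_message) pairs for lines that are not valid JSON.

    `lines` is any iterable of strings, e.g. an open file.
    Line numbers are 1-based; blank lines are ignored.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
    return problems


# Shortened, hypothetical sample: the second record is truncated and
# fused with the start of the next one, as in btc_transactions.json.
sample = [
    '{"txid": "00a5", "fee": "0.0001"}\n',
    '{"txid": "4904{"txid": "5038", "fee": "0.0001"}\n',
    '{"txid": "c785", "fee": "0"}\n',
]

for number, message in find_invalid_lines(sample):
    print(f"line {number}: {message}")
```

Passing an open file handle instead of the sample list would check a whole file line by line without loading it into memory at once.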

json python-3.x pandas dataframe jsonparser
1 Answer

One possible solution is to preprocess the JSON file with try...except to drop the invalid lines, and then load the valid lines into pandas:

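import io
import json

import pandas as pd


def get_valid_json_lines(file_path):
    """Yield only the lines of file_path that parse as valid JSON."""
    with open(file_path, "r") as f:
        for line in f:
            try:
                json.loads(line)
                yield line
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON
                continue


file_path = "btc_transactions.json"
valid_lines = "".join(get_valid_json_lines(file_path))
# Wrap the text in StringIO: passing a literal JSON string to read_json is deprecated
df = pd.read_json(io.StringIO(valid_lines), lines=True)
print(df)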
                                                txid            consensus_time       tx_position  ...  physical_size  consensus_size     fee
0  00a5b60bf38d0605a7ed65c557722e42c1637e1dade80e... 2013-07-14 07:07:24+00:00  1058687963627546  ...            227             908  0.0001
1  991cbe2565efb9712d55db1683548625403b325ea9a9df... 2013-07-14 07:07:24+00:00  1058687963627547  ...            258            1032  0.0001
2  c785ea0649bf6df832afa9e58b5719967b8d30eef4878f... 2013-07-14 07:12:41+00:00  1058692258594819  ...            225             900  0.0000
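
The approach above parses every valid line twice, once in json.loads and once in pd.read_json. A possible variation is to keep the parsed objects and also record which lines were dropped, so that skipped records can be inspected afterwards. The following is a sketch under the assumption that each valid line is a JSON object; parse_json_lines is a hypothetical helper name and the sample data is shortened for illustration.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of JSON Lines strings in a single pass.

    Returns (records, skipped): records is a list of dicts, skipped is a
    list of 1-based line numbers that were dropped.
    """
    records, skipped = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are neither records nor errors
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            skipped.append(number)
            continue
        if isinstance(obj, dict):
            records.append(obj)
        else:
            skipped.append(number)  # valid JSON, but not an object
    return records, skipped


# Shortened, hypothetical sample with one malformed line
sample = [
    '{"txid": "00a5", "fee": "0.0001"}\n',
    '{"txid": "4904{"txid": "5038", "fee": "0.0001"}\n',
    '{"txid": "c785", "fee": "0"}\n',
]

records, skipped = parse_json_lines(sample)
print(len(records), skipped)  # prints: 2 [2]
```

With a real file, the function can be given the open file handle, and the result can then be passed to pd.DataFrame(records). Unlike pd.read_json, this leaves quoted values such as "height" or "consensus_time" as strings, so conversions like pd.to_numeric or pd.to_datetime would have to be applied explicitly.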