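I am working with a large gzip-compressed JSON file of review data, formatted as a list of JSON objects with each object separated by a newline. My goal is to use `ijson` to efficiently extract the `review_text` field from each object without loading the entire file into memory, since the file contains over 15 million records.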
However, when I try to parse the file with `ijson`, I get the following error:
IncompleteJSONError: parse error: trailing garbage
votes": 16, "n_comments": 0} {"user_id": "8842281e1d1347389f
(right here) ------^
Here is an actual sample of the data (the first two lines of the file, shown as a Python list of strings):
['{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "24375664", "review_id": "5cd416f3efc3f944fce4ce2db2290d5e", "rating": 5, "review_text": "Mind blowingly cool. Best science fiction I\'ve read in some time. I just loved all the descriptions of the society of the future - how they lived in trees, the notion of owning property or even getting married was gone. How every surface was a screen. \\n The undulations of how society responds to the Trisolaran threat seem surprising to me. Maybe its more the Chinese perspective, but I wouldn\'t have thought the ETO would exist in book 1, and I wouldn\'t have thought people would get so over-confident in our primitive fleet\'s chances given you have to think that with superior science they would have weapons - and defenses - that would just be as rifles to arrows once were. \\n But the moment when Luo Ji won as a wallfacer was just too cool. I may have actually done a fist pump. Though by the way, if the Dark Forest theory is right - and I see no reason why it wouldn\'t be - we as a society should probably stop broadcasting so much signal out into the universe.", "date_added": "Fri Aug 25 13:55:02 -0700 2017", "date_updated": "Mon Oct 09 08:55:59 -0700 2017", "read_at": "Sat Oct 07 00:00:00 -0700 2017", "started_at": "Sat Aug 26 00:00:00 -0700 2017", "n_votes": 16, "n_comments": 0}\n', '{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "18245960", "review_id": "dfdbb7b0eb5a7e4c26d59a937e2e5feb", "rating": 5, "review_text": "This is a special book. It started slow for about the first third, then in the middle third it started to get interesting, then the last third blew my mind. This is what I love about good science fiction - it pushes your thinking about where things can go. \\n It is a 2015 Hugo winner, and translated from its original Chinese, which made it interesting in just a different way from most things I\'ve read. 
For instance the intermixing of Chinese revolutionary history - how they kept accusing people of being \\"reactionaries\\", etc. \\n It is a book about science, and aliens. The science described in the book is impressive - its a book grounded in physics and pretty accurate as far as I could tell. Though when it got to folding protons into 8 dimensions I think he was just making stuff up - interesting to think about though. \\n But what would happen if our SETI stations received a message - if we found someone was out there - and the person monitoring and answering the signal on our side was disillusioned? That part of the book was a bit dark - I would like to think human reaction to discovering alien civilization that is hostile would be more like Enders Game where we would band together. \\n I did like how the book unveiled the Trisolaran culture through the game. It was a smart way to build empathy with them and also understand what they\'ve gone through across so many centuries. And who know a 3 body problem was an unsolvable math problem? But I still don\'t get who made the game - maybe that will come in the next book. \\n I loved this quote: \\n \\"In the long history of scientific progress, how many protons have been smashed apart in accelerators by physicists? How many neutrons and electrons? Probably no fewer than a hundred million. Every collision was probably the end of the civilizations and intelligences in a microcosmos. In fact, even in nature, the destruction of universes must be happening at every second--for example, through the decay of neutrons. Also, a high-energy cosmic ray entering the atmosphere may destroy thousands of such miniature universes....\\"", "date_added": "Sun Jul 30 07:44:10 -0700 2017", "date_updated": "Wed Aug 30 00:00:26 -0700 2017", "read_at": "Sat Aug 26 12:05:52 -0700 2017", "started_at": "Tue Aug 15 13:23:18 -0700 2017", "n_votes": 28, "n_comments": 1}\n']
The original dataset is listed under the book reviews section here: https://mengtingwan.github.io/data/goodreads.html#datasets
I tried parsing the file directly with `ijson`, but every attempt fails with the same "trailing garbage" error shown above.
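How can I efficiently parse a large gzip-compressed JSON file, using `ijson` or any other method, without loading the entire file into memory and without triggering the "trailing garbage" error? What adjustments do I need to make to handle this file format correctly?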
Here is my current attempt, which produces the error:

import ijson
import pandas as pd
import gzip

review_texts = []
gzip_file_path = 'goodreads_dataset.json.gz'
with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
    objects = ijson.items(f, 'item')  # Use 'item' if it's a top-level array
    for obj in objects:
        if 'review_text' in obj:
            review_texts.append(obj['review_text'])

df = pd.DataFrame(review_texts, columns=['review_text'])
df.to_pickle('reviews.pkl')
print(f"Saved {len(df)} review_text entries to 'reviews.pkl'")
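Each line of the data file contains one complete JSON document, so the format is actually JSON Lines rather than a single JSON array, and the archive's extension should really be `.jsonl.gz`. That is why `ijson` reports "trailing garbage": after the first object ends, it does not expect a second top-level value.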
You can simply read the file line by line and parse each line with the regular `json` module:
import gzip
import json

review_texts = []
gzip_file_path = 'goodreads_dataset.json.gz'
with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        review_texts.append(obj['review_text'])
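
With more than 15 million records, even the list of extracted strings may be large, so it can help to wrap the same line-by-line approach in a generator and consume the texts one at a time. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `iter_review_texts` and the tiny demo file are made up for illustration, and skipping blank lines and records without a `review_text` key is an assumption about how you want to handle such input.

```python
import gzip
import json
import os
import tempfile


def iter_review_texts(path):
    """Yield review_text from each record of a gzip-compressed JSON Lines file.

    Hypothetical helper: only one line is decoded and held at a time.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # Assumption: blank lines are tolerated and skipped.
                continue
            obj = json.loads(line)
            if 'review_text' in obj:
                yield obj['review_text']


# Demo on a tiny made-up file (not the real Goodreads data).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('{"rating": 5, "review_text": "first"}\n')
    f.write('\n')
    f.write('{"rating": 3}\n')
    f.write('{"rating": 4, "review_text": "second"}\n')

demo_texts = list(iter_review_texts(demo_path))
print(demo_texts)  # ['first', 'second']
```

For the real file you would iterate over `iter_review_texts('goodreads_dataset.json.gz')` directly, for example writing each text to an output file as it arrives, rather than calling `list()` on it.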