How to remove unwanted elements from a dataset

Question  Votes: 1  Answers: 3

I have a dataset that looks like this:

 {0: {"address": 0,
         "ctag": "TOP",
         "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
         "feats": "",
         "head": "",
         "lemma": "",
         "rel": "",
         "tag": "TOP",
         "word": ""},
     1: {"address": 1,
         "ctag": "Ne",
         "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
         "feats": "_",
         "head": 6,
         "lemma": "اشرف",
         "rel": "SBJ",
         "tag": "Ne",
         "word": "اشرف"},

I want to remove "deps": ... from this dataset. I tried the code below, but it did not work, because the value of "deps" is different in every element of the dict.

import re
import simplejson as simplejson

with open("../data/cleaned.txt", 'r') as fp:
    lines = fp.readlines()
    k = str(lines)
    a = re.sub(r'\d:', '', k) # this is for removing numbers like `1:{..`
    json_data = simplejson.dumps(a)
    #print(json_data)
    n = eval(k.replace('defaultdict(<class "list">', 'list'))
    print(n)
python json preprocessor
3 Answers
1 vote

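The right approach is to fix the code that generates the text file. The defaultdict(<class "list">, {"ROOT": [6, 51]}) suggests that it writes a plain repr where a more sensible format is needed.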
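As a rough sketch of what such a fix could look like, assuming the generating code holds the data as a dict of dicts like the one in the question: the names nodes, nodes_to_json and drop below are made up for illustration and are not part of the original code.

```python
import json
from collections import defaultdict


def nodes_to_json(nodes, drop=("deps",)):
    """Return a JSON string for a {int: {field: value}} mapping,
    leaving out every field named in `drop` (hypothetical helper)."""
    cleaned = {
        # JSON object keys must be strings, so the integer address is converted.
        str(address): {k: v for k, v in node.items() if k not in drop}
        for address, node in nodes.items()
    }
    # ensure_ascii=False keeps non-ASCII words readable in the output.
    return json.dumps(cleaned, ensure_ascii=False, indent=2)


# Tiny stand-in for the real data (assumed shape, shortened).
nodes = {
    0: {"address": 0, "tag": "TOP", "deps": defaultdict(list, {"ROOT": [6, 51]})},
    1: {"address": 1, "tag": "Ne", "deps": defaultdict(list, {"NPOSTMOD": [2]})},
}
text = nodes_to_json(nodes)
```

A file written this way can be read back with json.loads or json.load directly, with no regular expressions involved.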

If a real fix is not possible, what follows is only a poor man's workaround.

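Getting rid of "deps": ... is easy: it is enough to read the file one line at a time and discard every line that starts with "deps" (ignoring leading whitespace). But that alone is not enough, because the file contains numeric keys while JSON insists that keys be strings. So the numeric keys must be identified and quoted.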

This should allow the file to be loaded:

import re
import simplejson as json

with open("../data/cleaned.txt", 'r') as fp:
    k = ''.join(re.sub(r'(?<!\w)(\d+)', r'"\1"',line)
        for line in fp if not line.strip().startswith('"deps"'))

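# remove a possible trailing comma (\s already matches newlines, so no flag is needed;
# passing re.DOTALL as the fourth positional argument would set count, not flags)
k = re.sub(r',\s*$', '', k)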

# uncomment if the file does not contain the last }
# k += '}'

js = json.loads(k)
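The same idea can be tried on an in-memory string before touching the real file. This is only a sketch under assumptions: RAW is a shortened stand-in for the file, load_without_deps is a made-up name, and the regex is a variant that quotes only integers in key position, so values such as "head": 6 stay numbers instead of becoming strings. It also assumes no string value contains braces, since the missing closing braces are found by counting.

```python
import json
import re

# Shortened stand-in for the contents of the text file (assumed layout).
RAW = '''{0: {"address": 0,
     "ctag": "TOP",
     "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
     "head": "",
     "word": ""},
 1: {"address": 1,
     "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
     "head": 6,
     "word": "x"},
'''


def load_without_deps(text):
    """Drop the "deps" lines, quote integer keys, and parse the rest as JSON."""
    kept = [ln for ln in text.splitlines()
            if not ln.strip().startswith('"deps"')]
    body = "\n".join(kept)
    # Quote an integer only when it starts a line or follows "{" and is
    # followed by ":", i.e. when it is used as a key.
    body = re.sub(r'(^|\{)\s*(\d+)\s*:', r'\1"\2":', body, flags=re.M)
    # Remove a possible trailing comma.
    body = body.rstrip().rstrip(",")
    # Append any closing braces the file is missing.
    body += "}" * (body.count("{") - body.count("}"))
    return json.loads(body)


result = load_without_deps(RAW)
```

Here result is an ordinary dict keyed by "0" and "1", with no "deps" entries left.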

0 votes

Try this (it assumes the file is valid JSON, which the sample in the question is not as shown):

import json
with open("../data/cleaned.txt", 'r') as fp:
    data = json.load(fp)
    for key, value in data.items():
        value.pop("deps", None)

Now you have the data without deps. If you want to dump the records to a new file:

# json.dump expects an open file object, not a file name
with open("output.json", "w") as out:
    json.dump(data, out)

0 votes

How about this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

data = {0: {"address": 0,
            "ctag": "TOP",
            "deps": 'something',
            "feats": "",
            "head": "",
            "lemma": "",
            "rel": "",
            "tag": "TOP",
            "word": ""},
        1: {"address": 1,
            "ctag": "Ne",
            "deps": 'something',
            "feats": "_",
            "head": 6,
            "lemma": "اشرف",
            "rel": "SBJ",
            "tag": "Ne",
            "word": "اشرف"}}

for value in data.values():
    if 'deps' in value:
        del value['deps']