Regex optimization for converting legacy data to JSON

Question

I have a legacy data format that resembles JSON. The code block below shows an example:

1  {
2    base = {
3      attributeNumericName = V1D,
4      attributeName = hello,
5      attributeFloat = 1.0,
6      canContainLists = [
7        1.0,
8        2.0
9      ],
10     canContainStackedData = [
11       {
12         stackedAttribute = 1.0
13       }
14     ]
15   }
16 }

To read this efficiently in Python, I convert the text with regex substitutions. I currently use the following three replacements:

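Action	Search	Replacement	Affected lines
Replace all attributes with numeric values	`([\w]{1,}) = ([\d\.]{1,},?)`	`"\1": \2`	5, 12
Replace all attributes with string values	`([\w]{1,}) = ([\w]{1,})`	`"\1": "\2"`	3, 4
Replace all remaining attributes	`([\w\d]{1,}) =`	`"\1":`	2, 6, 10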
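
For reference, the three passes in the table above might be applied in Python roughly as follows. This is only a sketch of the approach described in the question: the names `PASSES` and `convert_three_pass` are hypothetical, and the patterns are copied from the table unchanged. The order matters, because the numeric pass must run before the string pass.

```python
import re

# The three substitutions from the table, applied in the listed order.
# Names are illustrative; only the patterns come from the question.
PASSES = [
    (re.compile(r'([\w]{1,}) = ([\d\.]{1,},?)'), r'"\1": \2'),    # numeric values
    (re.compile(r'([\w]{1,}) = ([\w]{1,})'), r'"\1": "\2"'),      # string values
    (re.compile(r'([\w\d]{1,}) ='), r'"\1":'),                    # remaining keys
]

def convert_three_pass(text):
    """Run every substitution over the whole text, one pass per pattern."""
    for pattern, replacement in PASSES:
        text = pattern.sub(replacement, text)
    return text
```

Each call to `pattern.sub` scans the full text again, which is the repeated traversal the question wants to avoid.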

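Because these substitutions traverse the content three times, the conversion is quite slow. In total I need to process several terabytes of files, so I am looking to optimize this. I have been trying to work out how to combine the three replacements into one, but so far without success. Any help would be much appreciated.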

Answer 1

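The original format uses an equals sign for assignment and does not quote keys or string values. The challenge is to convert it efficiently in a single pass rather than with multiple regex substitutions. I used the regex pattern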

([\w\d]+)\s*=\s*(?:(\d+\.?\d*)|(\[)|(\{)|(\w+))

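to capture the key name and the different value types (number, list, dict, word). A replacement function then converts each match into proper JSON syntax, quoting the key and handling the value according to its type. Because the text is scanned only once, this should be considerably more efficient than running several regex patterns sequentially over the same text.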

import re
import json

def convert_to_json(input_text):
    # key = value, where value is a number, '[', '{' or a bare word
    pattern = r'([\w\d]+)\s*=\s*(?:(\d+\.?\d*)|(\[)|(\{)|(\w+))'

    def replacement(match):
        key = match.group(1)
        number = match.group(2)
        list_start = match.group(3)
        dict_start = match.group(4)
        word = match.group(5)

        # Exactly one of the value groups is set for every match
        if number:
            return f'"{key}": {number}'
        elif list_start:
            return f'"{key}": ['
        elif dict_start:
            return f'"{key}": {{'
        elif word:
            return f'"{key}": "{word}"'

        return match.group(0)

    json_like = re.sub(pattern, replacement, input_text)
    return json_like
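
To make the alternation easier to follow, here is a small sketch that reports which branch of the pattern fires for a given line. The helper `classify` and the `KINDS` mapping are hypothetical names added for illustration. The sketch relies on `match.lastindex`, which for this pattern is the index of the value group that matched.

```python
import re

PATTERN = re.compile(r'([\w\d]+)\s*=\s*(?:(\d+\.?\d*)|(\[)|(\{)|(\w+))')

# Group index -> value kind (illustrative labels, not part of the answer)
KINDS = {2: "number", 3: "list", 4: "dict", 5: "word"}

def classify(line):
    """Return (key, kind) for the first assignment in line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Groups 2-5 are alternatives, so lastindex names the one that matched.
    return match.group(1), KINDS[match.lastindex]

print(classify("attributeFloat = 1.0,"))
print(classify("attributeNumericName = V1D,"))
print(classify("canContainLists = ["))
```

One caveat follows from the order of the alternatives: a value that starts with a digit but is not a number, such as `1abc`, is taken by the number branch, so only the leading digits are captured. The sample data contains no such value, but it may be worth checking for in real files.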

I tested it like this:

def test_conversion():
    test_input = """{
  base = {
    attributeNumericName = V1D,
    attributeName = hello,
    attributeFloat = 1.0,
    canContainLists = [
      1.0,
      2.0
    ],
    canContainStackedData = [
      {
        stackedAttribute = 1.0
      }
    ]
  }
}"""

    converted = convert_to_json(test_input)
    print("Converted format:")
    print(converted)
    
    # Verify it's valid JSON
    try:
        json.loads(converted)
        print("\nSuccessfully parsed as valid JSON!")
    except json.JSONDecodeError as e:
        print(f"\nError: Not valid JSON: {e}")

if __name__ == "__main__":
    test_conversion()

This produced the following output:

Converted format:
{
  "base": {
    "attributeNumericName": "V1D",
    "attributeName": "hello",
    "attributeFloat": 1.0,
    "canContainLists": [
      1.0,
      2.0
    ],
    "canContainStackedData": [
      {
        "stackedAttribute": 1.0
      }
    ]
  }
}

Successfully parsed as valid JSON!
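
Since the question mentions several terabytes of input, it may also help to avoid loading whole files into memory. The sketch below applies the same pattern line by line with a precompiled regex. It assumes that every `key = value` pair sits on a single line, as in the sample. The name `convert_stream` and the file names in the comment are hypothetical.

```python
import io
import re

PATTERN = re.compile(r'([\w\d]+)\s*=\s*(?:(\d+\.?\d*)|(\[)|(\{)|(\w+))')

def _replacement(match):
    key, number, list_start, dict_start, word = match.groups()
    if number:
        return f'"{key}": {number}'
    if list_start:
        return f'"{key}": ['
    if dict_start:
        return f'"{key}": {{'
    return f'"{key}": "{word}"'

def convert_stream(src, dst):
    """Convert src to dst one line at a time (assumes one assignment per line)."""
    for line in src:
        dst.write(PATTERN.sub(_replacement, line))

# Usage with in-memory streams; real code would pass file objects instead,
# e.g. open("legacy.txt") and open("converted.json", "w") (hypothetical names).
source = io.StringIO("{\n  base = {\n    attributeFloat = 1.0\n  }\n}")
target = io.StringIO()
convert_stream(source, target)
print(target.getvalue())
```

Whether this is faster than converting a whole file at once depends on the data and the storage, so it is worth measuring on a representative file.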
