Is there a pandas idiom for reading a CSV file of categorical data with spelling variants?


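I have a CSV file with several categorical columns, but because of data-entry errors most of those columns contain messy data (for example, the "margins" column has "spciulated", "SPICULATED", and so on for the category "spiculated"). Is there a standard way to handle this?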

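To be precise, I would like to read the CSV file directly into a clean DataFrame in which the categorical columns have dtype category and all variants are collapsed into a single category (for example, every variant of "spiculated" is read as "spiculated"). The spelling variants could be supplied as a dictionary, for instance.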

Expected solution:

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins

where test.csv is:

margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed

should give:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']

However, without the spelling-variant information, I get:

0       spiculated
1      spiiculated
2       SPICULATED
3    circumscribed
4     cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']

My current solution looks like this:

df2 = pd.read_csv('test.csv')

for feat, feat_vals in FEAT_VALS.items():
    for enc_val, str_vals in feat_vals.items():
        df2.loc[df2[feat].isin(str_vals), feat] = enc_val

df2.margins = df2.margins.astype('category')

Tags: pandas, csv, idioms

1 answer

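Your problem breaks down into three smaller parts:

  1. Make the strings comparable (for example, strip accents and use the same case).
    See: What is the best way to remove accents (normalize) in a Python unicode string?

  2. Choose a distance algorithm.
    See: Looking for python library which can perform levenshtein/other edit distance at word-level

  3. Create a pandas DataFrame from the "modified" input.
    See: Create a pandas DataFrame from generator?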
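
Parts 1 and 2 can be tried in isolation before any CSV is involved. The following is a minimal sketch that uses only the standard library: it swaps in difflib's similarity ratio in place of a true Levenshtein distance, and the helper name closest_category and the cutoff of 0.8 are assumptions made for illustration, not part of the question. Unlike a plain "nearest category wins" rule, it returns None when nothing is similar enough, so unrelated values are not silently forced into a category.

```python
import difflib
import unicodedata

CATEGORIES = ["spiculated", "circumscribed"]


def normalize_string(s):
    """Strip combining accents and lowercase so that variants become comparable."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()


def closest_category(value, categories=CATEGORIES, cutoff=0.8):
    """Return the most similar category, or None if none reaches the cutoff.

    Hypothetical helper: the similarity measure is difflib's ratio,
    not an edit distance, and the cutoff is an arbitrary starting point.
    """
    by_normalized = {normalize_string(c): c for c in categories}
    matches = difflib.get_close_matches(
        normalize_string(value), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[matches[0]] if matches else None


for raw in ["spiiculated", "SPICULATED", "cicumscribed", "lobulated"]:
    print(raw, "->", closest_category(raw))
```

The cutoff would need tuning against real data: too low and distinct categories merge, too high and genuine typos are left unmatched.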


Here is a sample implementation that uses edit distance for the comparison (Levenshtein is a third-party package):

import csv
import unicodedata
import Levenshtein
import pandas as pd

def normalize_string(s):
    # Decompose accented characters, drop the combining marks, and lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn').lower()

def csv_parse(filename, categories = ["spiculated", "circumscribed"]):

    normalized_categories = [normalize_string(s) for s in categories]

    with open(filename, newline='') as f:

        reader = csv.DictReader(f)

        for row in reader:

            normalized_margins = normalize_string(row["margins"])
            min_distance = float('inf')

            for i,normalized_category in enumerate(normalized_categories):

                distance = Levenshtein.distance(normalized_margins, normalized_category)
                if distance < min_distance:
                    min_distance = distance
                    row["margins"] = categories[i]

            yield row

df = pd.DataFrame(data=csv_parse("test.csv"), dtype='category')

print(df.margins)

Output:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
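
If the variants are already known and listed in a dictionary, as in the question, fuzzy matching may not be needed at all. The sketch below passes a lookup function per column through the converters parameter of read_csv and casts to category afterwards. The helpers make_converter and read_clean_csv are hypothetical names introduced here, and the in-memory io.StringIO sample stands in for test.csv.

```python
import io

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}


def make_converter(canonical_to_variants):
    """Build a per-cell function mapping each known variant to its canonical value."""
    lookup = {
        variant: canonical
        for canonical, variants in canonical_to_variants.items()
        for variant in variants
    }
    # Unknown strings pass through unchanged so they stay visible for review.
    return lambda cell: lookup.get(cell, cell)


def read_clean_csv(source, feat_vals=FEAT_VALS):
    """Read a CSV, collapsing known variants, then cast those columns to category."""
    converters = {col: make_converter(vals) for col, vals in feat_vals.items()}
    df = pd.read_csv(source, converters=converters)
    # The cast happens after reading, once the variants have been collapsed.
    return df.astype({col: "category" for col in feat_vals})


# In-memory stand-in for test.csv from the question.
sample = io.StringIO(
    "margins\nspiculated\nspiiculated\nSPICULATED\ncircumscribed\ncicumscribed\n"
)
df = read_clean_csv(sample)
print(df.margins)
```

The cast is done as a separate step rather than with dtype='category' in the same call because a converter and a dtype for the same column are not meant to be combined. Values missing from the dictionary survive as extra categories, which makes new typos easy to spot with df.margins.cat.categories.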