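I have a csv file with multiple categorical columns, but most of these columns contain messy data due to typos (e.g., "spciulated", "SPICULATED", etc. for the category "spiculated" of the column "margins"). Is there a standard way to handle this kind of situation?
To be precise, I would like to read the csv file directly into a clean DataFrame in which the categorical columns have dtype category, but with all variants collapsed into a single category (e.g., every variant of "spiculated" is read as "spiculated"). The spelling variants could be given by a dictionary, for example.
Expected solution: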
import pandas as pd
FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}
# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins
where test.csv is:
margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed
to obtain:
0 spiculated
1 spiculated
2 spiculated
3 circumscribed
4 circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
However, without the spelling-variant information, I get:
0 spiculated
1 spiiculated
2 SPICULATED
3 circumscribed
4 cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']
My current solution looks like this:
df2 = pd.read_csv('test.csv')
for feat, feat_vals in FEAT_VALS.items():
    for enc_val, str_vals in feat_vals.items():
        df2.loc[df2[feat].isin(str_vals), feat] = enc_val
df2.margins = df2.margins.astype('category')
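Your problem consists of three smaller parts:
Make the strings comparable (e.g., without accents and in the same case).
Ref.: What is the best way to remove accents (normalize) in a Python unicode string?
Choose a distance algorithm.
Ref.: Looking for python library which can perform levenshtein/other edit distance at word-level
Create a Pandas DataFrame from the "modified" input.
Ref.: Create a pandas DataFrame from generator?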
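The first two parts can be tried out with the standard library alone before pulling in a third-party distance package. The following is only a sketch under stated assumptions: it swaps in difflib.get_close_matches (a similarity ratio, not a Levenshtein distance), and the helper name closest_category and the 0.8 cutoff are illustrative choices, not part of the question.

```python
import difflib
import unicodedata

# Canonical categories; assumed to be already lowercase and accent-free.
CATEGORIES = ["spiculated", "circumscribed"]


def normalize_string(s):
    """Strip combining accents and lowercase the string."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()


def closest_category(value, categories=CATEGORIES, cutoff=0.8):
    """Return the most similar category, or None if nothing clears the cutoff.

    Hypothetical helper: difflib's similarity ratio stands in for an edit
    distance here, and the cutoff value is an arbitrary illustrative choice.
    """
    matches = difflib.get_close_matches(
        normalize_string(value), categories, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


for raw in ["spiculated", "spiiculated", "SPICULATED", "cicumscribed", "lobulated"]:
    print(raw, "->", closest_category(raw))
```

Returning None for values that resemble no category (such as "lobulated" above) keeps genuinely unknown entries from being silently forced into the nearest category.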
Here is an example implementation that uses the edit distance for the comparison:
import csv
import unicodedata

import Levenshtein
import pandas as pd


def normalize_string(s):
    # Decompose accented characters, drop the combining marks, then lowercase.
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn').lower()


def csv_parse(filename, categories=("spiculated", "circumscribed")):
    normalized_categories = [normalize_string(s) for s in categories]
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            normalized_margins = normalize_string(row["margins"])
            # Replace the raw value with the category at the smallest edit distance.
            min_distance = float('inf')
            for i, normalized_category in enumerate(normalized_categories):
                distance = Levenshtein.distance(normalized_margins, normalized_category)
                if distance < min_distance:
                    min_distance = distance
                    row["margins"] = categories[i]
            yield row


# The generator of row dicts is passed to the DataFrame constructor directly.
df = pd.DataFrame(data=csv_parse("test.csv"), dtype='category')
print(df.margins)
Output:
0 spiculated
1 spiculated
2 spiculated
3 circumscribed
4 circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']
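If the spelling variants are already known, as in the FEAT_VALS dictionary from the question, a distance measure may not be needed at all. One possible sketch, assuming the variants are listed exhaustively, applies the dictionary while reading through the converters parameter of pd.read_csv and casts to category afterwards. The helpers make_converter and read_clean_csv are hypothetical names introduced for illustration, and an in-memory string stands in for test.csv.

```python
import io

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}


def make_converter(variants_by_category):
    """Build a function mapping each known variant to its canonical category."""
    lookup = {
        variant: category
        for category, variants in variants_by_category.items()
        for variant in variants
    }
    # Unknown spellings pass through unchanged so they remain visible later.
    return lambda value: lookup.get(value, value)


def read_clean_csv(source, feat_vals=FEAT_VALS):
    """Read a csv, collapsing known variants, then cast those columns to category."""
    converters = {col: make_converter(v) for col, v in feat_vals.items()}
    df = pd.read_csv(source, converters=converters)
    for col in feat_vals:
        df[col] = df[col].astype("category")
    return df


# In-memory stand-in for test.csv (same rows as in the question).
csv_text = "margins\nspiculated\nspiiculated\nSPICULATED\ncircumscribed\ncicumscribed\n"
df = read_clean_csv(io.StringIO(csv_text))
print(df.margins)
```

The cast to category happens after reading because the converter takes over parsing of that column. Any spelling missing from the dictionary would show up as an extra category, which makes gaps in the variant list easy to spot.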