Background:
I have a list of roughly 80k words that may contain spelling errors
(e.g., "apple" vs "applee" vs " apple" vs " aplee ").
My plan is to build a dataframe grid by taking two words at a time and then apply a fuzzy-scoring function to compare their similarity. Before that I apply standard text cleaning, such as trimming, removing special characters, and collapsing double spaces, and then take the unique list to check for similarity.
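For reference, the cleaning step could look something like this (a minimal sketch; the exact rules and the raw_words input are assumptions, so adapt them to your data):

import re

raw_words = ["apple", "applee", " apple", " aplee "]

def clean_word(word):
    # Trim, lowercase, collapse repeated whitespace, drop special characters.
    word = word.strip().lower()
    word = re.sub(r"\s+", " ", word)
    return re.sub(r"[^a-z0-9 ]", "", word)

# Deduplicate after cleaning so " apple" and "apple" count only once.
my_unique_list = sorted({clean_word(w) for w in raw_words if clean_word(w)})
print(my_unique_list)  # ['aplee', 'apple', 'applee']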
Approach:
I am using the itertools.combinations function to create the dataframe grid.
# Sample Python code
# Step 1: build the grid of all pairwise combinations
import itertools
import pandas as pd

my_unique_list = ['apple', 'applee', 'aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list, 2), columns=['name1', 'name2'])
print(data_grid)

    name1   name2
0   apple  applee
1   apple   aplee
2  applee   aplee
I defined a function that computes the fuzzy scores:

from fuzzywuzzy import fuzz

def fuzzy_score_func(row):
    # Return both the partial-ratio and the plain ratio for one pair.
    partial_ratio = fuzz.partial_ratio(row['name1'], row['name2'])
    ratio = fuzz.ratio(row['name1'], row['name2'])
    return partial_ratio, ratio
and used apply to get the final scores:

# Step 2: score every pair
data_grid[['partial_ratio', 'ratio']] = data_grid.apply(fuzzy_score_func, axis=1, result_type='expand')
print(data_grid)

    name1   name2  partial_ratio  ratio
0   apple  applee            100     91
1   apple   aplee             80     80
2  applee   aplee             80     91
This approach works well when the list is around 8k entries, where checking all combinations gives a dataframe of about 25 million rows.
But when I scale the list up to 80k, I hit a MemoryError in Step 1 while initializing the dataframe with all possible combinations. That makes sense, given the dataframe would have roughly 3.2 billion rows (80,000 choose 2):
File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:738, in DataFrame.__init__(self, data, index, columns, dtype, copy)
736 data = np.asarray(data)
737 else:
--> 738 data = list(data)
739 if len(data) > 0:
740 if is_dataclass(data[0]):
MemoryError:
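You can put a number on the blow-up before allocating anything, using only the standard library:

import math

n_pairs = math.comb(80_000, 2)
print(f"{n_pairs:,}")  # 3,199,960,000 unordered pairs
# Even at a (rough, assumed) ~50 bytes per row, that is >150 GB
# before any scoring happens, so the full grid cannot be materialized.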
Any suggestions on how to work around this memory issue, or is there a better way to approach my problem statement? I have tried exploring multiprocessing, nested loops, etc., without much success.
I am on an Intel Windows laptop:
Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor
I might start with this code, which drops pandas and uses just itertools: pairs are generated lazily and acceptable matches are written straight to a CSV, so nothing large is ever held in memory.
import csv
import itertools

import fuzzywuzzy.fuzz

MIN_RATIO = 90

## ----------------------
## the result of cleaning and filtering your input data...
## ----------------------
my_unique_list = ['apple', 'applee', 'aplee']
## ----------------------

## ----------------------
## Create a result file of acceptably close matches. Pairs are
## generated lazily and written out immediately, so memory use
## stays flat no matter how long the input list is.
## ----------------------
with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
    for index, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2)):
        if index % 1000 == 0:
            print(f"combinations processed: {index}", end="\r", flush=True)
        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
        ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            writer.writerow([word1, word2, partial_ratio, ratio])
print()
print(f"Total combinations processed: {index + 1}")
## ----------------------
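If you would rather keep the pandas scoring step, a middle ground (a sketch, assuming chunked output is acceptable) is to pull the combinations through itertools.islice in fixed-size batches, so only one modest dataframe exists at a time:

import itertools

import pandas as pd
from fuzzywuzzy import fuzz

CHUNK_SIZE = 1_000_000  # tune to your RAM
MIN_RATIO = 90

my_unique_list = ['apple', 'applee', 'aplee']
pairs = itertools.combinations(my_unique_list, 2)

# Write the header once, then append each scored chunk.
pd.DataFrame(columns=['name1', 'name2', 'partial_ratio', 'ratio']).to_csv("good_matches.csv", index=False)
while True:
    chunk = list(itertools.islice(pairs, CHUNK_SIZE))
    if not chunk:
        break
    df = pd.DataFrame(chunk, columns=['name1', 'name2'])
    df['partial_ratio'] = [fuzz.partial_ratio(a, b) for a, b in chunk]
    df['ratio'] = [fuzz.ratio(a, b) for a, b in chunk]
    matches = df[df[['partial_ratio', 'ratio']].max(axis=1) >= MIN_RATIO]
    matches.to_csv("good_matches.csv", mode="a", header=False, index=False)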
While I am not a multiprocessing expert, the following might also help. You may want to test it on a smaller subset first:
import csv
import functools
import itertools
import multiprocessing

import fuzzywuzzy.fuzz

MIN_RATIO = 90

def get_ratios(pair, queue):
    # Score one pair; enqueue it only if it is an acceptably close match.
    partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
    ratio = fuzzywuzzy.fuzz.ratio(*pair)
    if max(partial_ratio, ratio) >= MIN_RATIO:
        queue.put(list(pair) + [partial_ratio, ratio])

def main(my_unique_list):
    with multiprocessing.Manager() as manager:
        queue = manager.Queue()
        with multiprocessing.Pool(processes=8) as pool:
            # NOTE: pool.map() consumes the whole iterable up front, so at
            # 80k words this still builds a huge in-memory list; see the
            # streaming variant after this snippet.
            _ = pool.map(
                functools.partial(get_ratios, queue=queue),
                itertools.combinations(my_unique_list, 2),
                chunksize=1000,
            )
            pool.close()
            pool.join()
        # Drain the queue into the result file once all workers are done.
        with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
            writer = csv.writer(file_out)
            writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
            while not queue.empty():
                item = queue.get()
                writer.writerow(item)
                print(item)

if __name__ == "__main__":
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple', 'applee', 'aplee']
    ## ----------------------
    main(my_unique_list)
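As a variation (a sketch, not tested at full scale): workers can return the finished row, or None for a non-match, and the parent can write matches as they stream back through imap_unordered. This avoids both the Manager-queue IPC overhead and pool.map materializing all the pairs:

import csv
import itertools
import multiprocessing

import fuzzywuzzy.fuzz

MIN_RATIO = 90

def score_pair(pair):
    # Return a finished CSV row for a match, or None for a non-match.
    partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
    ratio = fuzzywuzzy.fuzz.ratio(*pair)
    if max(partial_ratio, ratio) >= MIN_RATIO:
        return [*pair, partial_ratio, ratio]
    return None

def main(my_unique_list):
    pairs = itertools.combinations(my_unique_list, 2)
    with multiprocessing.Pool(processes=8) as pool:
        with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
            writer = csv.writer(file_out)
            writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
            # imap_unordered pulls pairs lazily and yields results as they finish.
            for row in pool.imap_unordered(score_pair, pairs, chunksize=1000):
                if row is not None:
                    writer.writerow(row)

if __name__ == "__main__":
    my_unique_list = ['apple', 'applee', 'aplee']
    main(my_unique_list)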