How can I remove duplicates and unify values that are very close to each other in a Python list?


I have Python lists like the following:

x1 = ['lock-service',
 'jenkins-service',
 'xyz-reporting-service',
 'ansible-service',
 'harbor-service',
 'version-service',
 'jira-service',
 'kubernetes-service',
 'capo-service',
 'permission-service',
 'artifactory-service',
 'vault-service',
 'harbor-service-prod',
 'rundeck-service',
 'cruise-control-service',
 'artifactory-service.xyz.abc.cloud',
 'helm-service',
 'Capo Service',
 'rocket-chat-service',
 'reporting-service',
 'bitbucket-service',
 'rocketchat-service']

x2 = ['journal-service',
 'lock-service',
 'jenkins-service',
 'xyz-reporting-service',
 'ansible-service',
 'harbor-service',
 'version-service',
 'jira-service',
 'kubernetes-service',
 'capo-service',
 'permission-service',
 'artifactory-service',
 'vault-service',
 'rundeck-service',
 'cruise-control-service',
 'helm-service',
 'database-ticket-service',
 'rocket-chat-service',
 'ansible-dpservice',
 'reporting-service',
 'bitbucket-service',
 'rocketchat-service']

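As you can see, both lists contain duplicate values in slightly different forms, for example:

In list 1:

  • 'xyz-reporting-service' and 'reporting-service'
  • 'harbor-service' and 'harbor-service-prod'
  • 'capo-service' and 'Capo Service'
  • 'artifactory-service' and 'artifactory-service.xyz.abc.cloud'
  • 'rocket-chat-service' and 'rocketchat-service'

In list 2:

  • 'xyz-reporting-service' and 'reporting-service'
  • 'rocket-chat-service' and 'rocketchat-service'
  • 'ansible-service' and 'ansible-dpservice'

I need a general solution, one that works beyond these sample lists, that:

  • removes duplicates like the ones shown above
  • unifies the values in the list into the name-service form

How can I do this in Python 3.11?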

python pandas regex list duplicates
1 Answer

Adapted from this post, using thefuzz to score every pair of names:

!pip install thefuzz

x1 = ['lock-service',
 'jenkins-service',
 'xyz-reporting-service',
 'ansible-service',
 'harbor-service',
 'version-service',
 'jira-service',
 'kubernetes-service',
 'capo-service',
 'permission-service',
 'artifactory-service',
 'vault-service',
 'harbor-service-prod',
 'rundeck-service',
 'cruise-control-service',
 'artifactory-service.xyz.abc.cloud',
 'helm-service',
 'Capo Service',
 'rocket-chat-service',
 'reporting-service',
 'bitbucket-service',
 'rocketchat-service']

from itertools import combinations
from thefuzz import fuzz

# Score every unordered pair of names and keep those with a partial ratio above 90
[(ratio, a, b) for a, b in combinations(x1, 2) if (ratio := fuzz.partial_ratio(a, b)) > 90]

Output:

[(91, 'lock-service', 'rundeck-service'),
 (100, 'xyz-reporting-service', 'reporting-service'),
 (100, 'harbor-service', 'harbor-service-prod'),
 (100, 'artifactory-service', 'artifactory-service.xyz.abc.cloud'),
 (94, 'rocket-chat-service', 'rocketchat-service')]
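
The output above only lists candidate pairs. It also misses 'capo-service' / 'Capo Service' and flags 'lock-service' / 'rundeck-service', which are different services. One possible way to go from pairs to a deduplicated list is sketched below. It is a rough sketch under assumptions, not a definitive solution. The helper names `normalize`, `similar` and `dedupe` are hypothetical, the 0.9 threshold is a guess that should be tuned on real data, and the standard library's `difflib.SequenceMatcher` stands in for thefuzz so the snippet runs without extra packages.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and turn runs of spaces/underscores into one hyphen."""
    return re.sub(r"[\s_]+", "-", name.strip().lower())


def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True if one name contains the other or the two are nearly identical."""
    if a in b or b in a:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


def dedupe(names, threshold: float = 0.9):
    """Greedy pass: keep a name unless it resembles one that was already kept.

    Names are visited shortest first, so the shortest variant of each
    group is the one that survives.
    """
    kept = []
    unique = {normalize(n) for n in names}
    for name in sorted(unique, key=lambda s: (len(s), s)):
        if not any(similar(name, k, threshold) for k in kept):
            kept.append(name)
    return sorted(kept)


sample = ['capo-service', 'Capo Service', 'harbor-service',
          'harbor-service-prod', 'rocket-chat-service', 'rocketchat-service',
          'xyz-reporting-service', 'reporting-service',
          'lock-service', 'rundeck-service']
print(dedupe(sample))
# ['capo-service', 'harbor-service', 'lock-service',
#  'reporting-service', 'rocketchat-service', 'rundeck-service']
```

Two design choices are worth checking against your own data. The containment test (`a in b`) merges 'harbor-service-prod' into 'harbor-service', but it would also merge any name that happens to contain another, so it may be too aggressive for some lists. Keeping the shortest variant means 'rocketchat-service' survives rather than 'rocket-chat-service'; if you prefer a different canonical form, change the sort key.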