Clustering a list of letter strings according to 2 (and ideally generalizing to n) arbitrary grouping rules?

Question. Votes: 1. Answers: 1.

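I want to sort a set of variable-length strings (of letters) into n groups, according to whether each string contains any/all/none of the letters in n given sets.

For example, here I am trying to sort all combinations of the letters 'A,B,P,Q,X' into two groups by the following rules: group1 must contain all/any of 'A,P' (but none of 'B,Q'), and group2 must contain all/any of 'B,Q' (but none of 'A,P'). My end goal is a list in which the groups are as isolated as possible (e.g. at the start and the end), with the strings that contain no group members in the middle, and the strings that mix members of both groups between the middle and the extremes. Ideally the order would be: all-1/none-2, some-1/none-2, all-1/some-2, none-1-2/some-1-2, all-2/some-1, some-2/none-1, all-2/none-1.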

from itertools import chain

labels_powerset = ['A','B','P','Q','X',
    'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
    'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
    'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']

# Bucket the labels by length: labels_for_order[k] holds the labels of length k+1.
labels_for_order = []

for length in range(1,len(max(labels_powerset,key=len))+1):
    order = [label for label in labels_powerset if len(label)==length]
    labels_for_order.append(order)

group1 = ['A','P']
group2 = ['B','Q']

# Each list below holds one sub-list per label length; empty sub-lists are dropped.

# All of group1, none of group2.
all1 = [y for y in[[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

# Some but not all of group1, none of group2.
any1 = [y for y in[[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

# All of group2, none of group1.
all2 = [y for y in[[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

# Some but not all of group2, none of group1.
any2 = [y for y in[[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

# No letters from either group.
none = [y for y in[[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

# At least one letter from each group.
both = [y for y in[[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
        for order in labels_for_order]if y]

# Split the mixed labels into two halves to place on either side of the middle.
both1 = both[:len(both)//2]

both2 = both[len(both)//2:]

# Flatten the per-length sub-lists into one flat list of labels.
sorted_labels = list(chain.from_iterable(any1+all1+both1+none+both2+all2+any2))

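The goal is to make the list as symmetric as possible with respect to both group membership and string length.

I am new to coding and have read about k-means, but I can't figure out how to apply it to strings of letters.

How can I do this more efficiently, and in a way that generalizes to n groups/rules?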

python-3.x cluster-analysis pattern-recognition
1 Answer

Votes: 1

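K-means is meant for multivariate continuous data, and clustering does not try to produce balanced groups.

You should consider sorting instead.

Define a score function. For example, give +1 for each "good" letter and -1 for each "bad" letter, plus a bonus of ±100 if the string is pure.

Then sort the words by this score.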
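A minimal sketch of that idea for the two-group case, under a few assumptions that are not in the answer itself: the function name `score` and its `bonus` parameter are made up for illustration, the group1 letters are treated as "good" and the group2 letters as "bad", and "pure" is read as "contains letters from only one of the two groups" (neutral letters such as X are allowed).

```python
def score(label, good, bad, bonus=100):
    """Hypothetical score: +1 per letter in `good`, -1 per letter in `bad`,
    and +/- `bonus` when the label touches only one of the two groups."""
    n_good = sum(ch in good for ch in label)
    n_bad = sum(ch in bad for ch in label)
    total = n_good - n_bad
    if n_good and not n_bad:      # pure group1 label (assumed meaning of "pure")
        total += bonus
    elif n_bad and not n_good:    # pure group2 label
        total -= bonus
    return total


group1 = {'A', 'P'}   # "good" letters
group2 = {'B', 'Q'}   # "bad" letters
labels = ['A', 'B', 'X', 'AB', 'AP', 'BQ', 'APX', 'ABP', 'BQX', 'ABQ']

# Highest score first; ties keep their input order because sorted() is stable.
sorted_labels = sorted(labels, key=lambda s: score(s, group1, group2), reverse=True)
print(sorted_labels)
# ['AP', 'APX', 'A', 'ABP', 'X', 'AB', 'ABQ', 'B', 'BQ', 'BQX']
```

With this scoring, pure group1 strings land at one end, pure group2 strings at the other, and mixed or neutral strings fall in the middle. String length could be folded into the key as a secondary tie-breaker, for example by sorting on a `(score, len(label))` tuple.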
