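I want to sort a set of variable-length strings (of letters) into n groups, based on whether each string contains any/all/none of the letters in n given sets.
For example, here I am trying to sort all combinations of the letters 'A, B, P, Q, X' into two groups according to the following rules: group1 must include all/any of 'A, P' (but not 'B, Q'), and group2 must include all/any of 'B, Q' (but not 'A, P'). My end goal is a list in which the groups are isolated as far as possible (e.g. at the start and the end), with the strings that contain no member of either group in the middle, and the strings that contain members of both groups between the middle and the extremes. Ideally, the order would be: all-1 / none-2, some-1 / none-2, all-1 / some-2, none-1-2 / some-1-2, all-2 / some-1, some-2 / none-1, all-2 / none-1.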
labels_powerset = ['A','B','P','Q','X',
'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']
labels_for_order = []
for length in range(1, len(max(labels_powerset, key=len)) + 1):
    order = [label for label in labels_powerset if len(label) == length]
    labels_for_order.append(order)
group1 = ['A','P']
group2 = ['B','Q']
all1 = [y for y in [[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
any1 = [y for y in [[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
all2 = [y for y in [[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
                    for order in labels_for_order] if y]
any2 = [y for y in [[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
                    for order in labels_for_order] if y]
none = [y for y in [[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
both = [y for y in [[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
                    for order in labels_for_order] if y]
both1 = [both[x] for x in range(0,int(len(both)/2))]
both2 = [both[x] for x in range(int(len(both)/2),len(both))]
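# flatten() is not defined in this snippet, so flatten one level of nesting explicitly
sorted_labels = [label for bucket in any1 + all1 + both1 + none + both2 + all2 + any2 for label in bucket]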
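The goal is to make the list as symmetric as possible in terms of both group membership and string length.
I am very new to coding and have read up on k-means, but I can't figure out how to apply it to strings of letters.
How can I do this more efficiently, and in a way that generalizes to n groups/rules?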
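K-means is meant for multivariate continuous data, and clustering does not try to produce balanced groups.
You should consider using sorting instead.
Define a score function. For example, give +1 for each "good" letter, -1 for each "bad" letter, and a bonus of ±100 if the word is pure.
Then sort the words by this score.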
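A minimal sketch of that idea, under some assumptions of mine: the "good" letters are taken to be group1 ('A', 'P'), the "bad" letters group2 ('B', 'Q'), and "pure" is read as "contains letters from one group and none from the other". The names `score`, `GROUP1`, `GROUP2` and the shortened `labels` list are only illustrative.

```python
GROUP1 = set("AP")  # assumed "good" letters (group1 from the question)
GROUP2 = set("BQ")  # assumed "bad" letters (group2 from the question)


def score(label, good=GROUP1, bad=GROUP2, bonus=100):
    """+1 per good letter, -1 per bad letter, +/-bonus if the label is pure."""
    n_good = sum(ch in good for ch in label)
    n_bad = sum(ch in bad for ch in label)
    total = n_good - n_bad
    if n_good and not n_bad:
        total += bonus  # only group1 letters (neutral letters such as 'X' allowed)
    elif n_bad and not n_good:
        total -= bonus  # only group2 letters
    return total


# Shortened sample of the labels from the question
labels = ['A', 'B', 'P', 'Q', 'X', 'AB', 'AP', 'BQ', 'APX', 'ABPQ']

# Highest score first: pure group1 labels lead, pure group2 labels trail,
# neutral and mixed labels end up in the middle.
ranked = sorted(labels, key=score, reverse=True)
print(ranked)
```

Because `sorted` is stable, labels with equal scores keep their input order. If string length should also break ties, the key could return a tuple such as `(score(label), len(label))`. For more than two groups, one possible extension is a key that returns one score per group, but that depends on how the groups should be ranked against each other.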