文本中存在一些关键字及其出现的开始/结束位置。关键字可能部分重叠,例如“某事”->“某事”/“某事”/“某事”:
keywords_occurences = {
"key_1": [(11, 59)],
"key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
"key_3": [(24, 56), (1208, 1240)],
...
}
我需要为每个关键字选择一个位置,这样它们就不会重叠,对于这种情况,解决方案是:
key_1: 11-59
key_2: 301-323 (or 1673-1695, it does not matter)
key_3: 1208-1240
如果无法做到这一点,请选择最大数量的唯一非重叠键。
看起来像是一种“精确命中集”问题,但我找不到算法描述。
我认为下面的代码可以满足您的需求。
#!/usr/bin/env python
# keyword occurrences -> [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
kw_all_occ = {"key_1" : [(11, 59)],
"key_2" : [(24, 56), (301, 333), (1208, 1240), (1673, 1705)],
"key_3" : [(24, 46), (1208, 1230)]}
def non_overlapping_occ(occ):
# dictionary with all keyword occurrences
all_occ = dict({})
all_occ.update(occ)
# list with the first non overlapping occurrences of every keyword ->
# [('key_1', (start_1, end_1)),('key_2', (start_2, end_2)),...]
result = []
# Sort keywords by length -> [(22, 'key_3'), (32, 'key_2'), (48, 'key_1')]
kw_lengths = []
for k, v in all_occ.iteritems():
kw_lengths.append((v[0][1] - v[0][0], k))
kw_lengths.sort()
while len(kw_lengths):
# Current longest keyword
longest_keyword = kw_lengths.pop(-1)[1]
try:
result.append((longest_keyword, all_occ[longest_keyword][0]))
# Remove occurrences overlapping any occurrence of the current
# longest_keyword value
for item in all_occ[longest_keyword]:
start = item[0]
end = item[1]
for l, k in kw_lengths:
v = all_occ[k]
all_occ[k] = filter(lambda x: (x[0] > end) | (x[1] < start), v)
except IndexError:
result.append((longest_keyword, ()))
return result
print non_overlapping_occ(kw_all_occ)
它产生以下输出:
vicent@deckard:~$ python prova.py
[('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
请注意,我没有在代码中使用集合。您的问题只是表明集合可以帮助解决问题,所以我理解使用集合对您来说不是强制性的。
另请注意,该代码尚未经过深入测试,但似乎工作得很好(它还正确处理了问题中提供的关键字出现情况。事实上,这些出现情况可以用一些更简单但不太通用的代码来解决)。
如果您定义了重叠的含义,则可以使用该函数来过滤掉您不想看到的笛卡尔积。
>>> keywords_occurences = {
... "key_1": [(11, 59)],
... "key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
... "key_3": [(24, 56), (1208, 1240)],}
...
>>> def overlapping(x):
... for i in range(len(x)):
... for j in range(i+1,len(x)):
... a,b=sorted((x[i],x[j]))
... if b[0]<=a[1]:
... return True
...
>>> from itertools import product, filterfalse
>>> for p in filterfalse(overlapping,product(*keywords_occurences.values())):
... print(dict(zip(keywords_occurences, p)))
...
{'key_1': (11, 59), 'key_2': (301, 323), 'key_3': (1208, 1240)}
{'key_1': (11, 59), 'key_2': (1673, 1695), 'key_3': (1208, 1240)}