我想在一组字符串中找到最长的常见后缀,以便在我的自然语言处理项目中检测一些潜在的重要语素。
给定频率K>=2
,找到字符串列表中最常见的K最长后缀S1,S2,S3...SN
为了简化问题,这里有一些例子:
输入1:
K=2
S=["fireman","woman","businessman","policeman","businesswoman"]
输出1:
["man","eman","woman"]
Explanation1:
“男人”发生4次,“eman”发生2次,“女人”发生2次
可以理解的是,输出还跟踪每个公共后缀的频率
{"man":4,"eman":2,"woman":2}
不能保证每个单词至少有一个长度的公共后缀,请参见下面的示例。
输入2:
K=2
S=["fireman","woman","businessman","policeman","businesswoman","apple","pineapple","people"]
输出2:
["man","eman","woman","ple","apple"]
说明2:
“男人”发生4次,“eman”发生2次,“女人”发生2次
“ple”发生3次,“apple”发生2次
任何有效的算法来解决这个问题?
我已经提到了后缀树和广义后缀树算法,但它似乎不适合这个问题。
(顺便说一下,我正在研究一个中国的NLP项目,这意味着有比英语更多的人物)
如果你想要真正有效this paper可以帮助。该算法为您提供O(n log n)中一组字符串中最频繁(最长)的子串。基本思路如下:
来自the paper for 3字符串的示例
s1 = bbabab
s2 = abacac
sz = bbaaa