Converting the sentence to a list of words and then finding the index of the root string should do the job:
sentence = "lack of association between the promoter polymorphism of the mtnr1a gene and adolescent idiopathic scoliosis"
root = "mtnr1a"
try:
    words = sentence.split()
    n = words.index(root)
    cutoff = ' '.join(words[n-4:n+5])
except ValueError:
    cutoff = None
print(cutoff)
Result:
promoter polymorphism of the mtnr1a gene and adolescent idiopathic
How can I use this on a pandas DataFrame?
I tried:
sentence = data['sentence']
root = data['rootword']
def cutOff(sentence,root):
    try:
        words = sentence.str.split()
        n = words.index(root)
        cutoff = ' '.join(words[n-4:n+5])
    except ValueError:
        cutoff = None
    return cutoff
data.apply(cutOff(sentence,root),axis=1)
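But it doesn't work...
Edit:
How can I cut the sentence 4 words after the root word when the root word is in the first position of the sentence, and likewise handle the case where the root word is in the last position? For example: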
sentence = "mtnr1a lack of association between the promoter polymorphism of the gene and adolescent idiopathic scoliosis"
Desired output if the root is in the first position:
"mtnr1a lack of association between"
Desired output if the root is in the last position:
"lack of association between the promoter polymorphism of the gene and adolescent idiopathic scoliosis"
"adolescent idiopathic scoliosis mtnr1a"
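Two small tweaks to your code should solve your problem:
First, calling apply() on a DataFrame with axis=1 applies the function to each row of that DataFrame, one row at a time. You don't need to pass whole columns in as inputs to the function, and calling sentence.str.split() makes no sense there: inside cutOff(), sentence is just a regular string (not a column), and a plain string has no .str accessor.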
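To see what the function actually receives, here is a minimal sketch on a made-up one-row frame. The names `demo` and `describe_row` are hypothetical and only used for illustration; the column names follow the question.

```python
import pandas as pd

# Hypothetical one-row frame using the column names from the question.
demo = pd.DataFrame({
    "sentence": ["promoter polymorphism of the mtnr1a gene"],
    "rootword": ["mtnr1a"],
})

def describe_row(row):
    # With axis=1, `row` is a pandas Series holding one row's values,
    # and row["sentence"] is an ordinary Python string.
    return f"{type(row).__name__} / {type(row['sentence']).__name__}"

kinds = demo.apply(describe_row, axis=1)
print(kinds.iloc[0])
```

Because each cell is a plain string, the ordinary str.split() method is the right call inside the function.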
Change your function to:
def cutOff(sentence,root):
    try:
        words = sentence.split() # this is the line that was changed
        n = words.index(root)
        cutoff = ' '.join(words[n-4:n+5])
    except ValueError:
        cutoff = None
    return cutoff
Next, you just need to specify which columns will be used as the inputs to the function. You can do this with a lambda:
df.apply(lambda x: cutOff(x["sentence"], x["rootword"]), axis=1)
#0 promoter polymorphism of the mtnr1a gene and a...
#dtype: object
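Regarding the edit in the question: when the root is one of the first four words, n-4 is negative, and a negative slice start counts from the end of the list, so the slice comes back empty or wrong. One possible way to handle this, as a sketch rather than a definitive fix, is to clamp the start index at 0. The function name `cut_off` and the `width` parameter below are hypothetical.

```python
def cut_off(sentence, root, width=4):
    """Return up to `width` words on each side of `root`, or None if absent."""
    words = sentence.split()
    try:
        n = words.index(root)
    except ValueError:
        return None
    # Clamp the start at 0: a negative start would count from the end of
    # the list and produce an empty or wrong slice.
    start = max(0, n - width)
    # An end index past the last word is safe; slicing simply stops there.
    return " ".join(words[start:n + width + 1])

first = cut_off("mtnr1a lack of association between the promoter", "mtnr1a")
last = cut_off("the gene and adolescent idiopathic scoliosis mtnr1a", "mtnr1a")
print(first)
print(last)
```

Note that this keeps up to four words before the root, so the last-position case returns four preceding words rather than the three shown in the question; pass a different width if that is what you want. It can be used in the same lambda in place of cutOff().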