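I have a list of genes, and I need to determine whether any gene in the list appears in an "ArticleTitle" and, if it does, find the start and end positions of the gene within the sentence.
The code I wrote does identify the gene and reports its word index in the sentence. However, I need help finding the character positions where the gene starts and ends.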
import xml.etree.ElementTree as ET

# `tree` is assumed to be an ElementTree that was parsed earlier (not shown here)
doc = tree.getroot()
for ArticleTitle in doc.iter('ArticleTitle'):
    # Serialize the element, then cut off the XML declaration and opening tag
    file1 = (ET.tostring(ArticleTitle, encoding='utf8').decode('utf8'))
    filename = file1[52:(len(file1))]
    Article = filename.split("<")[0]
    # print(Article)
    # print(type(Article))
    title = Article.split()
    gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"]
    for item in title:
        for item1 in gene_list:
            if item == item1:
                str_title = ' '.join(title)
                print(str_title)
                print("Gene Found: " + item)
                # Word index of the gene within the title
                index = title.index(item)
                print("Index of the Gene :" + str(index))
                # Counts every character in the title, i.e. its total length
                result = 0
                for char in str_title:
                    result += 1
                print(result)
The current output is:
Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
54
The expected output is:
Healthy people 2000: a call to action for ADA members.
Gene Found: ADA
Index of the Gene :8
Gene start position: 42
Gene End position: 45
The start and end positions should also count the spaces between words.
You can use a regular expression:
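import re

genes = ["ABCD1","ADA","ALDOB","APC","ARSB"]
pattern = '|'.join(genes)
test_string = 'Healthy people 2000: a call to action for ADA members.'
pos = 0  # word index of the current token
for word in test_string.split():
    m = re.search(pattern, word)
    if m:
        gene = m.group(0)
        # find() returns the offset of the first occurrence of the gene in the string
        start = test_string.find(gene)
        end = start + len(gene)
        # Printed as a tuple: (start, end, gene, word index)
        print((start, end, gene, pos))
    pos += 1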
Output:
(42, 45, 'ADA', 8)
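One caveat with the loop above: `test_string.find(gene)` always returns the first occurrence, so a gene that appears twice in a title would be reported with the same offsets both times. Below is a minimal sketch that takes the offsets from each token instead. It assumes tokens are separated by whitespace and that punctuation around a token should be ignored; the function name `find_genes_by_token` is made up for illustration.

```python
import re
import string

def find_genes_by_token(text, genes):
    """Return (word_index, start, end, gene) for each token that is a gene."""
    gene_set = set(genes)
    hits = []
    # Each \S+ match is one whitespace-separated token; enumerate gives its word index.
    for word_index, m in enumerate(re.finditer(r"\S+", text)):
        token = m.group(0)
        # Strip surrounding punctuation, remembering how much was cut on the left.
        left_stripped = token.lstrip(string.punctuation)
        core = left_stripped.rstrip(string.punctuation)
        if core in gene_set:
            start = m.start() + (len(token) - len(left_stripped))
            hits.append((word_index, start, start + len(core), core))
    return hits

title = "Healthy people 2000: a call to action for ADA members."
print(find_genes_by_token(title, ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]))
# [(8, 42, 45, 'ADA')]
```

Because the offsets come from the token's own match object, repeated genes each get their own position.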
A shorter solution, which gives the character offsets but not the word index, could be:
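import re

genes = ["ABCD1","ADA","ALDOB","APC","ARSB"]
pattern = '|'.join(genes)
test_string = 'Healthy people 2000: a call to action for ADA members.'
# Each tuple is (start offset, matched gene, end offset)
print([(m.start(), m.group(0), m.end()) for m in re.finditer(pattern, test_string)])
# [(42, 'ADA', 45)]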
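Note that a plain `'|'.join(...)` pattern also matches inside longer tokens, so `APC` would be found in a word like `APC2`. The sketch below is one way to tighten this, assuming genes should only match as whole words; it also recovers the word index from the character offset. The helper name `gene_spans` is hypothetical.

```python
import re

def gene_spans(text, genes):
    """Return (gene, word_index, start, end) for whole-word gene matches."""
    # re.escape guards against regex metacharacters in gene names;
    # longer names go first so they are tried before shorter alternatives.
    ordered = sorted(genes, key=len, reverse=True)
    alternatives = "|".join(re.escape(g) for g in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    results = []
    for m in pattern.finditer(text):
        prefix = text[:m.start()]
        # Count the whitespace-separated words before the match.
        word_index = len(prefix.split())
        # If the match starts mid-token (e.g. after "("), that token was counted too.
        if prefix and not prefix[-1].isspace():
            word_index -= 1
        results.append((m.group(0), word_index, m.start(), m.end()))
    return results

title = "Healthy people 2000: a call to action for ADA members."
print(gene_spans(title, ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]))
# [('ADA', 8, 42, 45)]
```

The end offset is exclusive, as with `m.end()`, so `title[42:45]` is `'ADA'`.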
We can also use FlashText:
from flashtext import KeywordProcessor

kpo = KeywordProcessor(case_sensitive=True)
gene_list = ["ABCD1","ADA","ALDOB","APC","ARSB","ATAD3B","AXIN2","BLM","BMPR1A","BRAF","BRCA1"]
for word in gene_list:
    kpo.add_keyword(word)
# span_info=True adds the start and end offsets of each match
print(kpo.extract_keywords("Healthy people 2000: a call to action for ADA members.", span_info=True))
# Output: [('ADA', 42, 45)]
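As a side note on the question's loop over `ArticleTitle` elements: slicing the serialized XML at a fixed offset (`file1[52:]`) breaks if the declaration or tag changes, and splitting on `<` cuts the title at any nested markup. A rough sketch of an alternative is to read the text with `itertext()` and search it directly. The sample XML and the helper name `titles` are assumptions for illustration; a real file would be loaded with `ET.parse`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, standing in for the real input file.
SAMPLE_XML = """<Root>
  <Article><ArticleTitle>Healthy people 2000: a call to action for ADA members.</ArticleTitle></Article>
</Root>"""

def titles(root):
    """Yield the plain text of every ArticleTitle element, including nested tags."""
    for element in root.iter("ArticleTitle"):
        yield "".join(element.itertext())

gene_list = ["ABCD1", "ADA", "ALDOB", "APC", "ARSB"]
pattern = re.compile(r"\b(?:" + "|".join(re.escape(g) for g in gene_list) + r")\b")

root = ET.fromstring(SAMPLE_XML)
for title in titles(root):
    for m in pattern.finditer(title):
        print(title)
        print("Gene Found: " + m.group(0))
        print("Gene start position: " + str(m.start()))
        print("Gene End position: " + str(m.end()))
```

For the sample title this reports start position 42 and end position 45, matching the expected output in the question.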