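I have a list containing three values (strings) and a substring.
Each string in the list needs to be searched for the given substring between positions 20 and 50, and if the substring occurs more than 5 times in that string, the string should be reported.
If a string does not contain the substring at all, a message saying so should be printed for that list item.
The output should be (given my code below):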
1 Enriched with SP1 binding sites
3 Contains no SP1 binding sites
seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"
for count, value in enumerate(seq_list, start=1):
    if binding_site in value:
        sumSP = int(sum(s.count('GGCGG') for s in seq_list))
        if sumSP > 20:
            print(count, "enriched with SP1 binding sites")
    else:
        print(count, "No binding sites found.")
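So I have two problems. First, I searched the internet for a simple way to search each string between positions 20 and 50, but I only managed to find how to select positions across the whole list (using slicing). Second, my sumSP code does not work: it gives true for my second string, which should be false, because only value 1 in my list contains more than 5 binding sites.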
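The code below is what I think you want, but it can easily be modified. It uses a regex as a simple way to count occurrences of a substring, and it shows how to search only part of a string.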
import re

seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"
search_for = 'GGCGG'  # pattern counted inside the window (as in the question's code)
START = 20
FINISH = 50

for i, seq in enumerate(seq_list):
    if binding_site not in seq:
        print(f"seq {i} No binding sites found.")
    elif len(seq) < FINISH:
        # The string is too short to cover the whole search window.
        print(f"seq {i} length {len(seq)} less than search size {FINISH}")
    else:
        # Slice out the window first, then count matches within it only.
        num = len(re.findall(search_for, seq[START:FINISH]))
        print(f"seq {i} has {num} found - enriched with SP1 binding sites")
This gives:
seq 0 has 3 found - enriched with SP1 binding sites
seq 1 length 6 less than search size 50
seq 2 No binding sites found.
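Note that the code above reports the count but never compares it to the "more than 5" limit from the question. If that comparison is wanted, a minimal sketch could look like the following. The helper names `count_in_window` and `classify` are made up for illustration, and it assumes that "position 20 to 50" means the zero-based slice `seq[20:50]` and that occurrences are counted per string (not summed over the whole list, which is why `sumSP` in the question gives the same result for every item). It uses `str.count`, which counts non-overlapping matches, in place of the regex.

```python
def count_in_window(seq, site, start=20, finish=50):
    """Count non-overlapping occurrences of site inside seq[start:finish]."""
    return seq[start:finish].count(site)


def classify(seq, site, start=20, finish=50, threshold=5):
    """Return a message for one sequence, or None if there is nothing to report."""
    if site not in seq:
        return "Contains no SP1 binding sites"
    if count_in_window(seq, site, start, finish) > threshold:
        return "Enriched with SP1 binding sites"
    return None


seq_list = ["GGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGGGGGCGGAAAAGGGCGGAAAAGGGCGG", "GGGCGG", "BBBBBBB"]
binding_site = "GGGCGG"

# Number the items from 1, as in the question's expected output.
for number, seq in enumerate(seq_list, start=1):
    message = classify(seq, binding_site)
    if message is not None:
        print(number, message)
```

With this sample data the 20 to 50 window of the first string holds only 3 complete copies of GGGCGG, so under these assumptions it is not flagged with a limit of 5; lowering `threshold` or widening the window changes that.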