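I have a cluster file produced by clustering sequences at 100% sequence identity. Each cluster starts with a header line giving the cluster number, followed by the IDs of its member sequences. Here is an example of the file format: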
>Cluster 107
0 410aa, >TRINITY_DN9528_c0_g1_i1_30... *
1 410aa, >TRINITY_DN9528_c0_g1_i2_36... at 100.00%
2 404aa, >crgi_XP_011414097.1... at 100.00%
>Cluster 108
0 410aa, >TRINITY_DN11082_c0_g1_i1_69... *
1 410aa, >TRINITY_DN11082_c0_g1_i2_69... at 100.00%
>Cluster 109
0 410aa, >crgi_XP_011450995.2... *
>Cluster 110
0 407aa, >TRINITY_DN4674_c0_g1_i3_24... *
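I want to write a bash script that extracts the clusters in which an ID contains a given string, but only if the cluster also has other sequences besides the one with that string. For example, if I pass the string "crgi", the script should report only clusters with an ID containing that string, and skip a cluster when that sequence is its only member.
Here is the expected output for the input string "crgi":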
Clusters in file1.ids containing 'crgi':
107
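I have tried grep, awk, and cut, but I am struggling to extract the clusters I need.
Could someone show me how to write a bash script that does this efficiently? Any help would be much appreciated. Thanks.
I tried the following script, but it does not work: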
#!/bin/bash
# Define the search string
search_string="crgi"
# Loop through each file
for file in *.ids; do
    # Extract cluster numbers containing the search string
    clusters=$(awk -v search="$search_string" '$0 ~ search {print $1}' "$file")
    # Initialize a flag for presence of other sequences
    other_sequences_found=false
    # Check if clusters contain other sequences
    while read -r cluster; do
        # Check if the cluster contains other sequences besides the search string
        if [ "$(grep -c ">$search_string" "$file")" -gt 1 ]; then
            other_sequences_found=true
            break
        fi
    done <<< "$clusters"
    # If other sequences found, print the cluster numbers
    if [ "$other_sequences_found" = true ]; then
        echo "Clusters in $file containing '$search_string':"
        echo "$clusters"
        echo
    fi
done
From the example file, it is clear that you only need to check whether the first column is greater than 0, like so:
$ awk '/crgi/{if ($1>0) {print $0}}' test.file
2 404aa, >crgi_XP_011414097.1... at 100.00%
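Note that the one-liner above prints the matching member line rather than the cluster number, and it would miss a cluster in which the matching sequence is the representative (index 0) but other sequences follow it. If that matters, a rough alternative is to remember the current cluster number and count members per cluster. The sketch below is only one way to do it: `find_clusters` is a made-up helper name, it assumes the ID is always the third whitespace-separated field, and it interprets "other sequences" as "at least one member whose ID does not contain the string".

```shell
#!/bin/bash
# Hypothetical helper (not part of any clustering tool): print the numbers of
# clusters that contain an ID matching the string plus at least one other member.
# Usage: find_clusters STRING FILE
find_clusters() {
    local search=$1 file=$2
    awk -v s="$search" '
        # Print the cluster just finished if it has a hit and a non-matching member.
        function report() { if (hit > 0 && total > hit) print id }
        # Header line: close the previous cluster and start a new one.
        /^>Cluster/ { report(); id = $2; total = 0; hit = 0; next }
        # Member line: assumed layout is "index length, >ID... status".
        NF >= 3 { total++; if (index($3, s)) hit++ }
        # The last cluster has no header after it, so report it at the end.
        END { report() }
    ' "$file"
}
```

On the example file from the question, `find_clusters crgi file1.ids` prints `107`: cluster 109 is skipped because the crgi sequence is its only member. If a cluster made up solely of several matching sequences should also count, changing the condition to `total > 1` would be the adjustment.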