我想使用 Bash 脚本从 FASTA format 格式文件中查找 GC-content。 GC 含量基本上为 ( (G + C) 的数量) / ( (A + T + G + C) 的数量)。
我正在尝试使用 wc 命令。但我没能得到答案。
在查看文档和视频后,我想出了一个解决方案。
filename=$@ # Collecting all the filenames as parameters
for f in $filename # Looping over files
do
echo " $f is being processed..."
gc=( $( grep -v ">" < "$f" | grep -io 'g\|c'< "$f" | wc -l)) # Reading lines that don’t start with < using -v. grep -io matches to either g or c and outputs each match on single line. wc -l counts the number of lines or indirectly the number of g and c. This is stored in a variable.
total=( $( grep -v ">" < "$f" | tr -d '\s\r' | wc -c)) # Spaces, tabs, new line are removed from the file using tr. Then the number of characters are counted by wc -c
percent=( $( echo "scale=2;100*$gc/$total" |bc -l)) # bc -l is used to get the answer in float format. scale=2 mentions the number of decimal points.
echo " The GC content of $f is: "$percent"%"
echo
done
我正在学习生物信息学。
不要重新发明轮子。对于常见的生物信息学任务,请使用专为这些任务设计、经过充分测试、广泛使用并处理边缘情况的开源工具。例如,使用
EMBOSS
infoseq
实用程序。 EMBOSS
可以轻松安装,例如使用 conda
。
示例:
安装
EMBOSS
软件包(执行一次):
conda create --name emboss emboss --channel iuc
激活
conda
环境并使用EMBOSS
infoseq
,此处打印序列名称、长度和GC百分比:
source activate emboss
cat your_sequence_file_name.fasta | infoseq -auto -only -name -length -pgc stdin
source deactivate
这会打印到标准输出,如下所示:
Name Length %GC
seq_foo 119 60.50
seq_bar 104 39.42
seq_baz 191 46.60
...
无需脚本 只是 seqkit
> seqkit stat -a input.fa
output:
format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%)
这应该有效:
#!/usr/bin/env sh
# Adapted from https://www.biostars.org/p/17680
# Fail on error
set -o errexit
# Disable undefined variable reference
set -o nounset
# ================
# CONFIGURATION
# ================
# Fasta file path
FASTA_FILE="file.fasta"
# Number of digits after decimal point
N_DIGITS=3
# ================
# LOGGER
# ================
# Fatal log message
fatal() {
printf '[FATAL] %s\n' "$@" >&2
exit 1
}
# Info log message
info() {
printf '[INFO ] %s\n' "$@"
}
# ================
# MAIN
# ================
{
# Check command 'bc' exist
command -v bc > /dev/null 2>&1 || fatal "Command 'bc' not found"
# Check file exist
[ -f "$FASTA_FILE" ] || fatal "File '$FASTA_FILE' not found"
# Count number of sequences
_n_sequences=$(grep --count '^>' "$FASTA_FILE")
info "Analyzing $_n_sequences sequences"
[ "$_n_sequences" -ne 0 ] || fatal "No sequences found"
# Remove sequence wrapping
_fasta_file_content=$(
sed 's/\(^>.*$\)/#\1#/' "$FASTA_FILE" \
| tr --delete "\r\n" \
| sed 's/$/#/' \
| tr "#" "\n" \
| sed '/^$/d'
)
# Vars
_sequence=
_a_count_total=0
_c_count_total=0
_g_count_total=0
_t_count_total=0
# Read line by line
while IFS= read -r _line; do
# Check if header
if printf '%s\n' "$_line" | grep --quiet '^>'; then
# Save sequence and continue
_sequence=${_line#?}
continue
fi
# Count
_a_count=$(printf '%s\n' "$_line" | tr --delete --complement 'A' | wc --bytes)
_c_count=$(printf '%s\n' "$_line" | tr --delete --complement 'C' | wc --bytes)
_g_count=$(printf '%s\n' "$_line" | tr --delete --complement 'G' | wc --bytes)
_t_count=$(printf '%s\n' "$_line" | tr --delete --complement 'T' | wc --bytes)
# Add current count to total
_a_count_total=$((_a_count_total + _a_count))
_c_count_total=$((_c_count_total + _c_count))
_g_count_total=$((_g_count_total + _g_count))
_t_count_total=$((_t_count_total + _t_count))
# Calculate GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count" "$_c_count" "$_g_count" "$_t_count" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "Sequence '$_sequence' GC content: $_gc"
done << EOF
$_fasta_file_content
EOF
# Total data
info "Adenine total count: $_a_count_total"
info "Cytosine total count: $_c_count_total"
info "Guanine total count: $_g_count_total"
info "Thymine total count: $_t_count_total"
# Calculate total GC content
_gc=$(
printf 'scale = %d; a = %d; c = %d; g = %d; t = %d; (g + c) / (a + c + g + t)\n' \
"$N_DIGITS" "$_a_count_total" "$_c_count_total" "$_g_count_total" "$_t_count_total" \
| bc
)
# Add 0 before decimal point
_gc="$(printf "%.${N_DIGITS}f\n" "$_gc")"
info "GC content: $_gc"
}
“计数序列数”和“删除序列包装”代码改编自https://www.biostars.org/p/17680
该脚本仅使用除 bc 之外的基本命令来进行精度计算(请参阅 bc 安装)。
您可以通过修改
CONFIGURATION
部分中的变量来配置脚本。
因为您没有指明您想要哪一个,所以会计算每个序列和整体的 GC 含量。因此,扔掉任何不必要的东西:)
尽管我缺乏生物信息学背景,但该脚本成功解析和分析了 fasta 文件。