我是 Snakemake 的新手。 我尝试构建一个小型管道来合并两个 VCF 文件,但我不断收到错误。以前有人遇到过这个问题吗?
这是我合并两个 VCF 文件的原始代码:
ID = [line.strip() for line in open("0.rawdata_fastq/sample-pID.txt", "r")]
from datetime import datetime
date = datetime.now().strftime("%Y%m%d_%H%M%S")
import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts
rule all_variant_filtering:
input:
expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz", pID=ID),
expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz", pID=ID),
expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz", pID=ID),
expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz", pID=ID),
expand("8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz", pID=ID)
# Step 8-1 : SNPs VariantFiltration
rule snp_filter:
input:
vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
output:
vcf_snp="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz"
log:
vcf_snp_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.SNP." + date +".log"
message:
"\nStep 8-1: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (SNP)"
shell:
"""
{input.gatk} VariantFiltration \
--reference {input.ref} \
--variant {input.vcf} \
--filter-expression "QD < 2.0" --filter-name "QD" \
--filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
--filter-expression "SOR > 3.0" --filter-name "SOR" \
--filter-expression "FS > 60.0" --filter-name "FS" \
--filter-expression "MQ < 40.0" --filter-name "MQ" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS" \
--output {output.vcf_snp} 2>&1 | tee -a {log.vcf_snp_log}
"""
# Step 8-2 : InDel VariantFiltration
rule indel_filter:
input:
vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
output:
vcf_indel="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz"
log:
vcf_indel_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.InDel." + date +".log"
message:
"\nStep 8-2: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (InDel)"
shell:
"""
{input.gatk} VariantFiltration \
--reference {input.ref} \
--variant {input.vcf} \
--filter-expression "QD < 2.0" --filter-name "QD" \
--filter-expression "SOR > 10.0" --filter-name "SOR" \
--filter-expression "FS > 200.0" --filter-name "FS" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS" \
--output {output.vcf_indel} 2>&1 | tee -a {log.vcf_indel_log}
"""
# Step 8-3 : SNPs SelectVariants
rule snp_select:
input:
snp_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz",
ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
output:
vcf_snp_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz"
log:
vcf_snp_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.SNP." + date +".log"
message:
"\nStep 8-3 : SelectVariants: Select a subset of variants from a VCF file (SNP)"
shell:
"""
{input.gatk} SelectVariants \
--reference {input.ref} \
--variant {input.snp_vcf} \
--output {output.vcf_snp_filter} \
--select-type-to-include SNP --select-type-to-include MNP 2>&1 | tee -a {log.vcf_snp_filter_log}
"""
# Step 8-4 : InDels SelectVariants
rule indel_select:
input:
indel_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz",
ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
output:
vcf_indel_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
log:
vcf_indel_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.InDel." + date +".log"
message:
"\nStep 8-4 : SelectVariants: Select a subset of variants from a VCF file (InDel)"
shell:
"""
{input.gatk} SelectVariants \
--reference {input.ref} \
--variant {input.indel_vcf} \
--output {output.vcf_indel_filter} \
--select-type-to-include INDEL --select-type-to-include MIXED 2>&1 | tee -a {log.vcf_indel_filter_log}
"""
# Step 8-5 : Merge selected.SNP and selected.InDel to a VCF.gz file
rule mergr_to_vcf:
input:
VCF_SNP="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz",
VCF_INDEL="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
output:
VCF_MERGE="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz"
log:
merge_log="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}-MergeVcfs." + date +".log"
message:
"\nStep 8-5 : Merge to a VCF file."
shell:
"gatk MergeVcfs -I {input.VCF_SNP} -I {input.VCF_INDEL} -O {output.VCF_MERGE} 2>&1 | tee -a {log.merge_log}"
直到第8-5步一切都运行成功,错误消息是
IndentationError in file <string>, line 11:
unindent does not match any outer indentation level:
VCF_INDEL="/home/fanny/Documents/SRR1695644.selected_InDel.vcf"
File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 541, in _generate_tokens_from_c_tokenizer
File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 537, in _generate_tokens_from_c_tokenizer
我不明白为什么 tokenize.py 中有错误,因为据我所知,它是默认的 python 脚本。
所以我尝试另外两种方法来解决这个问题。 一种是在 shell 中将 gatk 更改为 picard,但遇到了同样的错误。 另一个是使用包装器:“v3.9.0/bio/picard/mergevcfs”,我收到新的错误消息:
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job stats:
job count
---------- -------
merge_vcfs 1
total 1
Select jobs to execute...
Execute 1 jobs...
[Tue May 14 09:50:23 2024]
Job 0:
Step 8-5 : Merge to a VCF file.
Reason: Missing output files: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz
Traceback (most recent call last):
File "/home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py", line 24, in <module>
shell(
File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/site-packages/snakemake/shell.py", line 297, in __new__
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail; picard MergeVcfs -Xmx164M -Djava.io.tmpdir=/tmp --INPUT /home/fanny/Documents/SRR1695644.selected_SNP.vcf --INPUT /home/fanny/Documents/SRR1695644.selected_InDel.vcf --TMP_DIR /tmp/tmpkkyq14g1 --OUTPUT /home/fanny/Documents/SRR1695644.vronaly.vcf.gz 2> /home/fanny/Documents/SRR1695644-MergeVcfs.log' returned non-zero exit status 127.
RuleException:
CalledProcessError in file /home/fanny/Documents/test.smk, line 22:
Command 'set -euo pipefail; /home/fanny/miniforge3/envs/snakemake/bin/python3.12 /home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py' returned non-zero exit status 1.
[Tue May 14 09:50:25 2024]
Error in rule merge_vcfs:
jobid: 0
input: /home/fanny/Documents/SRR1695644.selected_SNP.vcf, /home/fanny/Documents/SRR1695644.selected_InDel.vcf
output: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz
log: /home/fanny/Documents/SRR1695644-MergeVcfs.log (check log file(s) for error details)
conda-env: /home/fanny/Documents/.snakemake/conda/bf54fead3e98a647607f11e9b2421b3c_
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-14T095023.353231.snakemake.log
WorkflowError:
At least one job did not complete successfully.
有人可以帮我解决这个问题吗?
非常感谢!
错误:
non-zero exit status 127
通常表示未找到命令。 检查 picard
是否是您的 $PATH
中的有效命令,而不是 shell 别名。
最简单的检查方法 - 在 shell 中运行
type picard
。
关于 IndentationError,这些类型的语法错误消息引用
tokenize.py
是正常的,即使问题出在您的 Snakefile 中,并且 tokenize.py
只是检测到问题的文件。这可能是 Snakemake 中应该修复的问题。