使用 Snakemake 运行 Picard 或 GATK MergeVcfs 时出错

Question

我是 Snakemake 的新手。我尝试构建一个小型管道来合并两个 VCF 文件，但我不断收到错误。以前有人遇到过这个问题吗？

这是我合并两个 VCF 文件的原始代码：

ID = [line.strip() for line in open("0.rawdata_fastq/sample-pID.txt", "r")]


from datetime import datetime
date = datetime.now().strftime("%Y%m%d_%H%M%S")

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts



rule all_variant_filtering:
    input:
        expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz", pID=ID),
        expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz", pID=ID),
        expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz", pID=ID),
        expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz", pID=ID),
        expand("8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz", pID=ID)



# Step 8-1 : SNPs VariantFiltration
rule snp_filter:
    input:
        vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_snp="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz"
    log:
        vcf_snp_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.SNP." + date +".log"
    message:
        "\nStep 8-1: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (SNP)"
    shell:
        """
        {input.gatk} VariantFiltration \
        --reference {input.ref} \
        --variant {input.vcf} \
        --filter-expression "QD < 2.0" --filter-name "QD" \
        --filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
        --filter-expression "SOR > 3.0" --filter-name "SOR" \
        --filter-expression "FS > 60.0" --filter-name "FS" \
        --filter-expression "MQ < 40.0" --filter-name "MQ" \
        --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS" \
        --output {output.vcf_snp} 2>&1 | tee -a {log.vcf_snp_log}
        """



# Step 8-2 : InDel VariantFiltration
rule indel_filter:
    input:
        vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_indel="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz"
    log:
        vcf_indel_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.InDel." + date +".log"
    message:
        "\nStep 8-2: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (InDel)"
    shell:
        """
        {input.gatk} VariantFiltration \
        --reference {input.ref} \
        --variant {input.vcf} \
        --filter-expression "QD < 2.0" --filter-name "QD" \
        --filter-expression "SOR > 10.0" --filter-name "SOR" \
        --filter-expression "FS > 200.0" --filter-name "FS" \
        --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS" \
        --output {output.vcf_indel} 2>&1 | tee -a {log.vcf_indel_log}
        """



# Step 8-3 : SNPs SelectVariants
rule snp_select:
    input:
        snp_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_snp_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz"
    log:
        vcf_snp_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.SNP." + date +".log"
    message:
        "\nStep 8-3 : SelectVariants: Select a subset of variants from a VCF file (SNP)"
    shell:
        """
        {input.gatk} SelectVariants \
        --reference {input.ref} \
        --variant {input.snp_vcf} \
        --output {output.vcf_snp_filter} \
        --select-type-to-include SNP --select-type-to-include MNP 2>&1 | tee -a {log.vcf_snp_filter_log}
        """



# Step 8-4 : InDels SelectVariants
rule indel_select:
    input:
        indel_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_indel_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
    log:
        vcf_indel_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.InDel." + date +".log"
    message:
        "\nStep 8-4 : SelectVariants: Select a subset of variants from a VCF file (InDel)"
    shell:      
        """
        {input.gatk} SelectVariants \
        --reference {input.ref} \
        --variant {input.indel_vcf} \
        --output {output.vcf_indel_filter} \
        --select-type-to-include INDEL --select-type-to-include MIXED 2>&1 | tee -a {log.vcf_indel_filter_log}
        """

# Step 8-5 : Merge selected.SNP and selected.InDel to a VCF.gz file
rule mergr_to_vcf:  
    input:     
        VCF_SNP="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz", 
        VCF_INDEL="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
    output:  
        VCF_MERGE="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz"
    log:
        merge_log="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}-MergeVcfs." + date +".log"
    message:
        "\nStep 8-5 : Merge to a VCF file."
    shell:
        "gatk MergeVcfs -I {input.VCF_SNP} -I {input.VCF_INDEL} -O {output.VCF_MERGE} 2>&1 | tee -a {log.merge_log}"

直到第8-5步一切都运行成功，错误消息是

IndentationError in file <string>, line 11:
unindent does not match any outer indentation level:
        VCF_INDEL="/home/fanny/Documents/SRR1695644.selected_InDel.vcf"
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 541, in _generate_tokens_from_c_tokenizer
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 537, in _generate_tokens_from_c_tokenizer

我不明白为什么 tokenize.py 中有错误，因为据我所知，它是默认的 python 脚本。

所以我尝试另外两种方法来解决这个问题。一种是在 shell 中将 gatk 更改为 picard，但遇到了同样的错误。另一个是使用包装器：“v3.9.0/bio/picard/mergevcfs”，我收到新的错误消息：

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job stats:
job           count
----------  -------
merge_vcfs        1
total             1

Select jobs to execute...
Execute 1 jobs...

[Tue May 14 09:50:23 2024]
Job 0: 
Step 8-5 : Merge to a VCF file.
Reason: Missing output files: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz

Traceback (most recent call last):
  File "/home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py", line 24, in <module>
    shell(
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/site-packages/snakemake/shell.py", line 297, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  picard MergeVcfs  -Xmx164M -Djava.io.tmpdir=/tmp  --INPUT /home/fanny/Documents/SRR1695644.selected_SNP.vcf --INPUT /home/fanny/Documents/SRR1695644.selected_InDel.vcf --TMP_DIR /tmp/tmpkkyq14g1 --OUTPUT /home/fanny/Documents/SRR1695644.vronaly.vcf.gz  2> /home/fanny/Documents/SRR1695644-MergeVcfs.log' returned non-zero exit status 127.
RuleException:
CalledProcessError in file /home/fanny/Documents/test.smk, line 22:
Command 'set -euo pipefail;  /home/fanny/miniforge3/envs/snakemake/bin/python3.12 /home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py' returned non-zero exit status 1.
[Tue May 14 09:50:25 2024]
Error in rule merge_vcfs:
    jobid: 0
    input: /home/fanny/Documents/SRR1695644.selected_SNP.vcf, /home/fanny/Documents/SRR1695644.selected_InDel.vcf
    output: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz
    log: /home/fanny/Documents/SRR1695644-MergeVcfs.log (check log file(s) for error details)
    conda-env: /home/fanny/Documents/.snakemake/conda/bf54fead3e98a647607f11e9b2421b3c_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-14T095023.353231.snakemake.log
WorkflowError:
At least one job did not complete successfully.

有人可以帮我解决这个问题吗？

非常感谢！

Answer 1

错误：

non-zero exit status 127

通常表示未找到命令。检查

picard

是否是您的

$PATH

中的有效命令，而不是 shell 别名。

最简单的检查方法 - 在 shell 中运行

type picard

。

关于 IndentationError，这些类型的语法错误消息引用

tokenize.py

是正常的，即使问题出在您的 Snakefile 中，并且

tokenize.py

只是检测到问题的文件。这可能是 Snakemake 中应该修复的问题。

使用 Snakemake 运行 Picard 或 GATK MergeVcfs 时出错

问题描述投票：0回答：1

1个回答

最新问题

使用 Snakemake 运行 Picard 或 GATK MergeVcfs 时出错

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1