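I have a checkpoint that produces an unknown number of files, so the DAG has to be re-evaluated after it runs. I want to speed it up by parallelizing the command per chromosome, and then, after the re-evaluation, merge all chromosomes per experiment. However, Snakemake cannot infer the chromosome.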
checkpoint GenomeAnalysisTK:
    input:
        bamlist = rules.RealignerTargetCreator.output.bamlist,
        intervals = rules.RealignerTargetCreator.output.intervals,
        fasta = fasta
    output:
        temp(directory("splits/{chromosome}"))
    conda:
        "gatk3"
    wildcard_constraints:
        chromosome='|'.join([x for x in detect_chromosomes(fai)]),
    shell:
        """
        mkdir -p {output} && cd {output}
        gatk3 -Xmx24g -T IndelRealigner -I {input.bamlist} -targetIntervals {input.intervals} -L {wildcards.chromosome} -R {input.fasta} -compress 0 --nWayOut .{wildcards.chromosome}.indelrealigned.bam
        """

def agg(wildcards):
    output=checkpoints.GenomeAnalysisTK.get(**wildcards).output[0]
    return expand("splits/{chromosome}/{{experiment}}/{chromosome}.indelrealigned.bam")

rule merge_realigned:
    input:
        agg
    output:
        "{patient}/{sample}/{experiment}.merged.indelrealigned.bam"
    threads:
        config["other_threads"],
    params:
        compression_level = 0
    wildcard_constraints:
        chromosome='|'.join([x for x in detect_chromosomes(fai)]),
    shell:
        "samtools merge -@ {threads} -l {params.compression_level} {output} {input}"
However, I get the typical "Workflow error: missing wildcard value for chromosome". How can I make it infer the chromosome?
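The problem is that the merge_realigned rule has no wildcard that matches the chromosome, so you have to specify it in the input function. Your rule depends on all chromosomes, though, so you first have to request the checkpoint output for every chromosome: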
def agg(wildcards):
    # Request the checkpoint output once per chromosome before building the input list
    for chrom in CHROMOSOME_LIST:
        checkpoints.GenomeAnalysisTK.get(chromosome=chrom, **wildcards).output
    return expand("splits/{chromosome}/{{experiment}}/{chromosome}.indelrealigned.bam",
                  chromosome=CHROMOSOME_LIST)
You also have to specify the chromosome in the expand statement.
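To see what that expand call is doing with the braces, here is a small plain-Python sketch. It does not use Snakemake itself: it imitates the substitution with str.format, on the assumption that expand treats doubled braces the same way (single braces are filled in, doubled braces are left behind as a literal placeholder). The chromosome names and the helper fill_chromosomes are made up for illustration.

```python
# Plain-Python stand-in for expand(); not Snakemake's own implementation.
CHROMOSOME_LIST = ["chr1", "chr2"]  # hypothetical values for illustration

pattern = "splits/{chromosome}/{{experiment}}/{chromosome}.indelrealigned.bam"


def fill_chromosomes(pattern, chromosomes):
    """Fill in {chromosome}; the doubled braces collapse to a literal {experiment}."""
    return [pattern.format(chromosome=c) for c in chromosomes]


paths = fill_chromosomes(pattern, CHROMOSOME_LIST)
for path in paths:
    print(path)

# Leaving the chromosome out fails, which is the plain-Python analogue
# of the missing-wildcard error in the question.
try:
    pattern.format()
except KeyError as err:
    missing = err.args[0]
    print("missing value for:", missing)
```

Each resulting path still contains {experiment}, so that part has to be resolved separately, for example from the wildcards object inside the input function.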
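The for loop may prevent parallel execution if the first checkpoint has to finish before the second one is requested; I am not sure whether that is the case.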
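CHROMOSOME_LIST is not defined anywhere in the question, so it has to come from somewhere. One option is to read it from the FASTA index, which is presumably what detect_chromosomes(fai) does. The sketch below is a guess at such a helper, not the asker's actual function: it assumes a standard tab-separated .fai file with the sequence name in the first column, and the helper chromosome_constraint and the tiny index written to a temporary file are invented for the demo.

```python
import os
import re
import tempfile


def detect_chromosomes(fai_path):
    """Return the sequence names from the first column of a FASTA index (.fai)."""
    names = []
    with open(fai_path) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                names.append(line.split("\t")[0])
    return names


def chromosome_constraint(names):
    """Build a regex alternation, escaping characters such as '.' in contig names."""
    return "|".join(re.escape(name) for name in names)


# Demo with a made-up two-entry index (all numbers are placeholders)
with tempfile.NamedTemporaryFile("w", suffix=".fai", delete=False) as tmp:
    tmp.write("chr1\t1000\t6\t60\t61\n")
    tmp.write("chr2\t800\t1030\t60\t61\n")

CHROMOSOME_LIST = detect_chromosomes(tmp.name)
os.remove(tmp.name)

print(CHROMOSOME_LIST)
print(chromosome_constraint(CHROMOSOME_LIST))
```

Escaping the names matters mostly for references with unplaced contigs, whose names often contain dots that would otherwise match any character in the wildcard constraint.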