I use snakemake to process 8 files (fastq) in parallel. Each file is then demultiplexed, and I use snakemake again to process, in parallel, the demultiplexed files generated from each of them.
My first attempt (which works fine) was to use 2 snakefiles.
I would like to use only one snakefile.
Here is the solution with 2 snakefiles:
snakefile #1, which processes the 8 files in parallel (wildcard {run}):
configfile: "config.yaml"

rule all:
    input:
        expand("{folder}{run}_R1.fastq.gz", run=config["fastqFiles"], folder=config["fastqFolderPath"]),
        expand('assembled/{run}/{run}.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.ali.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.ali.assigned.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.unidentified.fastq', run=config["fastqFiles"]),
        expand('log/remove_unaligned/{run}.log', run=config["fastqFiles"]),
        expand('log/illuminapairedend/{run}.log', run=config["fastqFiles"]),
        expand('log/assign_sequences/{run}.log', run=config["fastqFiles"]),
        expand('log/split_sequences/{run}.log', run=config["fastqFiles"])
include: "00-rules/assembly.smk"
include: "00-rules/demultiplex.smk"
snakefile #2, which processes the demultiplexed files in parallel:
SAMPLES, = glob_wildcards('samples/{sample}.fasta')

rule all:
    input:
        expand('samples/{sample}.uniq.fasta', sample=SAMPLES),
        expand('samples/{sample}.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.r.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.c.r.l.u.fasta', sample=SAMPLES),
        expand('log/dereplicate_samples/{sample}.log', sample=SAMPLES),
        expand('log/goodlength_samples/{sample}.log', sample=SAMPLES),
        expand('log/clean_pcrerr/{sample}.log', sample=SAMPLES),
        expand('log/rm_internal_samples/{sample}.log', sample=SAMPLES)
include: "00-rules/filtering.smk"
This solution works fine.
Is it possible to merge these two snakefiles into a single one, like this?
configfile: "config.yaml"

rule all:
    input:
        expand("{folder}{run}_R1.fastq.gz", run=config["fastqFiles"], folder=config["fastqFolderPath"]),
        expand('assembled/{run}/{run}.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.ali.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.ali.assigned.fastq', run=config["fastqFiles"]),
        expand('assembled/{run}/{run}.unidentified.fastq', run=config["fastqFiles"]),
        expand('log/remove_unaligned/{run}.log', run=config["fastqFiles"]),
        expand('log/illuminapairedend/{run}.log', run=config["fastqFiles"]),
        expand('log/assign_sequences/{run}.log', run=config["fastqFiles"]),
        expand('log/split_sequences/{run}.log', run=config["fastqFiles"])

include: "00-rules/assembly.smk"
include: "00-rules/demultiplex.smk"

SAMPLES, = glob_wildcards('samples/{sample}.fasta')

rule all:
    input:
        expand('samples/{sample}.uniq.fasta', sample=SAMPLES),
        expand('samples/{sample}.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.r.l.u.fasta', sample=SAMPLES),
        expand('samples/{sample}.c.r.l.u.fasta', sample=SAMPLES),
        expand('log/dereplicate_samples/{sample}.log', sample=SAMPLES),
        expand('log/goodlength_samples/{sample}.log', sample=SAMPLES),
        expand('log/clean_pcrerr/{sample}.log', sample=SAMPLES),
        expand('log/rm_internal_samples/{sample}.log', sample=SAMPLES)

include: "00-rules/filtering.smk"
So I have to define rule all twice, and I get the following error message:
The name all is already used by another rule
Is there a way to have several rule all rules, or is using several snakefiles the only possible solution?
I would like to use snakemake in the most appropriate way.
You are not restricted in naming the top-level rule. You may name it all, or rename it: the only thing that matters is the order in which the rules are defined. By default, Snakemake takes the first rule as the target rule and then constructs the dependency graph from it.
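A minimal sketch of that behaviour (rule and file names here are made up for illustration): running snakemake with no target builds a.txt, because targets is defined first and is therefore the default target; make_a is only executed because the dependency graph requires it.

```
rule targets:            # first rule in the file => default target
    input: 'a.txt'

rule make_a:
    output: 'a.txt'
    shell: 'touch {output}'
```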
With that in mind, you have several options. First, you could merge the two top-level rules of your workflow into one: after all, the all rule does nothing but define the target files. Next, you could rename the rules to all1 and all2 (so that either workflow can still be run on its own by naming it on the command line), and provide an all rule with the merged inputs. Finally, you could use subworkflows, but as long as all you intend to do is squash two scripts into one, that would be overkill.
One more hint that may prove useful: if you define a distinct output for each run, there is no need to write the pattern expand('filename{sample}', sample=config["fastqFiles"]) for every single file. For example:
rule sample:
    input:
        'samples/{sample}.uniq.fasta',
        'samples/{sample}.l.u.fasta',
        'samples/{sample}.r.l.u.fasta',
        'samples/{sample}.c.r.l.u.fasta',
        'log/dereplicate_samples/{sample}.log',
        'log/goodlength_samples/{sample}.log',
        'log/clean_pcrerr/{sample}.log',
        'log/rm_internal_samples/{sample}.log'
    output:
        temp('flag_sample_{sample}_complete')
In this case, the all rule becomes trivial:
rule all:
    input: expand('flag_sample_{sample}_complete', sample=SAMPLES)
Or, as I suggested above:
rule all:
    input:
        expand('flag_run_{run}_complete', run=config["fastqFiles"]),
        expand('flag_sample_{sample}_complete', sample=SAMPLES)

rule all1:
    input: expand('flag_run_{run}_complete', run=config["fastqFiles"])

rule all2:
    input: expand('flag_sample_{sample}_complete', sample=SAMPLES)
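With an all1/all2 layout like this, each stage can then be run on its own by naming the target rule on the command line (the invocations below are illustrative; adjust --cores to your machine):

```
snakemake --cores 8 all1    # run only the per-run stage
snakemake --cores 8 all2    # run only the per-sample stage
snakemake --cores 8         # no target given: runs the first rule, all
```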