有任何方法可以在输出中使用通配符？

Question

在我的Snakefile中，我使用

SPLIT2NFILES

存储拆分数，然后存储一个列表，该列表应使用后缀ID分配到每个样本。

SPLIT_IDS

这是我的split_part规则。我想使用此规则将每个样本分为不同数量的零件，该零件根据配置文件和

SPLIT2NFILES = config["split2nfiles"]
SPLIT_IDS = defaultdict(list)
for k in SAMPLE2DATA.keys():
    SPLIT_IDS[k]=[f"{num:04d}" for num in range(1, SPLIT2NFILES[k] + 1)]

和

SPLIT2NFILES

。我的输入文件都喜欢{示例}。{ref} .sam。 {sample}在

SPLIT_IDS

中，{ref}只是

SAMPLE2DATA.keys()

。但是每个样本应分为不同的数字（ID不相同，例如样本1分为10个零件，然后将ID从“ 0001”到“ 0010”。但是样本2分为13个零件，然后从“ 0001获取ID” “ to“ 0013”）我希望此规则仅生成DAG中每个样本的零件（我有3个示例，这意味着此规则只能以3次调用），但是当我生成dag graph

genome

时，该图从规则

snakemake --dag  | dot -Tpng > dag_dyn_ids.png

开始。。

revert_split_bam

之前的所有规则（包括本身）都消失了。

split_part

有任何可能的方法来指定

rule split_part: input: ready=TEMPDIR / "bam2sam_barrier/ready", sams=TEMPDIR / "split_part/sams/{sample}.{ref}.sam", output: expand(TEMPDIR / "split_part/splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam", id=SPLIT_IDS["{{sample}}"] ), params: n_file=lambda wildcards : SPLIT2NFILES[wildcards.sample], tmp=os.environ["TMPDIR"], sample_name = lambda wildcards: wildcards.sample, shell: ........ rule revert_split_bam: input: rules.split_part.output output: temp(TEMPDIR / "split_part/bams/{sample}/{sample}.{ref}_{id}.bam"),

的输出，以匹配{sample}

我尝试了

split_part

，但似乎我不应该在

id

中使用lambda。错误是“只能将输入文件指定为函数”。

i我检查了此问题

https：//github.com/snakemake/snakemake/issues/1122

，我发现输出中使用参数无法正常工作。

简短的答案是“否”，您不能有规则的输出数量。规则的输出始终是模板字符串的固定列表。但是，您可以实现想要的东西，即。在块中处理您的SAM文件。

一种方法是让您的

id = lambda wildcards: SPLIT_IDS[wildcards.sample]

规则始终输出

expand

输出，然后在所有块数量较低的情况下，只需创建空文件即可填充输出（但从来没有实际尝试处理它们）。它会起作用，但有点奇怪。

或您可以重新运行每个样本的整个工作流程，而不是一次处理所有内容。这实际上可能是最简单的方法。曾经有一个支持此功能的“子工作流”功能，但是它被删除了，因为它实际上只是清洁，更简单地从shell脚本中的循环多次运行snakemake。其他方法是在分裂发生后使用

split_part

在DAG中添加新作业，但是在这种情况下，检查点是不必要的，因为您已经提前知道，您期望每个SAM文件会看到的块数量，因此检查点是多余的。但是，您仍然需要使

split

规则输出文件目录（这是您通常使用检查点规则执行的）。这是一个独立的例子： max(split2nfiles.values())

Answer 1

checkpoint

中的提取步骤简单地需要与目录中的文件进行符号链接。

最终选项可能是根据这些SAM文件中的内容而定的。您可以拥有一个规则，该规则每块运行一次，例如上面的

# Say I have three input files: FILE_SIZES = {"a": 12, "b": 15, "c": 14} # Ensure that f and p wildcards only match to letters and numbers respectively. wildcard_constraints: f = r"[a-z]", p = r"\d+", rule main: input: expand("doubled_{f}.txt", f=FILE_SIZES.keys()) rule combine_output: output: "doubled_{f}.txt" input: lambda wc: expand( "doubled_{f}_{p:02d}.txt", f = wc.f, p = range(FILE_SIZES[wc.f]) ) shell: "cat {input} > {output}" # As a proxy for a rule that processes a chunk of a BAM file, this just # doubles the content of a line, eg: "foo" -> "foo foo" rule double_up: output: "doubled_{f}_{p}.txt" input: "infile_{f}_{p}.txt" shell: "paste {input} {input} > {output}" # You could remove this rule and avoid making the symlinks if you just have the # "double_up" rule read the input directly, but I think it's cleaner to keep # this "shim" rule. # The explicit "test -e" avoids making dangling links if a file is somehow missing, # which is quite possible as we are relying on the filenames created by "split". rule link_a_chunk: output: "infile_{f}_{p}.txt" input: "infile_{f}_split" shell: """test -e {input}/{output} ln -srn {input}/{output} {output} """ # This rule has a directory output, as that's the only way to have a rule with a # variable number of output files. rule split_input: output: directory("infile_{f}_split") input: "infile_{f}.txt" shell: """mkdir {output} split -l1 {input} --numeric=0 {output}/infile_{wildcards.f}_ --add .txt """ rule gen_inputs: output: "infile_{f}.txt" params: num_lines = lambda wc: FILE_SIZES[wc.f] shell: "seq {params.num_lines} > {output}"

，而是预摘输入的每个作业将直接读取大型SAM文件，并且一次只提取一个块。当然，如果这需要在整个Big Sam文件中进行扫描15次，那么这将很慢，但是也许如果它们被转换为BAM，索引并使用“ Samtools View -M”提取，则可能是有效的？取决于您如何进行分裂。

有任何方法可以在输出中使用通配符？

问题描述投票：0回答：1

1个回答

最新问题

有任何方法可以在输出中使用通配符？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1