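I am trying to build a wildcard from folder/directory names that are produced by the rule ReferenceDatabase, which creates the folders "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}" (cluster1, cluster2, ... correspond to the dirname wildcard). The problem is that I cannot know in advance how many "cluster" directories will be created the first time this rule runs. So I tried writing the following Snakefile: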
import glob
# Need sample name and dirname
SAMPLES, = glob_wildcards("Campylobacter/core_genome/core/{sample}.fa.align")
dirnames, = glob_wildcards("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}", "Campylobacter/Gene_Flow/DatabaseQuery/{dirname}/{dirname}")
wildcard_constraints:
    dirname="cluster[0-9]+"
rule all:
    input:
        distmat_out = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances_out = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes_out = "Campylobacter/ANI_results/cluster_genomes.csv",
        liste_genomes = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt", dirname=dirnames),
        core_genome_within_species = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa", dirname=dirnames),
        distances_between_genomes_r = expand("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/distances.dist", dirname=dirnames)
rule define_ANI_species:
    input:
        fasta = "Campylobacter/core_genome/concat.fa",
        dir = "Campylobacter"
    output:
        distmat = "Campylobacter/ANI_results/ani/ani.distmat",
        parse_distances = "Campylobacter/ANI_results/genome_pairs.csv",
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
    shell:
        """
        mkdir -p Campylobacter/ANI_results/ani
        distmat -sequence {input.fasta} -nucmethod 0 -outfile {output.distmat}
        python pipelines/ANI/parse_distances.py {input.dir}
        python pipelines/ANI/cluster_genomes.py {input.dir}
        """
rule ReferenceDatabase:
    input:
        cluster_genomes = "Campylobacter/ANI_results/cluster_genomes.csv",
        dir = "Campylobacter"
    output:
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    shell:
        "python pipelines/ConSpecifix/create_Refdb.py {input.dir}"
rule core_genome_within_species:
    input:
        dir = "Campylobacter/genomes",
        liste = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/path_to_genome_list.txt"
    output:
        fasta = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/concat.fa",
        family = "Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome/families_core.txt"
    params:
        dir = directory("Campylobacter/Gene_Flow/ReferenceDatabase/{dirname}/core_genome")
    shell:
        "python pipelines/CoreCruncher/corecruncher_master.py -in {input.dir} -out {params.dir} -list {input.liste} -freq 85 -prog usearch -ext .fa -length 80 -score 70 -align mafft"
I get this error:
rule ReferenceDatabase:
    input: Campylobacter/ANI_results/genome_clusters.csv, Campylobacter
    output: Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
    jobid: 18
    wildcards: dirname=cluster[0-9]+
Waiting at most 5 seconds for missing files.
MissingOutputException in line 171 of /Users/home//Bioinformatic_tool/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
Campylobacter/Gene_Flow/ReferenceDatabase/cluster[0-9]+/path_to_genome_list.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Snakemake does not seem to recognize the regular expression "[0-9]+". Is there an int-like wildcard I could use to match cluster1, cluster2, cluster3, ...? (directory 1, directory 2, directory 3 ...?)
Use a wildcard only for the numeric part of the name:
"Campylobacter/Gene_Flow/ReferenceDatabase/cluster{num}"