存在一个包含配对读取的目录,配对的名称和数量未知,可能的文件扩展名是.fastq或.fastq.gz。
我需要有关如何最好地在目录中进行配对读取以便在 Snakemake 管道中进行处理的建议。
有多种方法可以解决这个问题。我的首选方法是使用 CSV 来指定读取路径。这样做的另一个好处是,您可以将元数据添加到每组读取中,然后您可以在工作流程中访问这些元数据。
以下是三个样本的映射读取示例:
A, B, C
目录结构:
.
├── Snakefile
├── data
│ ├── genome
│ │ └── ref.fa
│ └── reads
│ ├── a_R1.fastq.gz
│ ├── a_R2.fastq.gz
│ ├── b_R1.fastq
│ ├── b_R2.fastq
│ ├── c_R1.fq.gz
│ └── c_R2.fq.gz
└── sample_reads.csv
sample_reads.csv
:
sample_id,read1,read2
a,data/reads/a_R1.fastq.gz,data/reads/a_R2.fastq.gz
b,data/reads/b_R1.fastq,data/reads/b_R2.fastq
c,data/reads/c_R1.fq.gz,data/reads/c_R2.fq.gz
Snakefile
:
import pandas as pd
sample_reads = pd.read_csv("./sample_reads.csv")
rule all:
"""
Get all sample_ids from dataframe and use in expand to make sam
files for each sample.
"""
input:
expand(
"results/alignments/{sample}.sam",
sample=sample_reads["sample_id"].tolist(),
),
rule align_reads:
"""
Use sample wildcard to lookup read paths in pandas dataframe.
Need to use input function to access sample wildcard value.
"""
input:
read_1=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
"read1"
].item(),
read_2=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
"read2"
].item(),
ref="data/genome/ref.fa",
output:
sam="results/alignments/{sample}.sam",
shell:
"bwa mem {input.ref} {input.read_1} {input.read_2} > {output.sam}"