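I am currently facing a problem with Snakemake and I hope someone can help me solve it. I searched the internet for solutions to similar problems but did not find anything that addresses my specific case. I am not an advanced Snakemake user, and I would like to know whether Snakemake has a built-in solution for this kind of problem. An explanation of the why and the how would be great.
Here is my situation:
I want my Snakemake code to simply let me unzip a file in one rule and pick up the unzipped file in another rule for a simple processing task. Below is a code snippet that reproduces the error I am facing (Part 1). I have come up with a temporary workaround that makes the code work (Part 2 of the code below). I am not satisfied with this workaround, which combines unzipping and data processing in the same rule (here, using R), and I am looking for guidance on how to make option 1 work. Is there an option in Snakemake that lets me keep these two rules separate, as specified in Part 1 of the code below?
Thank you very much for your help.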
# Import necessary modules
import os
import numpy as np
import pandas as pd
import zipfile
import shutil
# Create necessary directories and files
os.makedirs("path/to/my/file/", exist_ok=True)
os.makedirs("path/to/my/compressed", exist_ok=True)
os.makedirs("path/to/my/uncompressed", exist_ok=True)
# Generate random data and create a Pandas DataFrame
x = np.random.rand(3, 2)
df = pd.DataFrame(data=x.astype(float))
df.to_csv("path/to/my/file/data.csv")
# Compress the CSV file into a zip file
with zipfile.ZipFile('path/to/my/compressed/myfile.zip', 'w') as z:
    z.write("path/to/my/file/data.csv", "data.csv")
# Main rule (rule all) specifying expected output files
rule all:
    input:
        i1='path/to/my/uncompressed',
        i2='path/to/my/Routput/data.csv'
# Part 1: Set of rules causing an error
# The first rule unzips a file from the compressed directory and stores it in the uncompressed directory.
# The second rule reads the unzipped file in R and rewrites it in the Routput directory.
# This way of specifying the rule does not work and produces an error: MissingInputException at line 37 of the Snakefile.
rule uncompress:
    input:
        'path/to/my/compressed/myfile.zip'
    output:
        directory('path/to/my/uncompressed')
    shell:
        """
        unzip {input} -d {output}
        """
rule load_data:
    input:
        'path/to/my/uncompressed/data.csv'
    output:
        'path/to/my/Routput/data.csv'
    shell:
        """
        Rscript -e "x='{input}'; y='{output}'; X=read.csv(x, header=T, sep=','); write.csv(X, y)"
        """
# Part 2: Functional solution but needs to remove the second rule and embed the logic of the second rule within the first rule
# Comment part 1 and uncomment part 2 to execute.
# Data loading is declared with the former first rule. This way works fine without any error, but I want to avoid it.
# rule uncompress:
#     input:
#         'path/to/my/compressed/myfile.zip'
#     output:
#         o1=directory('path/to/my/uncompressed'),
#         o2='path/to/my/Routput/data.csv'
#     params:
#         p1='path/to/my/uncompressed/data.csv'
#     shell:
#         """
#         unzip {input} -d {output.o1}
#         Rscript -e "x='{params.p1}'; y='{output.o2}'; X=read.csv(x, header=T, sep=','); write.csv(X, file=y)"
#         """
Snakemake version: 5.10.0
Python version: 3.8.10
Execution environment:
Linux distribution: Ubuntu 20.04.6 LTS
Release: 20.04, codename: focal
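You want the rule uncompress to produce data.csv, because that is the file the later rule consumes. Snakemake builds its job graph before running anything, by matching each rule's inputs against the declared outputs of other rules; a directory() output says nothing about the files inside it, so no rule appears to produce path/to/my/uncompressed/data.csv and you get the MissingInputException. So declare data.csv as an output file. To avoid hard-coding the path to data.csv twice in rule uncompress, you can derive the directory name from the path to data.csv. For example: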
import os
rule uncompress:
    input:
        'path/to/my/compressed/myfile.zip'
    output:
        csv='path/to/my/uncompressed/data.csv',
    params:
        d=lambda wc, output: os.path.dirname(output.csv),
    shell:
        r"""
        unzip -o {input} -d {params.d}
        """
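If it helps to see what the params lambda evaluates to, and that extracting into that directory really does produce the declared output file, here is a minimal standalone Python sketch. It is not part of the Snakefile: it uses the standard-library zipfile module as a stand-in for the unzip command, and the temporary directory, the sample CSV content and the helper name extract_to_output_dir are assumptions made purely for illustration.

```python
import os
import tempfile
import zipfile


def extract_to_output_dir(zip_path, output_csv):
    """Extract zip_path into the directory that should hold output_csv.

    Roughly mirrors `unzip -o {input} -d {params.d}`: the target directory
    is derived from the declared output path instead of being written twice.
    (Hypothetical helper, for illustration only.)
    """
    target_dir = os.path.dirname(output_csv)
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(target_dir)
    return target_dir


# Build a throwaway archive shaped like the one in the question.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "compressed", "myfile.zip")
os.makedirs(os.path.dirname(zip_path))
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("data.csv", ",0,1\n0,0.1,0.2\n")

# The path that the rule would declare as its output.
output_csv = os.path.join(work, "uncompressed", "data.csv")
target_dir = extract_to_output_dir(zip_path, output_csv)

print(target_dir)                  # the directory part of output_csv
print(os.path.isfile(output_csv))  # whether the declared output now exists
```

The point of the design is that the file path is the single source of truth: Snakemake can match it against the input of load_data, and the directory passed to unzip is always computed from it.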