I'll try to answer your question B, and give extra details that I hope can be useful for you and others.
Edit: I added some attempts at answering question C at the end.
Regarding quotes
First, regarding what you call "inverted" commas, they are usually called "single quotes", and they are used in Python to build strings. Double quotes can also be used for the same purpose. The main difference appears when you create strings that themselves contain quotes: using double quotes lets you create strings containing single quotes, and vice versa. Otherwise, you need to "escape" the quote with a backslash (\):
s1 = 'Contains "double quotes"'
s1_bis = "Contains \"double quotes\""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains \'single quotes\''
(I tend to prefer double quotes, that's just a personal taste.)
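As a quick check, here is a minimal sketch (reusing the variable names above) suggesting that the quoting style only changes how the literal is written, not the resulting string:

```python
# Same strings, written either with the other quote style or with escapes
s1 = 'Contains "double quotes"'
s1_bis = "Contains \"double quotes\""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains \'single quotes\''

# Compare the values produced by the two spellings
print(s1 == s1_bis)
print(s2 == s2_bis)
```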
Decomposing the example
rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
You are assigning a function (lambda wildcards: ...) to a variable (sample), which happens to belong to the input section of a rule.
This causes snakemake to call this function when it needs to determine the input of a particular instance of the rule, based on the current values of the wildcards (as inferred from the output file it wants to generate).
For clarity, one could very likely rewrite this by separating the function definition from the rule declaration, without using the lambda construct, and it would work identically:
def determine_sample(wildcards):
    return expand(
        f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
        num=[1,2])

rule trim_galore_pe:
    input:
        sample = determine_sample
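To make the mechanism more concrete outside of snakemake, here is a small plain-Python sketch. The `config` content, the `SimpleNamespace` stand-in for the wildcards object and the `call_input_function` helper are all hypothetical; they only illustrate that a lambda and a named function stored in a variable can be called later in exactly the same way.

```python
from types import SimpleNamespace

# Hypothetical configuration, standing in for the one snakemake would load
config = {"samples": {"sampleA": "fastq_files/SRR0000001"}}


def determine_sample(wildcards):
    """Named function: build the two paired-end file names for a sample."""
    prefix = config["samples"][wildcards.sample]
    return [f"{prefix}_{num}.fastq.gz" for num in [1, 2]]


# Equivalent anonymous function stored in a variable
determine_sample_lambda = lambda wildcards: [
    f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]
]


def call_input_function(func, **wildcard_values):
    """Hypothetical helper imitating a later call with a wildcards object."""
    return func(SimpleNamespace(**wildcard_values))


print(call_input_function(determine_sample, sample="sampleA"))
print(call_input_function(determine_sample_lambda, sample="sampleA"))
```

Both calls should print the same list, since the function is only stored when it is assigned and is evaluated when it is finally called with a wildcards object.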
expand is a snakemake-specific function (but you can import it in any Python program or interactive interpreter with from snakemake.io import expand) that makes it easier to generate lists of strings. In the following interactive Python 3.6 session we will try to reproduce what happens when you use it, using different native Python constructs.
Accessing the configuration
# We'll try to see how `expand` works, we can import it from snakemake
from snakemake.io import expand
# We want to see how it works using the following example
# expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
# To make the example work, we will first simulate the reading
# of a configuration file
import yaml
config_text = """
samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
        - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
"""
# Here we used triple quotes, to have a readable multi-line string.
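# The following is roughly equivalent to what snakemake does with the configuration file
# (recent PyYAML versions require a Loader argument for yaml.load; safe_load avoids that):
config = yaml.safe_load(config_text)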
config
Output:
{'cell_types': {'Mono': ['Corces2016_4983.7A']},
'index': '/home/genomes_and_index_files/hg19',
'samples': {'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}}
We obtained a dictionary in which the key "samples" is associated with a nested dictionary.
# We can access the nested dictionary as follows
config["samples"]
# Note that single quotes could be used instead of double quotes
# Python interactive interpreter uses single quotes when it displays strings
Output:
{'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}
# We can access the value corresponding to one of the keys
# again using square brackets
config["samples"]["Corces2016_4983.7A_Mono"]
Output:
'fastq_files/SRR2920475'
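If the key is absent, square-bracket access raises a KeyError. The following sketch uses a hand-written, shortened copy of the dictionary above (so it runs without yaml) and a hypothetical key, `not_a_sample`, to show the difference with `dict.get`:

```python
# Shortened copy of the configuration dictionary shown above
config = {
    "samples": {
        "Corces2016_4983.7A_Mono": "fastq_files/SRR2920475",
        "Corces2016_4983.7B_Mono": "fastq_files/SRR2920476",
    }
}

# Square brackets raise KeyError for an unknown key
try:
    config["samples"]["not_a_sample"]
except KeyError as err:
    missing_key = err.args[0]
    print(f"No such sample: {missing_key}")

# dict.get returns None (or a chosen default) instead of raising
found = config["samples"].get("not_a_sample")
print(found)
```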
# Now, we will simulate a `wildcards` object that has a `sample` attribute
# We'll use a namedtuple for that
# https://docs.python.org/3/library/collections.html#collections.namedtuple
from collections import namedtuple
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
wildcards.sample
Output:
'Corces2016_4983.7A_Mono'
Edit (15/11/2018): I found out a better way of creating wildcards:
from snakemake.io import Wildcards
wildcards = Wildcards(fromdict={"sample": "Corces2016_4983.7A_Mono"})
# We can use this attribute as a key in the nested dictionary
# instead of using directly the string
config["samples"][wildcards.sample]
# No quotes here: `wildcards.sample` is a string variable
Output:
'fastq_files/SRR2920475'
Deconstructing the expand
# Now, the expand of the example works, and it results in a list with two strings
expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
# Note: here, single quotes are used for the string "samples",
# in order not to close the opening double quote of the whole string
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# Internally, I think what happens is something similar to the following:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"
# This template is then used for each element of this "list comprehension"
[filename_template.format(num=num) for num in [1, 2]]
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# This is equivalent to building the list using a for loop:
filenames = []
for num in [1, 2]:
    filename = filename_template.format(num=num)
    filenames.append(filename)
filenames
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
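As far as I understand, when several keyword arguments are given, expand by default combines every value of each one. The function below, `expand_sketch`, is a hypothetical plain-Python stand-in and not snakemake's actual implementation; it only illustrates the idea with `itertools.product`. The `ext` placeholder is made up for the example.

```python
from itertools import product


def expand_sketch(template, **wildcard_values):
    """Hypothetical stand-in for expand: format the template with every
    combination of the given values (not snakemake's real code)."""
    names = list(wildcard_values)
    combinations = product(*(wildcard_values[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combinations]


filename_template = "fastq_files/SRR2920475_{num}.{ext}"
filenames = expand_sketch(filename_template, num=[1, 2], ext=["fastq.gz"])
print(filenames)
```

With a single value for `ext`, this should give the same two file names as the list comprehension and the for loop above.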
String templates and formatting
# It is interesting to have a look at `filename_template`
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# The part between curly braces can be substituted
# during a string formatting operation:
"fastq_files/SRR2920475_{num}.fastq.gz".format(num=1)
Output:
'fastq_files/SRR2920475_1.fastq.gz'
Now let's further show how string formatting can be used.
# In python 3.6 and above, one can create formatted strings
# in which the values of variables are interpreted inside the string
# if the string is prefixed with `f`.
# That's what happens when we create `filename_template`:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
Two substitutions happened during the formatting of the string:
The value of config['samples'][wildcards.sample] was used to make the first part of the string. (Single quotes were used around samples because this Python expression was inside a string built with double quotes.)
The double braces around num were reduced to single ones as part of the formatting operation. That's why we can then use the result again in further formatting operations involving num.
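To see why the doubling matters, here is a small sketch with a hypothetical hard-coded prefix and two made-up helper functions. With single braces, the f-string tries to evaluate `num` immediately and fails if no such variable exists yet; with double braces, a literal `{num}` placeholder survives for a later `.format` call.

```python
def build_template(prefix):
    # Double braces: `{{num}}` becomes the literal text `{num}`
    return f"{prefix}_{{num}}.fastq.gz"


def build_too_early(prefix):
    # Single braces: Python looks up a variable named `num` right away
    return f"{prefix}_{num}.fastq.gz"


template = build_template("fastq_files/SRR2920475")
print(template)                # the placeholder is still present
print(template.format(num=2))  # filled in during a second formatting step

try:
    build_too_early("fastq_files/SRR2920475")
except NameError as err:
    # No variable called `num` exists at this point
    error_message = str(err)
    print(error_message)
```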
# Equivalently, without using 3.6 syntax:
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix=config["samples"][wildcards.sample])
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# We could achieve the same by first extracting the value
# from the `config` dictionary
filename_prefix = config["samples"][wildcards.sample]
filename_template = f"{filename_prefix}_{{num}}.fastq.gz"
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# Or, equivalently:
filename_prefix = config["samples"][wildcards.sample]
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix=filename_prefix)
filename_template
Output:
'fastq_files/SRR2920475_{num}.fastq.gz'
# We can actually perform string formatting on several variables
# at the same time:
filename_prefix = config["samples"][wildcards.sample]
num = 1
"{filename_prefix}_{num}.fastq.gz".format(
    filename_prefix=filename_prefix, num=num)
Output:
'fastq_files/SRR2920475_1.fastq.gz'
# Or, using 3.6 formatted strings
filename_prefix = config["samples"][wildcards.sample]
num = 1
f"{filename_prefix}_{num}.fastq.gz"
Output:
'fastq_files/SRR2920475_1.fastq.gz'
# We could therefore build the result of the expand in a single step:
[f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]]
Output:
['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
Comments about question C
The following is a bit complex, in terms of how Python will build the string:
input:
    lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
But it should work, as we can see in the following simulation:
from collections import namedtuple
from snakemake.io