# Prevent Snakemake from changing wildcards: sample for every rule

Bioinformatics Asked by Freek on August 28, 2021

EDIT: I posted this as a bug, perhaps wait for the result of that? github

All,

I am using Snakemake to build a pipeline. To define input and output names, I constructed them based on a filename base (from the samples table), a read indicator (like R1, R2) and the fastq file extension, like so:

samples = pd.read_csv('samples.tsv', sep='t').set_index('samples', drop=False)
fastq_file_extension = _001.fastq.gz

rule fastqc:
input:
output:
wrapper:
"0.49.0/bio/fastqc"


However, during pipeline execution already I can see strange things happening, for this rule it reports:

[Fri Feb 21 12:39:57 2020]
rule fastqc:
input: fastq_raw/0053_P2017BB3S20R_S2_R1_001.fastq.gz
output: qc/fastqc/0053_P2017BB3S20R_S2_R1_fastqc.html, qc/fastqc/0053_P2017BB3S20R_S2_R1_fastqc.zip
jobid: 10


As you can see, Snakemake pulled the R part of the read to the sample and set the read to 1 (should be R1). But this still works because it produces output files as Snakemake expects.

But the next rule, trim_galore_pe:

rule trim_galore_pe:
input:
output:
params:
extra="--retain_unpaired"
wrapper:
"0.49.0/bio/trim_galore/pe"


This goes wrong, during pipeline execution Snakemake reports:

[Fri Feb 21 12:40:37 2020]
Error in rule trim_galore_pe:
jobid: 7
output: fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz, fastq_trimmed/0053_P2017BB3S20R_S2_.2_val_2.fq.gz, qc/trim_galore/0053_P2017BB3S20R_S2_.1.fastq.gz_trimming_report.txt, qc/trim_galore/0053_P2017BB3S20R_S2_.2.fastq.gz_trimming_report..txt
conda-env: /home/nlv24077/projects/snaqs/snakemake_envs/1b94430b
cluster_jobid: 11904662.osios
Error executing rule trim_galore_pe on cluster (jobid: 7, external: 11904662.osios, jobscript: /home/nlv24077/temp/RNASeqPairedEnd/.snakemake/tmp.ooypuklc/snakejob.trim_galore_pe.7.sh).
For error details see the cluster log and the log files of the involved rule(s).


This is understandeble because the file fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz does not exist, it should be fastq_trimmed/0053_P2017BB3S20R_S2.1_val_1.fq.gz, somehow Snakemake added an extra underscore, which you can see in the execution output, under “wildcards” (0053_P2017BB3S20R_S2_).

[Fri Feb 21 12:39:57 2020]
rule trim_galore_pe:
input: fastq_raw/0053_P2017BB3S20R_S2_R1_001.fastq.gz, fastq_raw/0053_P2017BB3S20R_S2_R2_001.fastq.gz
output: fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz, fastq_trimmed/0053_P2017BB3S20R_S2_.2_val_2.fq.gz, qc/trim_galore/0053_P2017BB3S20R_S2_.1.fastq.gz_trimming_report.txt, qc/trim_galore/0053_P2017BB3S20R_S2_.2.fastq.gz_trimming_report..txt
jobid: 7
wildcards: sample=0053_P2017BB3S20R_S2_


So how can I prevent Snakemake from changing the samples and read wildcard in seemingly arbitrary ways?

Placing two wildcards directly behind each other will make it impossible for snakemake to determine where one wildcard ends and the other begins. So it is simply taking the longest possible match for the first wildcard (sample) and then one character for the second (read). This will also cause the _ to be included in the wildcard value sample.

I'm assuming that this same pattern is used for the inputs of another rule to refer to the trim_galore_pe output. Hence why the _ is included in the wildcards there.

You might be able to fix this, simply by adding a _ in-between {sample} and {readgroup} in your first rule (and any other rules using this same pattern):

rule fastqc:
input:
output:
wrapper:
"0.49.0/bio/fastqc"


Answered by DavyCats on August 28, 2021

I think you want to prevent snakemake from interpreting your wildcard strings as regular expressions. For this, escape each wildcard value using wildcard_constraints. E.g., towards the top of your Snakefile add:

wildcard_constraints:
sample= '|'.join([re.escape(x) for x in samples])


(Personally, I use the wildcard_constraints directive as above quite liberally to constrain every wildcard)

Answered by dariober on August 28, 2021

## Related Questions

### Unable to open .bam file in C++ using SeqAn due to ‘seqan::UnknownExtensionError’

3  Asked on August 10, 2021

### How to export summary data from Bowtie2 for MultiQC to read?

1  Asked on August 7, 2021

### What are some ways to error correct Oxford Nanopore long read sequencing?

1  Asked on August 3, 2021

### BioPython – Retrieve sequence records from pubmed database

0  Asked on August 3, 2021

### How to plot a PCOA biplot with OTU loadings as arrows

0  Asked on August 2, 2021 by gal-t

### Nonbonded interactions on the GPU

1  Asked on July 31, 2021

### CIBERSORT runtime error eval failed

1  Asked on July 28, 2021 by dr_hope

### How can I get bedtools to tell me which genes are being expressed?

1  Asked on July 27, 2021

### Receptor-ligand database

3  Asked on July 23, 2021

### Standard Way to Preprocess Gene Expression?

1  Asked on July 23, 2021 by wedgeantilles

### Is there a way to measure cell line similarity using python?

1  Asked on July 22, 2021 by yousef

### How to convert a SDF molecular structure file to a pandas dataframe with Python?

1  Asked on July 21, 2021

### cmapPy exception

1  Asked on July 20, 2021 by blue-bells

2  Asked on July 19, 2021 by c-zeil

### Understanding “Measurement of mRNA abundance using RNA-seq data” by Wagner et al 2012

1  Asked on July 19, 2021

### Understanding ViennaRNA RNAdistance scoring table

1  Asked on July 18, 2021

### Viral genome finishing

1  Asked on July 17, 2021

### Installing HiC-Pro issues

2  Asked on July 16, 2021

### “IndexError: list index out of range” in snakemake

0  Asked on July 15, 2021 by vitaly

### How to remove sequences from a fasta file using a sequence ID list which contains a space within the id?

2  Asked on July 15, 2021

Get help from others!