AnswerBun.com

Prevent Snakemake from changing wildcards: sample for every rule

Bioinformatics Asked by Freek on August 28, 2021

EDIT: I posted this as a bug, perhaps wait for the result of that? github

All,

I am using Snakemake to build a pipeline. To define input and output names, I constructed them based on a filename base (from the samples table), a read indicator (like R1, R2) and the fastq file extension, like so:

samples = pd.read_csv('samples.tsv', sep='t').set_index('samples', drop=False)
reads = ['R1', 'R2']
fastq_file_extension = _001.fastq.gz
rule fastqc:
    input:
        os.path.join(config['Directories']['fastq_raw_dir'], '{sample}{read}' + fastq_file_extension)
    output:
        html=os.path.join(config['Directories']['fastqc_dir'], '{sample}{read}_fastqc.html'),
        zip=os.path.join(config['Directories']['fastqc_dir'], '{sample}{read}_fastqc.zip')
    wrapper:
        "0.49.0/bio/fastqc"

However, during pipeline execution already I can see strange things happening, for this rule it reports:

[Fri Feb 21 12:39:57 2020]
rule fastqc:
input: fastq_raw/0053_P2017BB3S20R_S2_R1_001.fastq.gz
output: qc/fastqc/0053_P2017BB3S20R_S2_R1_fastqc.html, qc/fastqc/0053_P2017BB3S20R_S2_R1_fastqc.zip
jobid: 10
wildcards: sample=0053_P2017BB3S20R_S2_R, read=1

As you can see, Snakemake pulled the R part of the read to the sample and set the read to 1 (should be R1). But this still works because it produces output files as Snakemake expects.

But the next rule, trim_galore_pe:

rule trim_galore_pe:
    input:
        [os.path.join(config['Directories']['fastq_raw_dir'], '{sample}' + read + fastq_file_extension) for read in reads]
    output:
        fastq_read1 = os.path.join(config['Directories']['fastq_trimmed_dir'], '{sample}.1_val_1.fq.gz'),
        fastq_read2 = os.path.join(config['Directories']['fastq_trimmed_dir'], '{sample}.2_val_2.fq.gz'),
        report_read1 = os.path.join(config['Directories']['trim_galore_qc_dir'], '{sample}.1.fastq.gz_trimming_report.txt'),
        report_read2 = os.path.join(config['Directories']['trim_galore_qc_dir'], '{sample}.2.fastq.gz_trimming_report.txt')
    params:
        extra="--retain_unpaired"
    wrapper:
        "0.49.0/bio/trim_galore/pe"

This goes wrong, during pipeline execution Snakemake reports:

[Fri Feb 21 12:40:37 2020]
Error in rule trim_galore_pe:
jobid: 7
output: fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz, fastq_trimmed/0053_P2017BB3S20R_S2_.2_val_2.fq.gz, qc/trim_galore/0053_P2017BB3S20R_S2_.1.fastq.gz_trimming_report.txt, qc/trim_galore/0053_P2017BB3S20R_S2_.2.fastq.gz_trimming_report..txt
conda-env: /home/nlv24077/projects/snaqs/snakemake_envs/1b94430b
cluster_jobid: 11904662.osios
Error executing rule trim_galore_pe on cluster (jobid: 7, external: 11904662.osios, jobscript: /home/nlv24077/temp/RNASeqPairedEnd/.snakemake/tmp.ooypuklc/snakejob.trim_galore_pe.7.sh).
For error details see the cluster log and the log files of the involved rule(s).

This is understandeble because the file fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz does not exist, it should be fastq_trimmed/0053_P2017BB3S20R_S2.1_val_1.fq.gz, somehow Snakemake added an extra underscore, which you can see in the execution output, under “wildcards” (0053_P2017BB3S20R_S2_).

[Fri Feb 21 12:39:57 2020]
rule trim_galore_pe:
input: fastq_raw/0053_P2017BB3S20R_S2_R1_001.fastq.gz, fastq_raw/0053_P2017BB3S20R_S2_R2_001.fastq.gz
output: fastq_trimmed/0053_P2017BB3S20R_S2_.1_val_1.fq.gz, fastq_trimmed/0053_P2017BB3S20R_S2_.2_val_2.fq.gz, qc/trim_galore/0053_P2017BB3S20R_S2_.1.fastq.gz_trimming_report.txt, qc/trim_galore/0053_P2017BB3S20R_S2_.2.fastq.gz_trimming_report..txt
jobid: 7
wildcards: sample=0053_P2017BB3S20R_S2_

So how can I prevent Snakemake from changing the samples and read wildcard in seemingly arbitrary ways?

2 Answers

Placing two wildcards directly behind each other will make it impossible for snakemake to determine where one wildcard ends and the other begins. So it is simply taking the longest possible match for the first wildcard (sample) and then one character for the second (read). This will also cause the _ to be included in the wildcard value sample.

I'm assuming that this same pattern is used for the inputs of another rule to refer to the trim_galore_pe output. Hence why the _ is included in the wildcards there.

You might be able to fix this, simply by adding a _ in-between {sample} and {readgroup} in your first rule (and any other rules using this same pattern):

rule fastqc:
    input:
        os.path.join(config['Directories']['fastq_raw_dir'], '{sample}_{read}' + fastq_file_extension)
    output:
        html=os.path.join(config['Directories']['fastqc_dir'], '{sample}{read}_fastqc.html'),
        zip=os.path.join(config['Directories']['fastqc_dir'], '{sample}{read}_fastqc.zip')
    wrapper:
        "0.49.0/bio/fastqc"

Answered by DavyCats on August 28, 2021

I think you want to prevent snakemake from interpreting your wildcard strings as regular expressions. For this, escape each wildcard value using wildcard_constraints. E.g., towards the top of your Snakefile add:

wildcard_constraints:
    read= '|'.join([re.escape(x) for x in reads]),
    sample= '|'.join([re.escape(x) for x in samples])

(Personally, I use the wildcard_constraints directive as above quite liberally to constrain every wildcard)

Answered by dariober on August 28, 2021

Add your own answers!

Related Questions

How to plot a PCOA biplot with OTU loadings as arrows

0  Asked on August 2, 2021 by gal-t

   

CIBERSORT runtime error eval failed

1  Asked on July 28, 2021 by dr_hope

   

Standard Way to Preprocess Gene Expression?

1  Asked on July 23, 2021 by wedgeantilles

 

cmapPy exception

1  Asked on July 20, 2021 by blue-bells

     

Timeout when downloading the ncbi nr blast database

2  Asked on July 19, 2021 by c-zeil

   

Viral genome finishing

1  Asked on July 17, 2021

       

Ask a Question

Get help from others!

© 2023 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir