
Normalize RNA-seq data from multiple runs for expression analysis

Bioinformatics Asked on June 25, 2021

I have RNA samples sequenced with the TruSeq Stranded Total RNA kit protocol on Illumina HiSeq (2x125bp) and NovaSeq (2x150bp) platforms – almost 100 samples altogether. I have to use these data for expression analysis. The question is: how do we normalize samples from different runs and with different read lengths? What is the best way (with available packages) to normalize the counts (to overcome the bias) and do a DE analysis? Any suggestion would be helpful.

5 Answers

I don't think RNA-seq is that sensitive to running different samples on different instruments. Differing read lengths, however, will change things. You could always trim the reads down so they all match, or include read length as a factor in the design when looking for DE genes.
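For the design-factor route, here is a minimal DESeq2 sketch; `counts`, `meta`, and the `platform`/`condition` columns are assumed placeholders, not anything from the original question:

```r
library(DESeq2)

# Include the platform (and hence read length) as a covariate in the
# DE design. `counts` is a gene-by-sample matrix of raw counts; `meta`
# is a sample table with hypothetical `platform` and `condition` columns.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ platform + condition)
dds <- DESeq(dds)
res <- results(dds)   # tests the condition effect, adjusted for platform
```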

Answered by swbarnes2 on June 25, 2021

There are a number of methods. If you're doing DE, you have ComBat-seq, svaseq, RUVSeq, and BUSseq. You could also try Z-score normalization ($(x - \overline{x}) / \sigma$) or even quantile transformation. For the latter two, make sure you work on each batch individually, not on the whole dataset at once. See more on that here.
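A minimal base-R sketch of per-batch Z-scoring; `expr` (a log-scale gene-by-sample matrix) and `batch` (one label per column) are assumed placeholders:

```r
# Z-score each gene within each batch separately, not across the whole
# dataset at once. Genes with zero variance within a batch yield NaN.
z <- expr
for (b in unique(batch)) {
  cols <- batch == b
  m <- rowMeans(expr[, cols, drop = FALSE])
  s <- apply(expr[, cols, drop = FALSE], 1, sd)
  z[, cols] <- (expr[, cols, drop = FALSE] - m) / s
}
```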

To visually check whether a transformation works, plotting PCA/UMAP/t-SNE on the raw and transformed data can offer some insight.

Answered by Roman Luštrik on June 25, 2021

Normalization and batch-effect correction are two different things.

You always have to normalize RNA-seq data in order to be able to compare runs and, for instance, do DE analysis. There are plenty of methods; search for them under "normalization methods".

What you are asking about and concerned with are "batch effects" in your data as a consequence of having used two different Illumina machines. This is a valid concern, but before you try to correct for or mitigate it, you first have to assess whether it is actually there. Maybe it is not there at all. To figure out whether you have batch effects, you first have to do data exploration, which is why PCA/UMAP/t-SNE has already been suggested. In these plots you will be looking to see whether your samples cluster by machine type instead of by the biology they represent. Read up on how to assess the presence of batch effects.
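A minimal sketch of that exploration with DESeq2, assuming `counts` is a gene-by-sample matrix and `meta` has hypothetical `machine` and `condition` columns:

```r
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ condition)
vsd <- vst(dds, blind = TRUE)   # variance-stabilize before PCA

# If samples separate by machine rather than by condition in the first
# PCs, a batch effect is likely present.
plotPCA(vsd, intgroup = "machine")
plotPCA(vsd, intgroup = "condition")
```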

If you do observe a batch effect, or want to correct for it anyway, then you can indeed try one of the methods already mentioned: ComBat-seq, svaseq, RUVSeq, BUSseq.
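For example, a hedged ComBat-seq sketch (the `ComBat_seq` function lives in the sva package); `counts`, `meta$machine`, and `meta$condition` are assumed placeholders:

```r
library(sva)

# ComBat-seq works on raw, untransformed integer counts. Passing the
# biological group via `group` protects that signal during correction.
adjusted <- ComBat_seq(counts,
                       batch = meta$machine,
                       group = meta$condition)
```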

Rather than the two different machines themselves, the batch effect might actually be due to different people doing the library preparation, different experimental settings if the experiments were done at different times, etc.

Answered by JRodrigoF on June 25, 2021

I agree with swbarnes2 that you don't need to explicitly model different read lengths, but you do need to normalize for the different yields of the sequencing runs. Among the several available packages, DESeq2 is one of the most widely used. Its normalization method computes the geometric mean of expression of each gene across samples; then, the expression of the gene in each sample is divided by that geometric mean. Once this has been done for all genes, the median of these ratios is computed for each sample, and this is the "size factor". It looks convoluted, but it is actually straightforward, and it is implemented in DESeq2 and in other software packages. You can find the methodological details here. Finally, DESeq2 also offers the possibility to account for additional experimental details (e.g. read length), and this should reduce the risk of spurious results.
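For concreteness, a base-R sketch of the median-of-ratios procedure just described; `counts` is an assumed gene-by-sample matrix of raw counts:

```r
# Median-of-ratios size factors, following the procedure described above.
log_geo_means <- rowMeans(log(counts))      # log geometric mean per gene
size_factors <- apply(counts, 2, function(s) {
  keep <- is.finite(log_geo_means) & s > 0  # skip genes with any zero count
  exp(median(log(s)[keep] - log_geo_means[keep]))
})
norm_counts <- sweep(counts, 2, size_factors, "/")

# DESeq2 does the same via estimateSizeFactors(dds) / sizeFactors(dds).
```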

Answered by Fabio Marroni on June 25, 2021

  1. If you have different read lengths, trim all samples to the shortest length across batches; that will avoid mappability bias. Including read length in the DE design is possible, but since you can completely eliminate this bias in silico by trimming, that is what I'd do (see the first sketch after this list).

  2. Sequencing on different machines is usually not much of a confounding factor, but the fact that two machines were involved suggests different experimental batches. I hope that batch is not confounded with group; each batch should contain samples of all groups, to avoid a general batch effect that cannot be corrected for. Library prep should (in a well-designed experiment) be fully identical.

  3. Use any of the established methods, be it RLE or TMM (they usually perform very similarly), and then run PCA on the top variable log2-transformed counts to explore whether there are batch effects worth correcting, i.e., whether you see clustering or differences driven by batch rather than by experimental design in the early PCs (see the second sketch below). If so (given it is not confounded with experimental groups), you can include batch in the DE design as a covariate.
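For item 1, trimming to a common length can be done with command-line tools, or directly in R with ShortRead; a minimal sketch, assuming placeholder file names and raw, untrimmed NovaSeq reads of a fixed 150 bp:

```r
library(ShortRead)

# Trim 2x150 bp NovaSeq reads down to 125 bp to match the HiSeq reads.
# File names are placeholders; repeat per file and per mate. Assumes all
# reads are at least 125 bp long (true for raw fixed-length reads).
strm <- FastqStreamer("novaseq_R1.fastq.gz")
while (length(fq <- yield(strm))) {
  writeFastq(narrow(fq, start = 1, end = 125),
             "novaseq_R1.trimmed.fastq.gz", mode = "a", compress = TRUE)
}
close(strm)
```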
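For item 3, a sketch with edgeR's TMM followed by PCA on the most variable genes; `counts` and `meta` (with a hypothetical `batch` column) are assumed inputs:

```r
library(edgeR)

# TMM-normalize, then PCA on the top 1000 most variable genes.
y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")
logcpm <- cpm(y, log = TRUE)   # log2 CPM using TMM-scaled library sizes

top <- order(apply(logcpm, 1, var), decreasing = TRUE)[1:1000]
pca <- prcomp(t(logcpm[top, ]))

# Clustering by batch rather than by condition in PC1/PC2 suggests a
# batch effect worth modeling as a covariate.
plot(pca$x[, 1:2], col = factor(meta$batch), pch = 19,
     xlab = "PC1", ylab = "PC2")
```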

As said by others, normalization and batch correction are independent processes. Also, as I said above, you can eliminate the read-length bias in silico (even though I doubt it is a major factor here), which is what I'd recommend, so you do not have to bother with it downstream.

Answered by ATpoint on June 25, 2021
