
Standard Way to Preprocess Gene Expression?

Bioinformatics Asked by WedgeAntilles on July 23, 2021

I am trying to collect gene expression data for the purpose of fitting gene regulatory networks. My background is primarily in computer science, so I am finding the biological literature a bit difficult to penetrate.

Specifically, I want to fit two regulatory networks for several cancers in TCGA, from tumor and normal samples. I have built queries using the R package TCGAbiolinks and was able to follow the tutorial here up to the normalization part.

So I have two questions:

  • Is there a standard way to normalize gene expression counts? If so, what is it, and is it available as a library in R, Python, or some other language?
  • If not, can I just center and scale (z-transform) the samples? I am not interested in differential expression, only in quantifying expression from counts.

One Answer

First off - do not use FPKM for across-sample normalization. This is such a common issue that I have a prepared text for it:

Why not to use FPKM

1. They have poor sensitivity when looking at lowly expressed genes [1, 2].
2. Short transcripts/genes may have falsely over-estimated expression values [5].
3. They may not normalise adequately across samples [3]. The reason is that FPKM values are normalised across libraries using a single variable, the "total number of reads". This library-size estimate is easily biased by highly expressed, highly variable genes, unlike the other approaches, which model the library-size factor from summary statistics extracted over all genes. In other words, as stated in [4]: "The median ratio method is usually quite robust. The biggest problem for library size correction is not actually the genes with low counts, but the genes with very high counts, which have high variance and disproportionately influence estimators like the total sum. This is why the total sum is a notoriously bad estimator for library size." Lior Pachter makes the same point in [6].
4. They have a high false positive rate in differential expression analysis [1, 2]. Many of the genes/transcripts deemed differentially expressed are not, mostly due to artefactual variability in the expression estimates (points 1 and 2) and inadequate library-size estimation (point 3).
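To see why the total sum is a poor library-size estimator, here is a toy sketch in Python/NumPy of the median-of-ratios idea quoted in [4]. The numbers are invented: sample 2 is sequenced at twice the depth of sample 1 everywhere except one highly expressed outlier gene. DESeq2's estimateSizeFactors() is the real implementation of this idea.

```python
import numpy as np

# Toy count matrix (genes x samples), values made up for illustration.
# Sample 2 has 2x the depth of sample 1, except the last gene, which is
# a highly expressed outlier that inflates sample 2's total read count.
counts = np.array([
    [10,     20],
    [100,   200],
    [5,      10],
    [1000, 8000],   # outlier gene
], dtype=float)

# Median-of-ratios: per-gene log-ratio to the gene's geometric mean
# across samples, then the per-sample median of those ratios.
log_counts = np.log(counts)
log_geomeans = log_counts.mean(axis=1)
finite = np.isfinite(log_geomeans)            # drop genes with zero counts
size_factors = np.exp(
    np.median(log_counts[finite] - log_geomeans[finite, None], axis=0)
)
```

On this toy data the median-ratio size factors differ by exactly 2x (the true depth difference), while the total counts differ by roughly 7.4x because the outlier gene dominates the sum.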

On the other hand, approaches such as DESeq, DESeq2 (including its vst) or voom normalise adequately between samples, control the FDR, and can reliably determine differential expression even for lowly expressed genes, provided there is enough replication or an adequate log2 fold-change cutoff is selected [7]. If you really insist on using cufflinks, consider performing upper-quartile normalisation, which corrects the library-size estimation issue (point 3 above).
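Upper-quartile normalisation is simple enough to sketch directly: each sample is scaled by the 75th percentile of its nonzero gene counts rather than by its total read count, which makes it robust to a single dominant gene. The counts below are again invented for illustration.

```python
import numpy as np

# Toy counts (genes x samples): sample 2 is sequenced at twice the depth
# of sample 1, except one highly expressed outlier gene at the bottom.
counts = np.array([
    [5,       10],
    [10,      20],
    [20,      40],
    [50,     100],
    [100,    200],
    [200,    400],
    [500,   1000],
    [1000, 16000],   # outlier: inflates sample 2's total count
], dtype=float)

# Upper-quartile normalisation: scale each sample by the 75th percentile
# of its nonzero counts instead of by its total read count.
uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
scaled = counts / uq
```

Here the upper-quartile factors recover the true 2x depth difference, whereas the total counts differ by more than 9x because of the outlier gene.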

References

[1] Dillies M.A. et al., Briefings in Bioinformatics, 2012
[2] Soneson & Delorenzi, BMC Bioinformatics, 2013
[3] Lior Pachter's lecture at CSHL (Cold Spring Harbor Laboratory), from minute 32 on: https://www.youtube.com/watch?v=5NiFibnbE8o
[4] From the answer to: "Over-correction" in the size-factors of the DESeq2 package
[5] http://www.biomedcentral.com/1471-2105/14/370
[6] https://liorpachter.wordpress.com/2014/04/30/estimating-number-of-transcripts-from-rna-seq-measurements-and-why-i-believe-in-paywall/
[7] http://rnajournal.cshlp.org/content/22/6/839

What I recommend is that you get the raw count data from TCGA, create a count matrix using the DESeq2 package in R, use the function varianceStabilizingTransformation() from DESeq2 to further process your counts, and model the networks from there. Your pipeline already mentions voom (from the limma package), which is another good alternative. All of these are available on Bioconductor.
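A minimal R sketch of that pipeline, assuming `counts` is a genes-by-samples integer matrix assembled from the TCGA query and `coldata` is a data.frame describing each sample (both names are placeholders, not part of any package):

```r
library(DESeq2)

# counts:  integer matrix (genes x samples) of raw TCGA counts
# coldata: data.frame with one row per sample and a 'condition'
#          column, e.g. "tumor" vs "normal"
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

# Drop genes with essentially no counts before transforming
dds <- dds[rowSums(counts(dds)) > 1, ]

# Variance-stabilized expression values on a log2-like scale,
# suitable as input for network fitting
vsd  <- varianceStabilizingTransformation(dds, blind = TRUE)
expr <- assay(vsd)
```

Setting blind = TRUE makes the transformation ignore the design, which is appropriate here since the goal is quantification rather than differential expression.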

Answered by Bastian Schiffthaler on July 23, 2021
