TransWikia.com

Why use "robust" FPKMs?

Bioinformatics Asked on February 3, 2021

Both DESeq2 and edgeR have an FPKM/RPKM function that by default uses normalized library sizes ("robust" option in DESeq2). FPKMs have their own issues, but I thought the main benefit was to have comparable units across independent experiments. If you are adding an experiment-specific normalization factor, then why even use FPKMs?

One Answer

FPKMs are inherently experiment-specific and cannot be used to compare across samples. Let's consider the following two sequencing runs. Let $E1$ and $E2$ be the true, underlying expression of genes 1-6 in two samples. Let $S1$ and $S2$ be the observed expression in our sequencing.

$$
\begin{matrix}
Gene & E1 & S1 & E2 & S2 \\
G1 & 100 & 10 & 100 & 20 \\
G2 & 100 & 10 & 100 & 20 \\
G3 & 100 & 10 & 100 & 20 \\
G4 & 100 & 10 & 0 & 0 \\
G5 & 100 & 10 & 0 & 0 \\
G6 & 100 & 10 & 0 & 0
\end{matrix}
$$

Our totals are 60 counts for $S1$ and 60 for $S2$. We sequenced both libraries to the same depth, but as fewer genes are expressed in $E2$, each expressed gene naturally captures a larger share of the reads. For simplicity's sake, let's assume the genes all have the same length of 1, so we can ignore the "K" in FPKM. Let's also forget the scaling by a million, since that's just to make nice numbers. Dividing each count by its library total gives:

$$
\begin{matrix}
Gene & E1 & S1 & E2 & S2 \\
G1 & 100 & 0.167 & 100 & 0.333 \\
G2 & 100 & 0.167 & 100 & 0.333 \\
G3 & 100 & 0.167 & 100 & 0.333 \\
G4 & 100 & 0.167 & 0 & 0 \\
G5 & 100 & 0.167 & 0 & 0 \\
G6 & 100 & 0.167 & 0 & 0
\end{matrix}
$$
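The depth-only normalization above can be reproduced with a few lines of Python. This is a minimal sketch on the toy counts from the first table; the helper name `depth_normalize` is hypothetical, and gene length and the per-million scaling are dropped, as in the text.

```python
# Observed counts for genes G1..G6 in the two sequencing runs (first table).
s1 = [10, 10, 10, 10, 10, 10]
s2 = [20, 20, 20, 0, 0, 0]


def depth_normalize(counts):
    """Divide each count by the library total (FPKM without the K and the M)."""
    total = sum(counts)
    return [c / total for c in counts]


fpkm_like_s1 = depth_normalize(s1)
fpkm_like_s2 = depth_normalize(s2)

# G1 has the same true expression in both samples, yet its normalized
# value in the second sample is twice as large.
print(round(fpkm_like_s1[0], 3), round(fpkm_like_s2[0], 3))  # 0.167 0.333
```

Both libraries have the same total, so the factor of two for G1-G3 comes purely from the different composition of the two samples.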

Now let's calculate size factors. In DESeq2, we create a pseudo-sample which is the geometric mean of each gene's counts across samples, so $\sqrt{10 \cdot 20} = 14.14$ for genes 1-3, while genes 4-6 are ignored as they have a member with 0-valued counts. The size factor for each library is the median, over genes, of the ratio between that library's counts and the pseudo-sample:

$$ med([\frac{10}{14.14}, \frac{10}{14.14}, \frac{10}{14.14}]) = 0.71 $$

$$ med([\frac{20}{14.14}, \frac{20}{14.14}, \frac{20}{14.14}]) = 1.41 $$
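The median-of-ratios calculation can be sketched in plain Python as follows. This is an illustration of the steps described here on the toy counts, not DESeq2's actual implementation; the function name `size_factors` is hypothetical.

```python
import math
from statistics import median

# Observed counts for genes G1..G6 in the two sequencing runs.
s1 = [10, 10, 10, 10, 10, 10]
s2 = [20, 20, 20, 0, 0, 0]
samples = [s1, s2]


def size_factors(samples):
    """Median-of-ratios size factors, skipping genes with any zero count."""
    n_genes = len(samples[0])
    # Keep only genes that are non-zero in every sample (G1-G3 here).
    kept = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    # Pseudo-sample: per-gene geometric mean across samples.
    geo_means = {
        g: math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in kept
    }
    # Size factor: median ratio of each sample to the pseudo-sample.
    return [median(s[g] / geo_means[g] for g in kept) for s in samples]


sf1, sf2 = size_factors(samples)
print(round(sf1, 2), round(sf2, 2))  # 0.71 1.41
```

The geometric mean is computed through logarithms, which is why genes with a zero count have to be dropped before the ratios are taken.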

So where do we end up if we divide by the size factors?

$$
\begin{matrix}
Gene & E1 & S1 & E2 & S2 \\
G1 & 100 & 14.14 & 100 & 14.14 \\
G2 & 100 & 14.14 & 100 & 14.14 \\
G3 & 100 & 14.14 & 100 & 14.14 \\
G4 & 100 & 14.14 & 0 & 0 \\
G5 & 100 & 14.14 & 0 & 0 \\
G6 & 100 & 14.14 & 0 & 0
\end{matrix}
$$
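As a quick check of this last step, the following sketch divides the raw toy counts by their size factors and tests whether the genes with equal true expression now agree between samples. The size factors are written out in unrounded form, $10/\sqrt{200}$ and $20/\sqrt{200}$, which is an assumption made here to avoid rounding noise.

```python
import math

# Observed counts for genes G1..G6 in the two sequencing runs.
s1 = [10, 10, 10, 10, 10, 10]
s2 = [20, 20, 20, 0, 0, 0]

# Unrounded size factors from the median-of-ratios step (about 0.71 and 1.41).
sf1 = 10 / math.sqrt(10 * 20)
sf2 = 20 / math.sqrt(10 * 20)

# Normalized counts: raw counts divided by the library's size factor.
norm_s1 = [c / sf1 for c in s1]
norm_s2 = [c / sf2 for c in s2]

# G1-G3 have equal true expression and now agree between the two samples.
print(all(math.isclose(a, b) for a, b in zip(norm_s1[:3], norm_s2[:3])))  # True
```

Genes 4-6 remain at zero in the second sample, as they should, since they are not expressed there.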

So in this simple example, size factors produce a better estimate than FPKM or TPM for comparing values across experiments, hence the "robustness". Please note that this is still not great for any statistical testing, and in general these values should be used with caution anywhere.

Finally, I'll leave some more reading material concerning FPKMs and cross-sample comparisons here:

Answered by Bastian Schiffthaler on February 3, 2021
