
WGBS vs ONT, which one should I trust?

Bioinformatics Asked on April 14, 2021

First of all, let me apologize if this is not the place for this question. While it can be quite broad, I will try to make it as specific as I can.

Context

I am testing the correlation between Whole Genome Bisulfite Sequencing and Oxford Nanopore methylation calls for one individual (mammal, blood tissue) with replicates at all CpG sites. In the (very) best cases I find a correlation close to 0.9. Yet, for the fraction that does not correlate, I would like to know which technology is closer to the "biological truth".

My strategy

My plan would be to first find CpG sites/short CpG regions showing differential methylation states between WGBS and ONT. Then I would validate the methylation states of these sites using a third-party technology such as PyroMark.

What I am asking the SE community

  • First, has anyone done this before? I would love to hear about your experience.

  • Do you see any critical aspects I should be watching out for?

  • Last, if you have comments about the strategy I propose here, they are welcome.

2 Answers

Methylation can be cell-specific, which makes it difficult to evaluate accuracy at the bulk-cell level (even within the same tissue). How can you tell whether the differences you're seeing are due to platform differences or to biological variation?

I find that adding more haystacks doesn't help much in working out the truth of a dataset. If you want to investigate biological truth, it would be better to create a biological system with a known methylation (or demethylation) pattern and test that.

Answered by gringer on April 14, 2021

I would assess directionality and accuracy of prediction by 1) WGBS predicting ONT and then 2) ONT predicting WGBS.

Firstly, I would use deep learning (or machine learning) to train WGBS against ONT, parameterise, and then test; then conversely train ONT against WGBS, parameterise, and then test. The approach with the highest accuracy of prediction (using the 'accuracy' index) would be the one assumed to be "closer to the biological truth". If both calculations produced comparable accuracies, I would conclude, as @gringer has stated, that natural variation / natural heterogeneity is the predominant signal in the data. This conclusion depends on the ability of deep learning to map the biological process.

This approach is highly applicable to a deep learning estimate regardless. It may not replace a truly controlled test (that will depend on what controls you have run), but it could provide a valuable and easily obtainable insight in their absence.

If you did run a control sample, then you are truly in business, because it would provide the training set, while the WGBS and ONT data provide the test data.


You asked how it works. It doesn't replace good controls, but it is a 'trendy' workaround.

Two steps

  1. In deep learning (artificial neural networks, ANN; in this case you would use an RNN such as an LSTM) or machine learning (ML), e.g. random forests (which are MUCH easier to do), you supply a series of known values, i.e. CpG sites and their WGBS scores. Both methods will look for patterns between the observed DNA data and the CpG occurrence (the training set). Usually 60% of your data set is used in training, a separate 20% is used for parameterisation, and a final 20% is used for testing. Thus: how well can an ANN or ML model predict a CpG site based on WGBS data? Likewise, you would do the same for ONT. There is a caveat here which I discuss below; a minimal sketch of the split-and-train step follows after the next paragraph.

The idea is that if you give it a bit of DNA, it will assign whether it is CpG (see the caveat below).
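As a rough illustration, here is a minimal scikit-learn sketch of the 60/20/20 split with a random forest. The feature matrix X and the labels y are hypothetical placeholders standing in for your vectorised CpG contexts and binary WGBS calls.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((1000, 64))        # placeholder: vectorised CpG contexts
    y = rng.integers(0, 2, 1000)      # placeholder: binary WGBS calls

    # 60% training, 20% parameterisation (validation), 20% testing.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.6, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)

    # Parameterise on the validation split (here: only the number of trees)...
    models = [RandomForestClassifier(n_estimators=n, random_state=42)
                  .fit(X_train, y_train) for n in (100, 300, 500)]
    best = max(models, key=lambda m: m.score(X_val, y_val))

    # ...then report accuracy on the untouched test split.
    print("held-out accuracy:", best.score(X_test, y_test))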

  2. However, instead of testing WGBS against WGBS, you also test WGBS against ONT and ONT against WGBS, and the method with the highest 'accuracy' (predicted positives against true positives) is the 'better' method. This is because you already know which sites are ONT-positive and you ask a WGBS-trained algorithm whether it thinks they are positive. The relationship between observed CpG calls and 'true' (experimental) CpG calls is your accuracy score.
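Here is a hedged, self-contained sketch of that two-way comparison; again X, y_wgbs and y_ont are placeholders for vectorised CpG contexts and the two platforms' binary calls at the same sites.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    X = rng.random((1000, 64))         # placeholder: vectorised CpG contexts
    y_wgbs = rng.integers(0, 2, 1000)  # placeholder: WGBS calls per site
    y_ont = rng.integers(0, 2, 1000)   # placeholder: ONT calls, same sites

    train_idx, test_idx = train_test_split(
        np.arange(len(X)), test_size=0.2, random_state=42)

    wgbs_model = RandomForestClassifier(random_state=42).fit(
        X[train_idx], y_wgbs[train_idx])
    ont_model = RandomForestClassifier(random_state=42).fit(
        X[train_idx], y_ont[train_idx])

    # Score each model against the *other* platform's calls at held-out sites.
    print("WGBS-trained vs ONT calls:",
          accuracy_score(y_ont[test_idx], wgbs_model.predict(X[test_idx])))
    print("ONT-trained vs WGBS calls:",
          accuracy_score(y_wgbs[test_idx], ont_model.predict(X[test_idx])))

On this logic, whichever direction scores higher is the platform whose calls are more learnable from the sequence context.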

The caveat is that you need to supply the training set with sites that are known to be negative under WGBS or ONT. You need to think about the best strategy for doing this. If some of those negatives were positive in the other method, that would affect the outcome a lot.

There is an issue around vectorisation. It is often done using k-mers, and it is a headache: basically you have to turn a bit of DNA into numbers, and that in my opinion is the difficult part. In your case I'm not so sure it is that difficult. If you can get that bit correct and biologically meaningful, you are in business. My stuff is phylogeny based and I understand the relationship between biology (mutation) and numbers. In your case that needs thought; however, someone will have thought about it and solved it.
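For illustration only, one simple scheme (an assumption on my part, not a published method) is to vectorise the DNA context around each CpG site as k-mer counts over a fixed window; both the window size and k would need tuning.

    from itertools import product

    K = 3
    KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 64 features for k=3

    def kmer_vector(seq):
        """Count every overlapping k-mer in a DNA window (A/C/G/T expected)."""
        counts = dict.fromkeys(KMERS, 0)
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            if kmer in counts:       # windows containing N etc. are skipped
                counts[kmer] += 1
        return [counts[m] for m in KMERS]

    # e.g. a 21 bp window centred on a CpG of interest:
    print(kmer_vector("ACGTACGTACCGGTACGTACG"))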

As a personal statement, I would do this via random forests, because ANNs are difficult.

One such vectorisation method is given by Li et al. (2017); however, as they have used it, the method has flaws due to equal weighting of mutation frequencies. These flaws do not apply to your work (they only apply to tree building), so it would actually be a good method here.


The final thing is that the more experimental data you can feed it, the better. You don't have to be complicated about it: just create a new column in the input, e.g. alongside the vectorised DNA, and the algorithm will figure out whether it helps. You can't feed it too much data, because if a column doesn't help, the algorithm won't use it.
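For example, a sketch of what that looks like in practice, with hypothetical coverage and GC-content columns stacked next to the vectorised DNA:

    import numpy as np

    rng = np.random.default_rng(2)
    X_seq = rng.random((1000, 64))             # vectorised DNA, as sketched above
    coverage = rng.poisson(30, (1000, 1))      # placeholder: read depth per site
    gc = rng.random((1000, 1))                 # placeholder: local GC fraction

    X_full = np.hstack([X_seq, coverage, gc])  # extra columns sit beside the DNA
    print(X_full.shape)                        # -> (1000, 66)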

Answered by M__ on April 14, 2021
