Block wise protein imputation

Question

I am currently working on a dataset that contains 50 samples (10 samples * 5 blocks). The features of the data set are:

The data is perfectly balanced between blocks, with equal treatment representation in each block. Each block contains 2 control samples (CTL) that are pools of all other treatment samples. These are technical replicates that are identical across multiplexes.

There is a lot of missing data (~40%), and most of it is missing by block, meaning that a protein is present in some blocks but not in others.

When looking at the expression of these missing proteins in the other blocks, they tend to be around the median value for the experiment, which suggests to me that these values are missing completely at random.
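To make the "missing by block" pattern concrete, here is a minimal sketch of how the missingness could be tabulated per protein and per multiplex. The toy table, the protein names, and the helper functions `missing_by_block` and `classify` are all hypothetical and only illustrate the idea; the real data would have 10 samples in each of the 5 blocks.

```python
from collections import defaultdict

# Hypothetical toy data: sample number -> multiplex (a shrunken version of
# the real 10-samples-per-block design).
sample_block = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 7: "B", 8: "B"}

# Hypothetical protein -> {sample number: log-intensity, or None if missing}.
intensities = {
    "P1": {1: 10.1, 2: 10.3, 3: 9.8, 4: 10.0, 5: 10.2, 6: 9.9, 7: 10.4, 8: 10.1},
    "P2": {1: 8.2, 2: 8.0, 3: 8.4, 4: 8.1, 5: None, 6: None, 7: None, 8: None},
    "P3": {1: 12.0, 2: None, 3: 11.7, 4: 11.9, 5: 12.1, 6: 12.3, 7: None, 8: 11.8},
}


def missing_by_block(values, sample_block):
    """Return {block: (n_missing, n_samples)} for one protein."""
    counts = defaultdict(lambda: [0, 0])
    for sample, block in sample_block.items():
        counts[block][1] += 1
        if values.get(sample) is None:
            counts[block][0] += 1
    return {block: tuple(c) for block, c in counts.items()}


def classify(per_block):
    """Label a protein by its missingness pattern across blocks."""
    fully_missing = [b for b, (miss, n) in per_block.items() if miss == n]
    partly_missing = [b for b, (miss, n) in per_block.items() if 0 < miss < n]
    if fully_missing and partly_missing:
        return "mixed"
    if fully_missing:
        return "block-wise"   # absent from every sample of at least one block
    if partly_missing:
        return "sporadic"     # scattered gaps inside otherwise observed blocks
    return "complete"


patterns = {}
for protein, values in intensities.items():
    per_block = missing_by_block(values, sample_block)
    patterns[protein] = classify(per_block)
    print(protein, per_block, patterns[protein])
```

Counting how many proteins fall into each label would show how much of the ~40% is truly block-wise rather than scattered within blocks.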

I could proceed using complete cases, but then I'm missing out on a large portion of the data. I am considering using an imputation method, but I'm concerned that, due to the nature of the missing data, I will just be introducing a block effect into the data.
Any input on potential methods for imputation would be greatly appreciated.
Experimental design:
SampleNumber    Multiplex   Treatment
1   A   CTL
2   A   CTL
3   A   TMT1
4   A   TMT1
5   A   TMT2
6   A   TMT2
7   A   TMT3
8   A   TMT3
9   A   TMT4
10  A   TMT4
11  B   CTL
12  B   CTL
13  B   TMT1
14  B   TMT1
15  B   TMT2
16  B   TMT2
17  B   TMT3
18  B   TMT3
19  B   TMT4
20  B   TMT4
21  C   CTL
22  C   CTL
23  C   TMT1
24  C   TMT1
25  C   TMT2
26  C   TMT2
27  C   TMT3
28  C   TMT3
29  C   TMT4
30  C   TMT4
31  D   CTL
32  D   CTL
33  D   TMT1
34  D   TMT1
35  D   TMT2
36  D   TMT2
37  D   TMT3
38  D   TMT3
39  D   TMT4
40  D   TMT4
41  E   CTL
42  E   CTL
43  E   TMT1
44  E   TMT1
45  E   TMT2
46  E   TMT2
47  E   TMT3
48  E   TMT3
49  E   TMT4
50  E   TMT4

M__ · Answer

40% missing data is huge. Missing data analysis is complicated and depends on the underlying distribution. If the data set is periodic, then the periodicity of the missing data needs to be taken into account. Non-periodic data can be handled using regression; I think SciPy has an automated method for this, but the volume of missing data makes this approach complex. If you can bring the missing data down to 20% by compartmentalising the data set, that would be the starting point.
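One possible reading of "compartmentalising" (an assumption, not something stated above) is to restrict the analysis to proteins that are observed in at least a minimum number of blocks and then check how much missing data remains. A rough sketch with made-up values follows; the table, the column-to-block mapping, and the helpers `missing_fraction`, `blocks_observed` and `keep_proteins` are hypothetical.

```python
# Hypothetical layout: each row holds one protein's values, and
# block_of_column[i] gives the multiplex of column i.
block_of_column = ["A", "A", "B", "B", "C", "C"]

table = {
    "P1": [10.0, 10.2, 9.9, 10.1, 10.3, 10.0],
    "P2": [7.9, 8.1, None, None, None, None],
    "P3": [5.0, 5.2, 5.1, 4.9, None, None],
    "P4": [9.0, None, 9.2, 9.1, 9.3, 9.0],
}


def missing_fraction(table):
    """Fraction of missing (None) cells over the whole table."""
    cells = [v for row in table.values() for v in row]
    return sum(v is None for v in cells) / len(cells)


def blocks_observed(row, block_of_column):
    """Set of blocks in which the protein has at least one observed value."""
    return {block_of_column[i] for i, v in enumerate(row) if v is not None}


def keep_proteins(table, block_of_column, min_blocks):
    """Keep only proteins observed in at least `min_blocks` blocks."""
    return {
        protein: row
        for protein, row in table.items()
        if len(blocks_observed(row, block_of_column)) >= min_blocks
    }


before = missing_fraction(table)
filtered = keep_proteins(table, block_of_column, min_blocks=2)
after = missing_fraction(filtered)
print(f"missing before: {before:.1%}, after: {after:.1%}, "
      f"proteins kept: {len(filtered)} of {len(table)}")
```

This is filtering rather than imputation: the missing fraction drops only because proteins are discarded, so the threshold trades coverage against the amount left to impute.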

Answered by M__ on March 23, 2021

Will Fondrie · Answer

This looks like a TMT proteomics experiment. What tools/pipeline did you use to arrive at your quantitative values?
There are a variety of tools that are useful for analyzing this type of data and supply appropriate imputation methods. If you are comfortable programming in R, one I would suggest is the MSstatsTMT R package: http://msstats.org/msstatstmt/
That being said, any way you can improve the upstream analysis to reduce the number of missing values will be extremely beneficial.
