Determining significance of a variable in a glm model

Bioinformatics Asked by RMM on September 30, 2021

How would one determine the significance of a variable in a glm model?

If I have, for example, a data frame like the one below, how would I determine whether the origin of the sample has a significant effect on the value? (The value is the number of enzymes capable of degrading the substrate, if that matters.)

Substrate    variable value origin
cellulose       M09    8    free
mannan          M12    2    free
glycogen        M65    2    free
chitin          M87    4    free
cellulose       M90    2    isolate
mannan          M78    1    isolate
glycogen        M21    4    isolate
chitin          M21    1    isolate

So far I have tried:

library(MASS)  # provides glm.nb()
mcomp = glm.nb(value ~ origin, data = my_data)
summary(mcomp)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9625  -0.9047  -0.9047   0.1212   3.5232  

              Estimate Std. Error z value Pr(>|z|)   
(Intercept)   -0.01657    0.06571  -0.252  0.80097   
originisolate -0.21911    0.08180  -2.679  0.00739 **
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.3418) family taken to be 1)

    Null deviance: 2053.5  on 2679  degrees of freedom
Residual deviance: 2046.3  on 2678  degrees of freedom
AIC: 6517.5

Number of Fisher Scoring iterations: 1

              Theta:  0.3418 
          Std. Err.:  0.0186 

 2 x log-likelihood:  -6511.4590 

So free becomes the intercept, and isolate is significantly different from it. Does this mean origin has a significant effect on the value?

Would the better approach be to do the following?:

mcomp = glm.nb(value ~ origin + Substrate, data = comb_data) 
              Df Sum Sq Mean Sq F value Pr(>F)    
origin         1     23   22.55   6.612 0.0102 *  
Substrate     44   1445   32.84   9.631 <2e-16 ***
Residuals   2634   8981    3.41                   
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If I understand correctly, this shows that both origin and substrate have an effect on value?

2 Answers

Neither method is better; it depends on what you want to test, i.e. what your question is.

anova() and aov() test each term collectively. For Substrate in your second model, for example, the null hypothesis is that all of its coefficients are zero, i.e. cellulose = 0, mannan = 0, ...

If the question is, "do the isolate samples have a higher value than the free samples?", then you can use your first model, where free is set as the reference and you test whether the effect of isolate is non-zero. Likewise, you can do this for substrate by setting one of the substrates as the reference. You can also do other pairwise comparisons using this model.
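
As a minimal sketch of how the reference level works (the simulated counts, the data frame `sim`, and the column `origin2` are assumptions for illustration, not the asker's data), `relevel()` changes which level is the baseline, and the Wald test for the other level can be read from the coefficient table:

```r
library(MASS)  # glm.nb()

# Simulated stand-in data (hypothetical; not the asker's data)
set.seed(2)
sim <- data.frame(origin = factor(rep(c("free", "isolate"), each = 200)))
sim$value <- rnbinom(nrow(sim), size = 1,
                     mu = ifelse(sim$origin == "free", 3, 2))

# Factor levels are alphabetical by default, so "free" is the reference
fit_free_ref <- glm.nb(value ~ origin, data = sim)
wald_isolate <- coef(summary(fit_free_ref))["originisolate", ]
print(wald_isolate)  # estimate, std. error, z value, p-value: isolate vs free

# Make "isolate" the reference instead
sim$origin2 <- relevel(sim$origin, ref = "isolate")
fit_iso_ref <- glm.nb(value ~ origin2, data = sim)
wald_free <- coef(summary(fit_iso_ref))["origin2free", ]
print(wald_free)     # same comparison, seen from the other side
```

With only two levels, both fits describe the same model, so the estimate simply changes sign. With more levels (as for Substrate), changing the reference changes which pairwise comparisons the coefficient table reports.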

If the question is, "does origin have a significant effect on value, after controlling for substrate?", then you can use your second model.
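
A sketch of how the collective test and the adjusted test could be run as likelihood-ratio tests on nested negative binomial fits. The simulated data, the effect sizes, and the helper `lr_test()` are assumptions made for illustration; they are not part of the original question.

```r
library(MASS)  # glm.nb(), plus logLik() and anova() methods for its fits

# Simulated stand-in data (hypothetical effect sizes; not the asker's data)
set.seed(1)
sim <- expand.grid(
  Substrate = c("cellulose", "mannan", "glycogen", "chitin"),
  origin    = c("free", "isolate"),
  replicate = 1:50
)
log_mu <- 0.8 - 0.3 * (sim$origin == "isolate") +
  0.5 * (sim$Substrate == "cellulose")
sim$value <- rnbinom(nrow(sim), size = 2, mu = exp(log_mu))

fit_origin    <- glm.nb(value ~ origin, data = sim)
fit_substrate <- glm.nb(value ~ Substrate, data = sim)
fit_both      <- glm.nb(value ~ origin + Substrate, data = sim)

# Hypothetical helper: likelihood-ratio test of a larger model
# against a smaller model nested inside it
lr_test <- function(small, large) {
  stat <- as.numeric(2 * (logLik(large) - logLik(small)))
  df   <- attr(logLik(large), "df") - attr(logLik(small), "df")
  c(LR = stat, df = df, p = pchisq(stat, df, lower.tail = FALSE))
}

# All Substrate coefficients at once (3 df with four substrates)
substrate_test <- lr_test(fit_origin, fit_both)
# origin after controlling for Substrate (1 df)
origin_test <- lr_test(fit_substrate, fit_both)
print(rbind(Substrate = substrate_test, origin = origin_test))

# The anova() method from MASS reports the same kind of comparison
print(anova(fit_substrate, fit_both))
```

Each test re-estimates theta in both models, which calling `anova()` on a single `glm.nb` fit does not do (MASS warns about this).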

Answered by StupidWolf on September 30, 2021

On a second viewing of the question: from what I can see, -0.22 as the coefficient of origin is a strong negative association, so yes, it has a major impact. It's not how I would have done it, but that looks to be the result.
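
For scale: glm.nb() uses a log link by default, so the coefficient is a log rate ratio. A small sketch using only the estimate and standard error printed in the question (the 1.96 multiplier for an approximate 95% Wald interval is the usual normal approximation and is an assumption here):

```r
# Values copied from the summary output in the question
estimate  <- -0.21911  # originisolate coefficient (log scale)
std_error <-  0.08180

rate_ratio <- exp(estimate)                               # isolate mean / free mean
ci         <- exp(estimate + c(-1.96, 1.96) * std_error)  # approximate 95% Wald interval
z_value    <- estimate / std_error
p_value    <- 2 * pnorm(-abs(z_value))                    # two-sided Wald p-value

print(round(c(rate_ratio = rate_ratio, lower = ci[1], upper = ci[2],
              z = z_value, p = p_value), 4))
```

exp(-0.219) is about 0.80, so the fitted mean count for isolate samples is roughly 20% lower than for free samples. Whether that counts as a large effect depends on the biology, not on the p-value.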

First viewing,

I'm going to throw my hat in here. We don't know what 'origin' is about; in any case, just put everything, i.e. each substrate and the origin, into the same regression. Check for low residuals and preferably do a Q-Q plot; transform your data if this doesn't look good.
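
One way such a residual check might look for a negative binomial fit, as a rough sketch. The data are simulated, and the use of deviance residuals and the output file name are assumptions; residuals from a count model are only approximately normal, so the plot is a rough diagnostic.

```r
library(MASS)  # glm.nb()

# Simulated stand-in data (hypothetical; not the asker's data)
set.seed(3)
sim <- data.frame(origin = factor(rep(c("free", "isolate"), each = 150)))
sim$value <- rnbinom(nrow(sim), size = 1,
                     mu = ifelse(sim$origin == "free", 3, 2))
fit <- glm.nb(value ~ origin, data = sim)

# Deviance residuals are a common choice for GLM diagnostics
res <- residuals(fit, type = "deviance")

# Write the Q-Q plot to a temporary PDF (file name is arbitrary)
plot_file <- file.path(tempdir(), "qq_deviance_residuals.pdf")
pdf(plot_file)
qqnorm(res, main = "Deviance residuals, negative binomial fit")
qqline(res)
dev.off()
```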

The key thing you are missing is the regression weights (coefficients); without them I can't say very much. If the weight for 'origin' is near zero, it has essentially no impact. If the weight for 'origin' is positive and larger than all the others ... I assume the 'substrates' are unevenly distributed between the 'origins'. If the weight for 'origin' is negative but still larger in magnitude than all the other weights, then it is adversely affecting the 'value' you are interested in.

I don't know the experiment, the biological system or really the 'substrate' assays, so I can't comment any further.

The two issues I have are:

  1. Doing an ANOVA on the output of a regression analysis doesn't make much sense to me. It is not something I would do, nor something in ML or GLM I've encountered.
  2. Are you doing pairwise substrate/origin calculations? I presume not, but just in case: this is not how GLM works.

Answered by M__ on September 30, 2021
