Determining significance of a variable in a glm model

Question

How would one determine the significance of a variable in a glm model?
If, for example, I have a data frame like the one below, how would I determine whether the origin of the sample has a significant effect on the value? (value is the number of enzymes capable of degrading the substrate, if that matters.)
Substrate    variable value origin
cellulose       M09    8    free
mannan          M12    2    free
glycogen        M65    2    free
chitin          M87    4    free
cellulose       M90    2    isolate
mannan          M78    1    isolate
glycogen        M21    4    isolate
chitin          M21    1    isolate

So far I have tried:
library(MASS)  # provides glm.nb()
mcomp = glm.nb(value ~ origin, data = my_data)

summary(mcomp)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9625  -0.9047  -0.9047   0.1212   3.5232

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept)   -0.01657    0.06571  -0.252  0.80097   
originisolate -0.21911    0.08180  -2.679  0.00739 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.3418) family taken to be 1)

Null deviance: 2053.5  on 2679  degrees of freedom
Residual deviance: 2046.3  on 2678  degrees of freedom
AIC: 6517.5

Number of Fisher Scoring iterations: 1

Theta:  0.3418 
          Std. Err.:  0.0186

2 x log-likelihood:  -6511.4590

So free becomes the intercept, and isolate is significantly different from it. Does this mean origin has a significant effect on the value?
Would a better approach be the following?
mcomp = glm.nb(value ~ origin + Substrate, data = comb_data) 
summary(aov(mcomp))
              Df Sum Sq Mean Sq F value Pr(>F)    
origin         1     23   22.55   6.612 0.0102 *  
Substrate     44   1445   32.84   9.631 <2e-16 ***
Residuals   2634   8981    3.41                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If I understand correctly, this shows that both origin and Substrate have an effect on value?

StupidWolf · Answer

Neither method is better; it depends on what you want to test, i.e. what your question is.
anova() and aov() test each term as a whole. In your example with Substrate, the null hypothesis is that all of its coefficients are zero, meaning cellulose = 0, mannan = 0, ....
If the question is, "do the isolate samples have a higher value than the free samples?", then you can use your first model, where free is set as the reference and you test whether the effect of isolate is non-zero. Likewise, you can do this for Substrate by setting one of the substrates as your reference. You can also do other pairwise comparisons using this model.
If the question is, "does origin have a significant effect on value, after controlling for substrate?", then you can use your second model.
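As a rough sketch of the two routes above, assuming simulated data: the data frame `sim`, the seed, the effect sizes and the extra column `origin2` are all made up for illustration and are not the asker's data. The first part changes the reference level with relevel(). The second part tests origin after adjusting for Substrate by comparing two nested glm.nb() fits with anova(), which for negative binomial fits from MASS reports a likelihood ratio test.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
# Simulated stand-in for the real table; every number here is invented.
n <- 600
sim <- data.frame(
  origin    = factor(sample(c("free", "isolate"), n, replace = TRUE)),
  Substrate = factor(sample(c("cellulose", "mannan", "glycogen", "chitin"),
                            n, replace = TRUE))
)
mu <- exp(0.8 - 0.3 * (sim$origin == "isolate") +
          0.4 * (sim$Substrate == "chitin"))
sim$value <- rnbinom(n, size = 1.5, mu = mu)

# Question 1: isolate vs free. "free" is the reference because factor
# levels are sorted alphabetically by default.
m1 <- glm.nb(value ~ origin, data = sim)
print(coef(summary(m1))["originisolate", ])

# Same comparison with "isolate" as the reference: the sign of the
# coefficient flips, the test itself is unchanged.
sim$origin2 <- relevel(sim$origin, ref = "isolate")
m1b <- glm.nb(value ~ origin2, data = sim)

# Question 2: effect of origin after controlling for Substrate,
# tested by comparing nested models with and without origin.
m_reduced <- glm.nb(value ~ Substrate, data = sim)
m_full    <- glm.nb(value ~ Substrate + origin, data = sim)
lrt <- anova(m_reduced, m_full)
print(lrt)
```

The same nested-model comparison, dropping Substrate instead of origin, tests all Substrate coefficients at once.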

M__ · Answer

Second viewing of the question: from what I can see, -0.22 as the coefficient of origin is a strong negative association, so yes, it has a major impact. It's not how I would have done it, but that looks to be the result.
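For a sense of scale: glm.nb() uses a log link by default, so the coefficient is a difference in log expected counts, and exponentiating it gives a ratio of means. A small sketch using only the estimate and standard error printed in the question (the 95% Wald interval is my own addition, an approximation rather than something taken from the output):

```r
# Numbers copied from the summary() output in the question
est <- -0.21911  # originisolate estimate, on the log scale
se  <-  0.08180  # its standard error

# Ratio of expected counts, isolate relative to free
rate_ratio <- exp(est)

# Approximate 95% Wald interval, back-transformed to the ratio scale
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

print(round(c(rate_ratio = rate_ratio,
              lower = wald_ci[1], upper = wald_ci[2]), 3))
```

The ratio comes out at about 0.80, i.e. the expected value for isolate is roughly 20% lower than for free.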

First viewing:
I'm going to throw my hat in here. We don't know what 'origin' is about; anyway, just throw everything, i.e. each substrate and the origin, into the same regression calculation. Check for low residuals and preferably do a Q-Q plot; transform your data if this doesn't look good.
The key thing you are missing is your regression weights; without those I couldn't say very much. If the regression weight for 'origin' is near zero, then it has essentially no impact. If the regression weight of 'origin' is positive and greater than everything else ... I assume there are skewed distributions of 'substrates' between the 'origins'. If the regression weight of 'origin' is negative but still greater in magnitude than all other regression weights, then it is adversely affecting the 'value' you are seeking.
I don't know the experiment, the biological system or really the 'substrate' assays, so I can't comment any further.
The two issues I have are:

1. Doing an ANOVA on the output of a regression analysis doesn't make much sense to me. It is not something I would do, nor something I've encountered in ML or GLMs.
2. Are you doing pairwise substrate/origin calculations? I presume not, but just in case: this is not how a GLM works.
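The "first viewing" suggestion (everything in one regression, then look at the regression weights and the residuals) could look roughly like this. It is only a sketch on simulated data: `sim`, the seed and the group means are invented. Also note that deviance residuals from a count model are only approximately normal, so the Q-Q plot is a rough check.

```r
library(MASS)  # provides glm.nb()

set.seed(2)
# Made-up data standing in for the real table
n <- 400
sim <- data.frame(
  origin    = factor(sample(c("free", "isolate"), n, replace = TRUE)),
  Substrate = factor(sample(c("cellulose", "mannan", "glycogen", "chitin"),
                            n, replace = TRUE))
)
sim$value <- rnbinom(n, size = 1, mu = ifelse(sim$origin == "free", 3, 2))

# Origin and every substrate in the same regression
fit <- glm.nb(value ~ origin + Substrate, data = sim)

# Regression weights (coefficients), on the log scale
print(coef(fit))

# Deviance residuals and a normal Q-Q plot of them
res <- residuals(fit, type = "deviance")
qqnorm(res)
qqline(res)
```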
