# Determining significance of a variable in a glm model

Bioinformatics Asked by RMM on September 30, 2021

How would one determine the significance of a variable in a glm model?

If, for example, I have a data frame like the one below, how would I determine whether the origin of the sample has a significant effect on the value? (The value is the number of enzymes capable of degrading the substrate, if that matters.)

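| Substrate | variable | value | origin  |
|-----------|----------|-------|---------|
| cellulose | M09      | 8     | free    |
| mannan    | M12      | 2     | free    |
| glycogen  | M65      | 2     | free    |
| chitin    | M87      | 4     | free    |
| cellulose | M90      | 2     | isolate |
| mannan    | M78      | 1     | isolate |
| glycogen  | M21      | 4     | isolate |
| chitin    | M21      | 1     | isolate |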


So far I have tried:

library(MASS)  # glm.nb() comes from the MASS package
mcomp = glm.nb(value ~ origin, data = my_data)

summary(mcomp)
Deviance Residuals:
Min       1Q   Median       3Q      Max
-0.9625  -0.9047  -0.9047   0.1212   3.5232

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.01657    0.06571  -0.252  0.80097
originisolate -0.21911    0.08180  -2.679  0.00739 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.3418) family taken to be 1)

Null deviance: 2053.5  on 2679  degrees of freedom
Residual deviance: 2046.3  on 2678  degrees of freedom
AIC: 6517.5

Number of Fisher Scoring iterations: 1

Theta:  0.3418
Std. Err.:  0.0186

2 x log-likelihood:  -6511.4590


So free becomes the intercept, and isolate is then significantly different from that. Does this mean origin has a significant effect on the value?

Would a better approach be the following?

mcomp = glm.nb(value ~ origin + Substrate, data = comb_data)
summary(aov(mcomp))
Df Sum Sq Mean Sq F value Pr(>F)
origin         1     23   22.55   6.612 0.0102 *
Substrate     44   1445   32.84   9.631 <2e-16 ***
Residuals   2634   8981    3.41
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


If I understand correctly, this shows that both origin and substrate have an effect on value?

Neither method is better; it depends on what you want to test, that is, on what your question is.

Functions such as anova() or aov() test the terms collectively. For example, in your model with Substrate, the null hypothesis is that the Substrate coefficients are all zero, meaning cellulose = 0, mannan = 0, and so on.
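As a rough sketch of what testing a term collectively can look like for a negative binomial fit, the example below compares two nested glm.nb models with a likelihood ratio test via anova(). The data frame `sim`, its effect sizes, and the model names `full` and `no_substrate` are all hypothetical stand-ins, not the asker's data.

```r
library(MASS)  # provides glm.nb() and an anova() method for its fits

set.seed(1)
n <- 400

# Simulated stand-in for the real data (all values are made up)
sim <- data.frame(
  origin    = factor(rep(c("free", "isolate"), each = n / 2)),
  Substrate = factor(rep(c("cellulose", "mannan", "glycogen", "chitin"),
                         times = n / 4))
)
sub_effect <- c(cellulose = 0.5, mannan = 0, glycogen = 0.2, chitin = -0.2)
mu <- exp(1 + sub_effect[as.character(sim$Substrate)] -
            0.3 * (sim$origin == "isolate"))
sim$value <- rnbinom(n, size = 2, mu = mu)

# Model with and without the Substrate term
full         <- glm.nb(value ~ origin + Substrate, data = sim)
no_substrate <- glm.nb(value ~ origin, data = sim)

# Likelihood ratio test: are all Substrate coefficients zero at once?
lrt <- anova(no_substrate, full)
print(lrt)
```

The single p-value in the second row refers to the whole Substrate term, not to any one substrate. Swapping which term is dropped gives the corresponding collective test for origin.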

If the question is, "do the isolate samples have a different value than the free samples?", then you can use your first model, where free is set as the reference and you test whether the effect of isolate is non-zero. Likewise, you can do this for substrate by setting one of the substrates as your reference. You can also do other pairwise comparisons using this model.
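To illustrate how the reference level determines which comparison the coefficient table reports, here is a minimal sketch using relevel(). The simulated data frame `sim` and the column `origin_iso` are assumptions made for the example only.

```r
library(MASS)  # provides glm.nb()

set.seed(2)

# Simulated stand-in for the real data (all values are made up)
sim <- data.frame(origin = factor(rep(c("free", "isolate"), each = 200)))
sim$value <- rnbinom(400, size = 2,
                     mu = ifelse(sim$origin == "isolate", 3, 4))

# Default: levels are sorted alphabetically, so "free" is the reference
fit_free <- glm.nb(value ~ origin, data = sim)

# Same model with "isolate" as the reference level instead
sim$origin_iso <- relevel(sim$origin, ref = "isolate")
fit_iso <- glm.nb(value ~ origin_iso, data = sim)

# Coefficient tables: estimate, standard error, z value, p-value
print(summary(fit_free)$coefficients)
print(summary(fit_iso)$coefficients)
```

With only two levels, the second fit reports the same contrast with the sign flipped. With a many-level factor such as Substrate, each choice of reference gives a different set of pairwise comparisons against that reference.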

If the question is, "does origin have a significant effect on value, after controlling for substrate?", then you can use your second model.

Answered by StupidWolf on September 30, 2021

On a second viewing of the question: from what I can see, -0.22 as the coefficient of origin is a strong negative association, so yes, it has a major impact. It's not how I would have done it, but that looks to be the result.
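One way to put that coefficient on the scale of the counts, assuming the default log link of glm.nb, is to exponentiate it. The sketch below uses only the estimate and standard error printed in the question; the Wald-type interval is an approximation added for illustration.

```r
# Values copied from the summary() output in the question
b_isolate  <- -0.21911   # coefficient for originisolate (log scale)
se_isolate <-  0.08180   # its standard error

# With a log link, exp(coefficient) is the ratio of expected counts
# (isolate relative to free)
rate_ratio <- exp(b_isolate)

# Approximate 95% Wald interval, computed on the log scale
ci <- exp(b_isolate + c(-1, 1) * qnorm(0.975) * se_isolate)

print(round(rate_ratio, 3))
print(round(ci, 3))
```

A ratio of about 0.80 would mean the expected count for isolate samples is roughly 20% lower than for free samples under this model.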

First viewing,

I'm going to throw my hat in here. We don't know what 'origin' is about, but in any case just put everything, i.e. each substrate and the origin, into the same regression calculation. Check for low residuals and preferably do a Q-Q plot; transform your data if this doesn't look good.

The key thing you are missing is your regression weights (coefficients); without them I couldn't say very much. If the regression weight for 'origin' is near zero, then it has essentially no impact. If the regression weight of 'origin' is positive and greater than everything else, I would assume there are skewed distributions of 'substrates' between the 'origins'. If the regression weight of 'origin' is negative but still greater in magnitude than all the other regression weights, then it is adversely affecting the 'value' you are seeking.
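A minimal sketch of the residual checks suggested here, applied to a negative binomial fit. The simulated data frame `sim` is a hypothetical stand-in, and using deviance residuals for the Q-Q plot is one common choice rather than the only option.

```r
library(MASS)  # provides glm.nb()

set.seed(3)
n <- 400

# Simulated stand-in for the real data (all values are made up)
sim <- data.frame(
  origin    = factor(rep(c("free", "isolate"), each = n / 2)),
  Substrate = factor(rep(c("cellulose", "mannan", "glycogen", "chitin"),
                         times = n / 4))
)
sim$value <- rnbinom(n, size = 2,
                     mu = ifelse(sim$origin == "isolate", 3, 4))

# Origin and substrate in the same regression
fit <- glm.nb(value ~ origin + Substrate, data = sim)

# Deviance residuals: Q-Q plot and residuals against fitted values
dev_res <- residuals(fit, type = "deviance")
qqnorm(dev_res)
qqline(dev_res)
plot(fitted(fit), dev_res,
     xlab = "Fitted value", ylab = "Deviance residual")
abline(h = 0, lty = 2)
```

For count data with small means, the Q-Q plot will not look perfectly straight even for a well-specified model, so it is better read as a rough check than as a strict test.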

I don't know the experiment, the biological system or really the 'substrate' assays, so I can't comment any further.

The two issues I have are:

1. Doing an ANOVA on the output of a regression analysis doesn't make much sense to me. It is not something I would do, nor something I've encountered in ML or GLM.
2. Are you doing pairwise substrate/origin calculations? I presume not, but just in case: this is not how a GLM works.

Answered by M__ on September 30, 2021
