
Too many levels in a categorical variable in a GLM

Cross Validated Asked on December 15, 2021

I have 187 observations. My response variable is CPUE (catch per unit of effort), and my predictors are temperature, chlorophyll, depth, and bottom type; bottom type is categorical. My goal is to find out which of these variables matter most for the capture of the particular species I am analyzing.
I am struggling with this result, in which the null model comes out as the most parsimonious. So I am wondering whether there is a problem with adding a variable with so many levels to the model, and why a (+) symbol appears in the output whenever the categorical variable is in the model. What does it mean?
It also seems strange to me that no model containing the categorical variable was ever selected. This intuition is based in part on a regression tree that I fitted to the same data, in which the most explanatory variable was precisely the categorical variable that seems to have no relevance in the GLM.

The categorical variable in the model is a factor with 22 levels. I have searched and seen the suggestion to transform the levels into dummy variables, but I don't think that is the way out, since I would have to create 21 more columns and insert them into the model…
Note: I have already checked that the variable is a factor and not numeric.

mod0 <- glm(nCPUE ~ 1, data = bonaci, family=gaussian) # null model
mod1 <- glm(nCPUE ~ Depth, data = bonaci, family=gaussian) #depth
mod2 <- glm(nCPUE ~ Chlorophyll, data = bonaci, family=gaussian) #chlorophyll
mod3 <- glm(nCPUE ~ BottomType, data = bonaci, family=gaussian) #bottom type
mod4 <- glm(nCPUE ~ SST, data = bonaci, family=gaussian) #temperature
mod5 <- glm(nCPUE ~ Depth + Chlorophyll, data = bonaci, family=gaussian) 
mod6 <- glm(nCPUE ~ Depth + BottomType, data = bonaci, family=gaussian)
mod7 <- glm(nCPUE ~ Depth + SST, data = bonaci, family=gaussian)
mod8 <- glm(nCPUE ~ Chlorophyll + BottomType, data = bonaci, family=gaussian)
mod9 <- glm(nCPUE ~ Chlorophyll + SST, data = bonaci, family=gaussian)
mod10 <- glm(nCPUE ~ BottomType + SST, data = bonaci, family=gaussian)
mod11 <- glm(nCPUE ~ Depth * Chlorophyll, data = bonaci, family=gaussian)
mod12 <- glm(nCPUE ~ Depth * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod13 <- glm(nCPUE ~ Depth * SST, data = bonaci, family=gaussian)
mod14 <- glm(nCPUE ~ Chlorophyll * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod15 <- glm(nCPUE ~ Chlorophyll * SST, data = bonaci, family=gaussian)
mod16 <- glm(nCPUE ~ BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod17 <- glm(nCPUE ~ Depth + Chlorophyll + BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod18 <- glm(nCPUE ~ Depth + Chlorophyll + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod19 <- glm(nCPUE ~ Chlorophyll + BottomType + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod20 <- glm(nCPUE ~ Depth * Chlorophyll * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod21 <- glm(nCPUE ~ Depth * Chlorophyll * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod22 <- glm(nCPUE ~ Chlorophyll * BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod23 <- glm(nCPUE ~ Depth + Chlorophyll + BottomType + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod24 <- glm(nCPUE ~ Depth * Chlorophyll * BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")

library(MuMIn)
out.put <- model.sel(mod0, mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8, mod9, mod10, mod11, mod12, mod13, mod14, mod15, mod16, mod17, mod18, mod19, mod20, mod21, mod22, mod23, mod24)

out.put
    

[Image: model selection output; not reproduced]

One Answer

If a regression includes a categorical independent variable with N levels, N - 1 variables are needed to represent it in the model. R (and other software) can create these for you, and there are various coding schemes (e.g. dummy coding, effect coding, Helmert coding and so on), but all of them use N - 1 variables.
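As a minimal sketch of that point, the base R function model.matrix shows the columns R builds from a factor automatically, so there is no need to create them by hand. The factor `bottom` and its four level names are made up for illustration and are not the asker's data.

```r
# Hypothetical 4-level factor standing in for a variable like BottomType
bottom <- factor(c("sand", "mud", "rock", "reef", "sand", "rock"))

# Default treatment (dummy) coding: an intercept plus N - 1 indicator columns
X_dummy <- model.matrix(~ bottom)
colnames(X_dummy)
ncol(X_dummy) - 1   # number of columns representing the factor: N - 1

# Effect (sum-to-zero) and Helmert coding use the same number of columns,
# only the values inside the columns differ
X_effect  <- model.matrix(~ bottom, contrasts.arg = list(bottom = "contr.sum"))
X_helmert <- model.matrix(~ bottom, contrasts.arg = list(bottom = "contr.helmert"))
ncol(X_effect) - 1
ncol(X_helmert) - 1
```

With a 22-level factor the same call would produce 21 such columns, which is what glm does internally when a factor appears in the formula.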

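Questions about coding are off topic here. You can consult the package documentation to see what the + signs are (in a MuMIn selection table, a "+" marks a factor term that is present in the model, since a factor has no single coefficient to print). Model selection is tricky and can easily lead to misleading results. I haven't looked into how MuMIn performs its selection, but be wary and do some research on this.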
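One way to see why a many-level factor can lose to the null model in an information-criterion ranking is to count the parameters it adds. The sketch below uses simulated data with the same sample size (187) and number of levels (22) as the question. The variable names and values are hypothetical, and it uses base R's AIC rather than the criterion MuMIn reports, so it only illustrates the penalty, not the asker's actual output.

```r
set.seed(1)
n <- 187

# Hypothetical 22-level factor; rep() guarantees every level occurs
bottom <- factor(rep(paste0("type", 1:22), length.out = n))
y <- rnorm(n)   # simulated response, unrelated to the factor by construction

m0 <- glm(y ~ 1, family = gaussian)        # null model
m1 <- glm(y ~ bottom, family = gaussian)   # factor-only model

# The factor adds 21 coefficients on top of the intercept
extra_params <- length(coef(m1)) - length(coef(m0))
extra_params

# AIC charges 2 units per extra parameter, so the factor model must
# improve the fit considerably to rank above the null model
AIC(m0, m1)

# A single F test for the factor as a whole, instead of 21 separate tests
anova(m0, m1, test = "F")
```

A factor can therefore carry real signal in a few of its levels and still be ranked below the null model, because the penalty is paid for all 21 parameters at once.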

As to why a tree model can give very different results from a regression when there is a categorical variable with many levels: one reason is that a tree considers every possible split of the levels into two groups, while a regression estimates a parameter for each level. This means you need to be very careful with trees and prune them carefully. Good tree programs will help you do this.
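To give a sense of scale, the sketch below counts the binary groupings available to a tree for a 22-level factor and then finds the best single split on simulated pure-noise data. Everything here is hypothetical (names, data, seed). It relies on the standard result that, for a numeric response and squared error, ordering the levels by their mean response and scanning the cut points finds the best two-group split. It is not the algorithm of any particular tree package.

```r
k <- 22
n_splits <- 2^(k - 1) - 1   # distinct ways to divide k levels into two groups
n_params <- k - 1           # parameters a regression spends on the same factor
n_splits
n_params

set.seed(42)
n <- 187
bottom <- factor(rep(paste0("type", 1:k), length.out = n))
y <- rnorm(n)               # pure noise: the factor has no real effect

# Sum of squared errors around the group mean
sse <- function(v) sum((v - mean(v))^2)

# Order levels by mean response, then try each of the k - 1 cut points
level_means <- sort(tapply(y, bottom, mean))
split_sse <- sapply(1:(k - 1), function(i) {
  left <- bottom %in% names(level_means)[1:i]
  sse(y[left]) + sse(y[!left])
})

# Share of the total variation "explained" by the best split, on noise alone
best_reduction <- 1 - min(split_sse) / sse(y)
best_reduction
```

Any reduction found here is produced by the search over groupings rather than by a real effect, which is why an unpruned tree can rank a many-level factor first even when a regression finds little support for it.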

Answered by Peter Flom on December 15, 2021
