
Intuition Behind binomial (logistic) GLM

Cross Validated Asked on December 15, 2021

This is a question about logistic regression and how it relates to the Gaussian and binomial distributions.

model <- glm(target ~ x1, data = data, family = 'binomial')  # logistic regression
model <- glm(target ~ x1, data = data)                       # family defaults to gaussian
# type = 'response' is an argument of predict(), not of glm(), so it is dropped here

My understanding of binomial is that it is

theta=chance of success
z=trials ending in success
k=trials ending in failure
(theta^z)*(1-theta)^k

And something Gaussian is

σ = standard deviation
x = observed value
μ = mean
$Y = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2/(2\sigma^2)}$

So I understand how to fit a GLM in R, and I kind of understand what binomial and Gaussian mean, but I do not understand how either one relates to logistic regression, or how binomial and Gaussian differ in this context.

Question 1- Can someone explain the intuition behind how family='binomial' is used when building a model with GLM?

Question 2- Given that the shapes of a binomial distribution and a Gaussian distribution look very much the same (they both peak in the middle and gradually go down towards the ends), how does choosing either binomial or Gaussian lead to different models built from GLM?

2 Answers

Let's say your response variable is $Y$. In regression we want to model our response variable as a linear combination of our predictor variables ($X$), e.g. $Y=\beta_0 + \beta_1X + \epsilon$ or $E[Y]=\beta_0 + \beta_1X$. But what happens when our response variable lies only in $[0,1]$ (i.e. it is a probability, a proportion, or strictly 0 or 1)? Notice that $\beta_0 + \beta_1X$ may take any value on the real line! It could be 0 or 1 or 100 or even negative! If our response variable is strictly in $[0,1]$, it makes no sense to use a model that can take values outside of that range.
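A minimal sketch of the problem, using simulated data (the variable names, seed and coefficients are illustrative assumptions, not taken from the question): a straight line fitted to a 0/1 response gives predictions outside $[0,1]$ once $x$ is far enough from the centre.

```r
# Simulated binary response; all names and numbers are illustrative assumptions.
set.seed(1)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Ordinary linear regression treats y as an unbounded continuous response.
fit_lm <- lm(y ~ x)

# Far enough out, a line with nonzero slope leaves the interval [0, 1].
new_x <- data.frame(x = c(-10, 0, 10))
pred_lm <- predict(fit_lm, newdata = new_x)
print(pred_lm)
```

The predictions at the two extreme values of `x` fall below 0 and above 1, which cannot be read as probabilities.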

Therefore, when we want to model a probability or a proportion, we instead model a function of the expected value of $Y$. For example, $g(E[Y])=\beta_0 + \beta_1X$. This function $g$ is called the link function.

$g(E[Y]) = E[Y]$ Identity link, used in linear regression.

$g(P(Y=1)) = \log\dfrac{P(Y=1)}{1-P(Y=1)}$ Logit link, used in logistic regression. Notice that here we are modeling the probability $P(Y=1)$, which is also the expected value $E[Y]$ of a 0/1 response. Then we can solve for what we want: $P(Y=1) = \dfrac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$

$g(E[Y]) = \log E[Y]$ Log link, used in Poisson regression.
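The logit link and its inverse can be tried directly in R. This is only a sketch: `plogis()` and `qlogis()` are base R's logistic distribution function and quantile function, which coincide with the inverse logit and the logit; the vector `eta` is a made-up set of linear predictor values.

```r
# Linear predictor values can be any real number (illustrative values).
eta <- c(-100, -2, 0, 2, 100)

# Inverse logit: e^eta / (1 + e^eta). plogis() computes this in a stable way.
p <- plogis(eta)
print(p)

# The logit link maps probabilities back to the linear predictor scale.
p_mid <- plogis(c(-2, 0, 2))
print(qlogis(p_mid))

# The same pair of functions is stored in the family object that glm() uses.
lnk <- binomial()$linkfun   # logit
inv <- binomial()$linkinv   # inverse logit
print(lnk(0.5))
print(inv(0))
```

However extreme the linear predictor is, the inverse link returns a value between 0 and 1, which is the property the identity link lacks.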

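Question 1: In R's glm() function, the family argument specifies the assumed distribution of the response together with a link function. family = binomial uses the logit link by default, which gives logistic regression; family = gaussian uses the identity link by default, which gives ordinary linear regression.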
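A rough sketch of what the choice of family changes in practice, assuming simulated data (the data frame `dat`, the seed and the true coefficients are made up for illustration; only `target` and `x1` mirror the names in the question):

```r
# Each family object carries a default link function.
print(binomial()$link)   # "logit"
print(gaussian()$link)   # "identity"

# Simulated data; the coefficients -0.5 and 1.5 are arbitrary assumptions.
set.seed(42)
dat <- data.frame(x1 = rnorm(500))
dat$target <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.5 * dat$x1))

fit_bin <- glm(target ~ x1, data = dat, family = binomial)   # logistic regression
fit_gau <- glm(target ~ x1, data = dat, family = gaussian)   # same fit as lm()

# type = "response" is a predict() argument: it returns values on the scale of E[Y].
new_dat <- data.frame(x1 = c(-5, 0, 5))
p_bin <- predict(fit_bin, newdata = new_dat, type = "response")
p_gau <- predict(fit_gau, newdata = new_dat, type = "response")
print(rbind(binomial = p_bin, gaussian = p_gau))
```

The binomial fit keeps every prediction strictly between 0 and 1, while the Gaussian fit is a straight line in `x1` and leaves that range at the extreme values.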

Question 2: First of all, it is not true that a binomial distribution always looks like a normal distribution. If $X \sim \text{Binomial}(n, p=0.1)$, the distribution is skewed (markedly so for small $n$) and does not look bell-shaped.
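The skew is easy to see numerically. The sketch below assumes $n = 10$ purely for illustration and compares $p = 0.1$ with the symmetric case $p = 0.5$; `binom_skewness` is a helper defined here, not a base R function.

```r
# Probability mass functions for n = 10 (n is an illustrative choice).
k <- 0:10
pmf_skew <- dbinom(k, size = 10, prob = 0.1)
pmf_sym  <- dbinom(k, size = 10, prob = 0.5)
print(round(pmf_skew, 3))
print(round(pmf_sym, 3))

# Skewness of Binomial(n, p) is (1 - 2p) / sqrt(n p (1 - p)).
binom_skewness <- function(n, p) (1 - 2 * p) / sqrt(n * p * (1 - p))
print(binom_skewness(10, 0.1))   # positive: long right tail
print(binom_skewness(10, 0.5))   # zero: symmetric
```

With $p = 0.1$ most of the mass sits at 0 and 1 with a tail to the right, whereas with $p = 0.5$ the mass is symmetric around 5.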

Answered by bdeonovic on December 15, 2021

You use logistic regression when your response variable is binary (0/1) or a proportion (e.g. 10/30), so you can't relate it to a Gaussian distribution, which is continuous and unbounded. That's why you specify family="binomial" to perform logistic regression in R and family="gaussian" to perform linear regression.
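For the proportion case, glm() accepts the response in two equivalent forms. The sketch below uses a small made-up data set (`grp`, `dose`, `successes` and `trials` are hypothetical names and numbers, not from the question):

```r
# Hypothetical grouped data: number of successes out of 30 trials per dose level.
grp <- data.frame(
  dose      = c(1, 2, 3, 4, 5),
  successes = c(2, 6, 10, 19, 26),
  trials    = c(30, 30, 30, 30, 30)
)

# Form 1: two-column response of (successes, failures).
fit_cbind <- glm(cbind(successes, trials - successes) ~ dose,
                 data = grp, family = binomial)

# Form 2: observed proportion as the response, number of trials as weights.
fit_prop <- glm(successes / trials ~ dose, weights = trials,
                data = grp, family = binomial)

print(coef(fit_cbind))
print(coef(fit_prop))
```

Both calls describe the same binomial likelihood, so the estimated coefficients agree.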

Answered by Aghila on December 15, 2021
