
Which likelihood function is used in linear regression?

Cross Validated Asked by floyd on December 18, 2021

When deriving the maximum likelihood estimate for a linear regression, we start from a likelihood function. Does it matter which of these two forms we use?
$P(y|x,w)$
$P(y,x|w)$
All pages that I read on the internet use the first one.
I found that $P(y,x|w)$ is equal to $P(y|x,w) \cdot P(x)$,
so maximizing $P(y,x|w)$ with respect to $w$ is the same as maximizing $P(y|x,w)$, because $P(x)$ does not depend on $w$.
The second form looks better because it reads as "the probability of the data ($x$ and $y$) given the parameter", but the first form doesn't show that.
Is my point correct or not? Is there any difference?

3 Answers

Linear regression is about how $y$, the outcome or response variable, varies with $x$. The model equation is $$ Y_i = \beta_0 + \beta_1 x_{i1} + \dotsm + \beta_p x_{ip} + \epsilon_i, $$ say, and how $x$ is distributed doesn't by itself give information about the $\beta$'s. That's why your second form of likelihood is irrelevant, and so is not used.
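
As a minimal numerical sketch of that point (NumPy only; the coefficients, sample sizes, and input distributions below are illustrative assumptions, not from the answer): the same $\beta$'s are recovered whether the inputs are drawn from a uniform or an exponential distribution, because the fit only uses the conditional relationship of $y$ given $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 3.0, 0.5   # illustrative "true" parameters

def fit_ols(x, y):
    """Ordinary least squares: return (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Same regression relationship, two very different input distributions:
for name, x in [("uniform x", rng.uniform(0, 10, 1000)),
                ("exponential x", rng.exponential(2.0, 1000))]:
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    print(name, fit_ols(x, y))   # both recover estimates near (2, 3)
```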

See Definition and delimitation of regression model for the meaning of *regression model*, also What are the differences between stochastic vs. fixed regressors in linear regression model? and especially my answer here: What is the difference between conditioning on regressors vs. treating them as fixed?

Answered by kjetil b halvorsen on December 18, 2021

As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.

Typically for regression problems we prefer to write the version with the conditioning on $x$, because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. Succinctly, we only want to model how $y$ responds to $x$, and we don't care about how $x$ itself is distributed.

To see why, consider the model $Y = f(X) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. We want the estimated function $\hat{f}$ we produce to be as close to $f$ as possible (on average, and measured by e.g. least squares).

So suppose we have known data $(x_1, f(x_1)), ..., (x_N, f(x_N))$. Ultimately we just want to pull $\hat{f}(x_1), ..., \hat{f}(x_N)$ close to $f(x_1), ..., f(x_N)$. The distribution $p(x)$ describing how the inputs are scattered doesn't matter, as long as $\hat{f}$ is close to $f$ on those values ($f$ being how $y$ "responds" to $x$).


Here's how the math works out. From an MLE standpoint the goal is to get our likelihood $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize the expected prediction error over $f$, $$\min_f \text{EPE}(f) = \min_f \mathbb{E}(Y - f(X))^2,$$ then, omitting some computation, the minimizing $f$ is $$f(x) = \mathbb{E}(Y | X = x),$$ so for least squares loss the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.
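
A small numerical sketch of the factorization above (the slope grid, the input distribution, and the fixed intercept and noise scale are illustrative assumptions): the joint log-likelihood is the conditional log-likelihood plus $\sum_i \log p(x_i)$, which does not involve $w$, so both criteria peak at the same $w$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, 200)                 # assumed input distribution p(x)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, 200)   # assumed true slope 2.5, intercept 1

w_grid = np.linspace(0.0, 5.0, 501)           # candidate slopes (intercept and sigma held fixed)
cond_ll = np.array([stats.norm.logpdf(y, loc=1.0 + w * x, scale=1.0).sum()
                    for w in w_grid])         # sum_i log p(y_i | x_i, w)
joint_ll = cond_ll + stats.norm.logpdf(x, loc=5.0, scale=2.0).sum()   # + log p(x): constant in w

print(w_grid[np.argmax(cond_ll)], w_grid[np.argmax(joint_ll)])        # same maximizing w
```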

However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$---we also do not use least squares loss in those.
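
As a hedged illustration of that contrast (assuming scikit-learn is available; the two-Gaussian data below are made up for the example): LDA, a generative model, estimates the class-conditional distribution of $x$ (class means and a shared covariance), while logistic regression, a discriminative model, only fits $p(y|x)$.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Two made-up Gaussian classes; only meant to show what each model estimates.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)   # generative: estimates p(x | class) and priors
logreg = LogisticRegression().fit(X, y)        # discriminative: estimates p(y | x) directly

print(lda.means_)      # per-class means, i.e. part of a model of the x-distribution
print(logreg.coef_)    # only the conditional decision boundary
```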

Answered by Drew N on December 18, 2021

That's a good question since the difference is a bit subtle - hopefully this helps.

The difference between calculating $p(y|x,w)$ and $p(y,x|w)$ lies in whether you model $X$ as being fixed or random. In the first, you want the probability of $Y$ conditioned on $X$ and $w$ (where $w$ is usually denoted $\beta$ in statistical settings), whereas in the second, you're looking for the joint probability of $X$ and $Y$ conditioned on $w$.

Usually, in simple linear regression,

$$Y = \beta_0 + \beta_1 X + \epsilon$$

you model $X$ as being fixed and independent of $\epsilon$, which follows a $N(0,\sigma^2)$ distribution. That is, $Y$ is modeled as a linear function of some fixed input ($X$) plus some random noise ($\epsilon$). Therefore, it makes sense to model it as $p(y|X,w)$ in this setting, since $X$ is fixed or modeled as constant, which is usually denoted explicitly as $X=x$.

Mathematically, when $X$ is fixed, $p(X) = 1$ as it is constant. Therefore

$$p(y,X|w) = p(y|X,w)p(X) \implies p(y,X|w) = p(y|X,w)$$

If $X$ is not constant and given, then $p(X)$ is no longer 1 and the above doesn't hold, so it all comes down to whether you model $X$ as being random or fixed.
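
Here is a minimal sketch under those assumptions (fixed $X$, Gaussian noise; the particular coefficients and the optimizer choice are illustrative): maximizing the conditional likelihood $p(y|X,w)$ numerically gives essentially the same coefficients as ordinary least squares.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)                       # fixed design points (X treated as constant)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, x.size)   # illustrative beta_0 = 1.5, beta_1 = 0.8

def neg_log_lik(params):
    """Negative conditional log-likelihood: -sum_i log p(y_i | x_i, beta, sigma)."""
    b0, b1, log_sigma = params
    return -stats.norm.logpdf(y, loc=b0 + b1 * x, scale=np.exp(log_sigma)).sum()

mle = optimize.minimize(neg_log_lik, x0=[0.0, 0.0, 0.0]).x
ols = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
print(mle[:2], ols)   # the MLE of (beta_0, beta_1) matches the least-squares fit
```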

Answered by Samir Rachid Zaim on December 18, 2021
