Question about multiple linear regression: why and how does it work?

by user1337   Last Updated May 16, 2019 03:19 AM

I know this question is quite simple and maybe quite naive as well, but I would like to get some help. The general linear model can be expressed as \begin{align*} \textbf{Y} = \textbf{X}\beta + \epsilon \end{align*}

where $\textbf{Y}\sim\mathcal{N}(\textbf{X}\beta,\sigma^{2}\textbf{I})$ represents the random component, $\textbf{X}\beta$ represents the systematic component, and the link function is the identity, $g(\mu) = \mu = \textbf{X}\beta$.
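
To make the notation concrete, this is how I read the same model component by component. The symbols $p$ (number of predictors) and $\textbf{x}_{i}^{\top}$ (the $i$-th row of $\textbf{X}$) are my own additions and do not appear above, so please correct me if this restatement is off: \begin{align*} Y_{i} &= \beta_{0} + \beta_{1}x_{i1} + \cdots + \beta_{p}x_{ip} + \epsilon_{i}, \qquad \epsilon_{i} \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^{2}), \\ \mathbb{E}[Y_{i}] &= \mu_{i} = \textbf{x}_{i}^{\top}\beta, \qquad \operatorname{Var}(Y_{i}) = \sigma^{2}. \end{align*}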

My question is: why do we assume the response variable $\textbf{Y} = (Y_{1},Y_{2},\ldots,Y_{n})$ equals the mean $\mu = \textbf{X}\beta$ plus an error $\epsilon$, which is normally distributed? Moreover, how do we interpret the mean of each component $Y_{i}$? Since each $Y_{i}$ is an observation from the random variable whose distribution describes the data, why should they have different means? Does each $Y_{i}$ represent a "person" from the target population?

Here is an example. Suppose $\mu_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2}$, where $\mu_{i}$ is the average income of the population living in city $i$, $1\leq i\leq 3$, and the $x_{ij}$ are features that influence its value. Then, most probably, we will obtain different values for the means $\mu_{1}$, $\mu_{2}$ and $\mu_{3}$. Why is it reasonable to state that $Y_{i} = \mu_{i} + \epsilon_{i}$, where each $\epsilon_{i}$ is normally distributed?
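
To check my understanding of the three-city example, I tried a small simulation. This is only a sketch: the coefficient values, the feature values and $\sigma$ are made up for illustration, and the names `beta`, `X`, `sigma` and `n_rep` are my own, not taken from any reference.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical coefficients (beta_0, beta_1, beta_2); illustrative values only.
beta = np.array([10.0, 2.0, -1.0])

# Hypothetical design matrix: one row per city, columns = (intercept, x_i1, x_i2).
X = np.array([
    [1.0, 3.0, 5.0],  # city 1
    [1.0, 6.0, 2.0],  # city 2
    [1.0, 1.0, 8.0],  # city 3
])

sigma = 0.5  # assumed common error standard deviation

# Systematic component: mu_i = beta_0 + beta_1 * x_i1 + beta_2 * x_i2
mu = X @ beta  # mu = [11., 20., 4.]

# Random component: Y_i = mu_i + eps_i with eps_i ~ N(0, sigma^2).
# Each row of Y is one simulated data set (Y_1, Y_2, Y_3).
n_rep = 10_000
Y = mu + sigma * rng.standard_normal((n_rep, 3))

# Averaging over replicates should approximately recover mu_1, mu_2, mu_3.
print("mu           :", mu)
print("sample means :", Y.mean(axis=0))
print("sample stds  :", Y.std(axis=0, ddof=1))
```

In this sketch the three means differ only because the rows of `X` differ, while the noise has the same distribution for every city. If I read the model correctly, that is exactly what $Y_{i} = \mu_{i} + \epsilon_{i}$ is saying.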

Any help is appreciated. Thanks in advance!


