# Linear regression review: optimal solution and its variance

We review classical linear regression, deriving the least-squares solution from first principles. We also give a derivation for the fit’s covariance matrix — showing that the coefficient estimates are most precise along directions that have been sampled over a large range of values (the high variance directions, a la PCA). Finally, we show that equivalent results follow from a Bayesian approach.

### Introduction

Here, we consider the problem of fitting a linear model to $N$ data points of the form $(\vec{x}_i, y_i),$ where the $\{\vec{x}_i\}$ are vectors of predictors that sit in an $L$-dimensional space and the $\{y_i\}$ are the values we wish to predict given the $\{\vec{x}_i\}$. The linear approximation is defined by a set of coefficients $\{\beta_j\}$, so that
\begin{eqnarray}
\hat{y}_i \equiv \sum_j \beta_j x_{i,j} = \vec{\beta} \cdot \vec{x}_i. \tag{1} \label{1}
\end{eqnarray}
We seek the $\vec{\beta}$ that minimizes the sum of squared $y$ errors,
\begin{eqnarray} \tag{2} \label{2}
J = \sum_i \left ( y_i - \hat{y}_i \right)^2 = \sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2.
\end{eqnarray}
It turns out that this is a problem where one can easily derive an analytic expression for the optimal solution. It’s also possible to derive an expression for the variance in the optimal solution — that is, how much we might expect the optimal parameter estimates to change were we to start with some other $N$ data points instead. Here, we review these derivations, give a simple interpretation to the theoretical variance, and finally show that the same results follow from a Bayesian approach.

### Optimal solution

We seek the coefficient vector that minimizes (\ref{2}). We can find this by differentiating the cost function with respect to $\vec{\beta}$ and setting the result to zero. This gives,
\begin{eqnarray} \tag{3}
\partial_{\beta_j} J = -2 \sum_i \left (y_i - \sum_k \beta_k x_{i,k} \right) x_{i,j} = 0.
\end{eqnarray}
We next define the $N \times L$ matrix $X$ so that $X_{i,j} = x_{i,j}$, the $j$-th component of $\vec{x}_i$. Plugging this into the above, we obtain
\begin{eqnarray}
\partial_{\beta_j} J &=& -2 \sum_i \left (y_i - \sum_k \beta_k X_{i,k} \right) X_{i,j} = 0 \\
&\to& \left ( \vec{y} - \vec{\beta} \cdot X^T \right ) \cdot X = 0. \tag{4}
\end{eqnarray}
Rearranging gives
\begin{eqnarray}
\vec{\beta} \cdot X^T X = \vec{y} \cdot X \to
\vec{\beta} = \vec{y} \cdot X \cdot \left( X^T X \right)^{-1}. \tag{5} \label{optimal}
\end{eqnarray}
This is the squared-error-minimizing solution.
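As a quick numerical check of equation (5), here is a minimal numpy sketch (the synthetic data and variable names are our own); it solves the normal equations and compares the result against numpy’s built-in least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N samples of an L-dimensional predictor.
N, L = 100, 3
X = rng.normal(size=(N, L))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true  # noiseless, so the fit should recover beta_true exactly

# Eq. (5): beta = y X (X^T X)^{-1}. Solving the normal equations
# is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: numpy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_hat, beta_lstsq)
assert np.allclose(beta_hat, beta_true)
```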

### Parameter covariance matrix

Now, when one carries out a linear fit to some data, the best line often does not go straight through all of the data. Here, we consider the case where the reason for the discrepancy is not that the posited linear form is incorrect, but that there are some hidden variables not measured that the $y$-values also depend on. Assuming our data points represent random samples over these hidden variables, we can model their effect as adding a random noise term to the form (\ref{1}), so that
\begin{eqnarray}\tag{6} \label{noise}
y_i = \vec{\beta} \cdot \vec{x}_i + \epsilon_i,
\end{eqnarray}
with $\langle \epsilon_i \rangle =0$ and $\langle \epsilon_i^2 \rangle = \sigma^2$.

The $\epsilon_i$ terms above propagate and generate some uncertainty in the optimal $\vec{\beta}$: Each realization of the noise will generate slightly different $y$ values, causing the optimal $\vec{\beta}$ values to change slightly. To estimate the magnitude of this effect, we can calculate the covariance matrix of $\vec{\beta}$. At fixed (constant) $X$, plugging in (\ref{optimal}) for $\vec{\beta}$ gives
\begin{eqnarray}
\text{cov}(\vec{\beta}, \vec{\beta}) &=& \text{cov} \left(\left( X^T X \right)^{-1,T} \cdot X^T \cdot \vec{y}^T, \vec{y} \cdot X \cdot \left( X^T X \right)^{-1} \right) \\
&=& \left( X^T X \right)^{-1,T} \cdot X^T \cdot \text{cov}(\vec{y}^T, \vec{y} ) \cdot X \cdot \left( X^T X \right)^{-1} \\
&=& \sigma^2 \left( X^T X \right)^{-1,T} \cdot X^T X \cdot \left( X^T X \right)^{-1} \\
&=& \sigma^2 \left( X^T X \right)^{-1}. \tag{7} \label{cov}
\end{eqnarray}
In the third line here, note that we have assumed that the $\epsilon_i$ are independent, so that $\text{cov}(\vec{y},\vec{y}) = \sigma^2 I$. In the last line, we have used the symmetry of $X^T X$, which implies that $\left( X^T X \right)^{-1,T} = \left( X^T X \right)^{-1}$.

To get a feel for the significance of (\ref{cov}), it is helpful to consider the case where the average $x$ values are zero. In this case,
\begin{eqnarray}
\left( X^T X \right)_{i,j} &=& \sum_k X_{k,i} X_{k,j} \equiv N \times \langle x_i x_j \rangle. \tag{8}
\end{eqnarray}
That is, $X^T X$ is proportional to the covariance matrix of our $x$ values. This matrix is real and symmetric, and thus has an orthonormal set of eigenvectors. The eigenvalue corresponding to the $k$-th eigenvector gives the variance of our data set’s component values along that direction — details can be found in our article on PCA. This implies a simple interpretation of (\ref{cov}): The variance in the $\vec{\beta}$ coefficients will be lowest for predictors that have been sampled over a wide range of values (e.g., $x_1$ in the figure shown) and highest for predictors that have only been sampled over a relatively small range ($x_2$ in the figure).
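Equation (7) is easy to verify by simulation. The sketch below (a toy example of our own) fixes a design matrix $X$, draws many realizations of the noise in equation (6), refits the coefficients each time, and compares the empirical coefficient covariance to $\sigma^2 (X^T X)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design matrix X; noise scale sigma as in eq. (6).
N, L, sigma = 200, 2, 0.5
X = rng.normal(size=(N, L))
X -= X.mean(axis=0)  # center the predictors, as assumed in eq. (8)
beta_true = np.array([2.0, -1.0])

# Analytic covariance from eq. (7).
cov_analytic = sigma**2 * np.linalg.inv(X.T @ X)

# Empirical covariance over many noise realizations at fixed X.
fits = []
for _ in range(20000):
    y = X @ beta_true + sigma * rng.normal(size=N)
    fits.append(np.linalg.solve(X.T @ X, X.T @ y))
cov_empirical = np.cov(np.array(fits), rowvar=False)

assert np.allclose(cov_empirical, cov_analytic, rtol=0.1, atol=1e-4)
```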

### Bayesian analysis

The final thing we wish to do here is consider the problem from a Bayesian perspective, using a flat prior on the $\vec{\beta}$. In this case, assuming a Gaussian form for the $\epsilon_i$ in (\ref{noise}) gives
\begin{eqnarray}\tag{9} \label{9}
p(\vec{\beta} \vert \{y_i\}) \propto p(\{y_i\} \vert \vec{\beta}) p(\vec{\beta}) = \mathcal{N} \exp \left [ -\frac{1}{2 \sigma^2}\sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2\right].
\end{eqnarray}
Notice that this posterior form for $\vec{\beta}$ is also Gaussian, and is centered about the solution (\ref{optimal}), which we write here as $\hat{\vec{\beta}}$. Formally, up to terms independent of $\vec{\beta}$, we can write the exponent here in the form
\begin{eqnarray}
-\frac{1}{2 \sigma^2}\sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2 \equiv -\frac{1}{2} \left( \vec{\beta} - \hat{\vec{\beta}} \right)^T \cdot \frac{1}{\Sigma^2} \cdot \left( \vec{\beta} - \hat{\vec{\beta}} \right) + \text{const}, \tag{10}
\end{eqnarray}
where $\Sigma^2$ is the covariance matrix for the components of $\vec{\beta}$, as implied by the posterior form (\ref{9}). We can get the components of its inverse by twice differentiating the negative of the exponent in (\ref{9}). This gives,
\begin{eqnarray}
\left ( \frac{1}{\Sigma^2}\right)_{jk} &=& \frac{1}{2 \sigma^2} \partial_{\beta_j} \partial_{\beta_k} \sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right)^2 \\
&=& -\frac{1}{\sigma^2}\partial_{\beta_j} \sum_i \left (y_i - \vec{\beta} \cdot \vec{x}_i \right) x_{i,k} \\
&=& \frac{1}{\sigma^2} \sum_i x_{i,j} x_{i,k} = \frac{1}{\sigma^2} (X^T X)_{jk}. \tag{11}
\end{eqnarray}
In other words, $\Sigma^2 = \sigma^2 (X^T X)^{-1}$, in agreement with the classical expression (\ref{cov}).
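Equation (11) identifies $1/\Sigma^2$ with the Hessian of the negative log posterior. As an independent check (a sketch on synthetic data of our own), we can form that Hessian by finite differences and compare it to $X^T X / \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

N, L, sigma = 50, 3, 0.3
X = rng.normal(size=(N, L))
y = X @ np.array([1.0, 0.0, -1.0]) + sigma * rng.normal(size=N)

def neg_log_post(beta):
    # Negative of the exponent in eq. (9), up to an additive constant.
    r = y - X @ beta
    return 0.5 * (r @ r) / sigma**2

# Numerical Hessian via central finite differences.
beta0 = np.zeros(L)
h = 1e-4
H = np.zeros((L, L))
for j in range(L):
    for k in range(L):
        ej, ek = np.eye(L)[j] * h, np.eye(L)[k] * h
        H[j, k] = (neg_log_post(beta0 + ej + ek) - neg_log_post(beta0 + ej - ek)
                   - neg_log_post(beta0 - ej + ek) + neg_log_post(beta0 - ej - ek)) / (4 * h**2)

# Eq. (11): the Hessian should equal (X^T X) / sigma^2.
assert np.allclose(H, X.T @ X / sigma**2, rtol=1e-3)
```

Because the negative log posterior is exactly quadratic in $\vec{\beta}$, the finite-difference Hessian is the same at every expansion point, up to rounding error.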

### Summary

In summary, we’ve gone through one quick derivation of the linear fit solution that minimizes the sum of squared $y$ errors for a given set of data. We’ve also considered the variance of this solution, showing that the resulting form is closely related to the principal components of the predictor variables sampled. The covariance solution (\ref{cov}) tells us that all parameters have standard deviations that decrease like $1/\sqrt{N}$, with $N$ the number of samples. However, the predictors that are sampled over wider ranges always have coefficient estimates that are more precise. This is due to the fact that sampling over many different values allows one to get a better read on how the underlying function being fit varies with a predictor. The final thing we’ve shown is that the Bayesian approach, under a Gaussian noise model and flat prior, gives the same results: the posterior that results is centered about the classical solution, and has a covariance matrix equal to that obtained by the classical approach.
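The $1/\sqrt{N}$ scaling noted above is easy to see numerically. A minimal sketch, assuming a single predictor drawn i.i.d. with unit variance (so that $X^T X$ grows linearly in $N$):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0

# Eq. (7): coefficient standard deviations shrink like 1/sqrt(N)
# when the predictor is sampled i.i.d., since X^T X ~ N * var(x).
sds = []
for N in (10_000, 40_000, 160_000):
    X = rng.normal(size=(N, 1))  # single unit-variance predictor
    sd = np.sqrt(sigma**2 * np.linalg.inv(X.T @ X))[0, 0]
    sds.append(sd)

# Quadrupling N should roughly halve the standard deviation.
assert sds[0] > sds[1] > sds[2]
assert 1.8 < sds[0] / sds[1] < 2.2
```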