In order to explain the differences between alternative approaches to estimating the parameters of a model, let’s take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. The illustration below shall serve as a quick reminder to recall the different components of a simple linear regression model:
In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n .
Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:
- Solving the model parameters analytically (closed-form equations)
- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton’s Method, Simplex Method, etc.)
1) Normal Equations (closed-form solution)
The closed-form solution may (should) be preferred for "smaller" datasets — if computing (a "costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse of XTX may not exist (the matrix is non-invertible or singular, e.g., in case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:
where y is the response variable, x is an m -dimensional sample vector, and w is the weight vector (vector of coefficients). Note that w0 represents the y-axis intercept of the model and therefore x0=1 . Using the closed-form solution (normal equation), we compute the weights of the model as follows:
2) Gradient Descent (GD)
Using the Gradient Decent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= pass over the training dataset).
The cost function J(⋅) , the sum of squared errors (SSE), can be written as:
The magnitude and direction of the weight update is computed by taking a step in the opposite direction of the cost gradient
where η is the learning rate. The weights are then updated after each epoch via the following update rule:
where Δw is a vector that contains the weight updates of each weight coefficient w , which are computed as follows:
Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), and each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows: