For some of you, Linear Regression is just a distant memory. For some, it is the gateway to a new wonderland, introduced early in your machine learning course. For me, it is supposed to be my bread and butter among other algorithms; however, I find a quick refresher helpful and dandy now and again. Ergo, this piece:
Allow me to journey back to middle school for a quick minute. Given two points in a Cartesian coordinate system, how many lines can one fit through them?
Only one. How is that line found? Given the general equation for a line:
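y = \theta_1 x + \theta_0 \qquad (1)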
We need to find values for theta_0 and theta_1. Two unknowns. Since we have two points, we can solve the two equations below to find theta_0 and theta_1:
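y_1 = \theta_1 x_1 + \theta_0

y_2 = \theta_1 x_2 + \theta_0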
But did you ever ask that blessed teacher why even bother fitting a line?
Humans like to conquer, which can only come after comprehension. We develop patterns to understand nature and the world. Then we make predictions. So when we fit that line, we know what pattern those points belong to. In conclusion: give me two data points and I’ll tell you the whole story, exactly how it happened.
What if we are given three points that are not on the same line? What about four? What about more?
The ideal straight line that would pass through all these points does not exist. No pair of values for theta_0 and theta_1 satisfies all those tens of equations at the same time. However, to quench our thirst for making predictions, we aim for the best line possible.
You may rightfully ask: why the obsession with straight lines? Doesn’t a curvy line that goes through all these points predict the pattern more accurately? True, but we don’t always have the computing resources or the patience to find that perfect fit. Sometimes it suffices to rig up a quick, dirty, yet acceptable solution. Sometimes the line that passes through all points, even the noise, is an overfit that misleads us.
To find this line, i.e., to find the parameters theta_0 and theta_1, we go back to the line equation (1) and define what the best line, or the fittest line, means. You can arbitrarily decide that the set of parameters resulting in the minimum sum of errors is the best. Consider f(x) to be our predicted line; the error for point i is defined as:
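e_i = y_i - f(x_i)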
So we choose parameters theta_0 and theta_1 such that the sum of these errors:
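\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)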
is minimized.
As another strategy, you could minimize the sum of distances from the points to the line, for instance the absolute vertical distance at each point:
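\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left| y_i - f(x_i) \right|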
You could probably think of ten other definitions for what the best line is. However, some will make more sense than others.
To land on a good definition, let’s mull the problem over from a probabilistic point of view. For this, we should work our way back from the end. Suppose we already know the parameters of the best-fitting line. Consider the equation below to be that best-fitting line:
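y_i = \theta_1 x_i + \theta_0 + \epsilon_i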
For each point (x_i, y_i), there is an error epsilon_i, which is the error of the best line possible. In other words, epsilon_i is the optimal e_i one can achieve. This epsilon accounts for things such as random noise, or the fact that a linear model cannot fit all the points one hundred percent correctly. The values for theta_0 and theta_1 are set such that for each x_i, the best possible estimate of y_i is achieved.
Let’s make a simplifying assumption that these error variables epsilon_i are independent and identically distributed (IID), following a Gaussian distribution with mean 0 and variance sigma². Meaning, the error at each point has nothing to do with the error at another. For example, when predicting how fast each student in a class runs, one student running faster than our prediction does not mean the next student will run slower than ours. The Gaussian distribution assumption is also understandable, as many random phenomena are well approximated by a Normal (Gaussian) distribution. So, put concisely, we have:
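\epsilon_i \sim \mathcal{N}(0, \sigma^2), \qquad p(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon_i^2}{2\sigma^2} \right)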
Now, only the best possible values for parameters theta_0 and theta_1 can result in the smallest errors epsilon_i. To discover these parameters, take one step back and rewrite the above equation to involve theta_1 and theta_0:
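p(y_i \mid x_i; \theta_0, \theta_1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta_1 x_i - \theta_0)^2}{2\sigma^2} \right)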
Note that theta_0 and theta_1 are fixed values and not random variables; we only condition the probability on x_i. Observing that the equation above describes the distribution of the best achievable error, we can interpret the story in these words:
theta_0 and theta_1 are set such that, given an input x_i, the probability of achieving output y_i is maximized. Otherwise put, the likelihood that the parameters (theta_0, theta_1) could generate our data is maximized.
Maximizing the probability of y_i is equivalent to maximizing the likelihood of (theta_0, theta_1). Calling it likelihood is just the convention in probability books. In other words, the probability above, when viewed as a function of y_i, is called the probability function, and when viewed as a function of (theta_0, theta_1), is called the likelihood function. It depends on which quantities you’d like to fix.
Now let’s take another step back and suppose we don’t know the optimal values for (theta_0, theta_1). So we need to choose (theta_0, theta_1) such that the probability of our data points (x_i, y_i) is maximized. This is called the principle of maximum likelihood. We now know that this means maximizing the likelihood function of (theta_0, theta_1). Further, you may remember that one can find the maximum (or minimum) points of a function by setting its derivatives with respect to its variables to zero. Searching for (theta_0, theta_1), we take the derivatives of the likelihood function with respect to these parameters. One last note before the calculations: we are interested in maximizing the probability for all (x, y) points, and not just a particular (x_i, y_i). The independence assumption made earlier allows us to calculate this overall probability (likelihood) as the product of the probabilities (likelihoods) of each point:
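L(\theta_0, \theta_1) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta_0, \theta_1) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta_1 x_i - \theta_0)^2}{2\sigma^2} \right)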
Since taking derivatives of a product of exponential terms is not straightforward, we maximize the log-likelihood instead, which is equivalent to maximizing the likelihood because the logarithm is a monotonically increasing function:
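l(\theta_0, \theta_1) = \log L(\theta_0, \theta_1) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \theta_1 x_i - \theta_0 \right)^2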
As you can see, maximizing log-likelihood is the same as minimizing the error term:
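J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \theta_1 x_i - \theta_0 \right)^2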
This is the familiar regression cost function (least mean squares), which translates to the sum of squares of the vertical offsets between the data points and the fitted line.
Normal Equations Method
There are a few approaches to finding the minimum of the loss term above. First, let’s take a look at how setting the derivatives equal to zero would work:
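\frac{\partial J}{\partial \theta_0} = -\sum_{i=1}^{n} \left( y_i - \theta_1 x_i - \theta_0 \right) = 0

\frac{\partial J}{\partial \theta_1} = -\sum_{i=1}^{n} x_i \left( y_i - \theta_1 x_i - \theta_0 \right) = 0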
Rearrange:
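\sum_{i=1}^{n} y_i = \theta_1 \sum_{i=1}^{n} x_i + n\,\theta_0

\sum_{i=1}^{n} x_i y_i = \theta_1 \sum_{i=1}^{n} x_i^2 + \theta_0 \sum_{i=1}^{n} x_i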
Solve for unknowns:
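\theta_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}

where \bar{x} and \bar{y} denote the means of the x_i and y_i.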
This is all well and good, but we rarely run into real-world scenarios where we deal with only two dimensions (and therefore only two equations and two unknowns). Often, a set of tens or hundreds of equations needs to be solved for tens or hundreds of unknowns. You can imagine how computationally intractable this can become.
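For our simple two-parameter case, though, the closed-form recipe above fits in a few lines. Here is a minimal NumPy sketch; the data is made up purely for illustration:

```python
import numpy as np

# Made-up data for illustration: points scattered around y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)

n = len(x)

# Closed-form solution of the two normal equations above.
theta_1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()

print(theta_0, theta_1)  # close to 1 and 2
```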
Gradient Descent
Instead of taking a definitive approach, we can turn to numerical techniques to arrive at a set of acceptable values for our unknowns. For example, gradient descent takes baby steps towards a minimum point (a global minimum here, since our loss function is a degree-2 convex function):
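\theta_0 := \theta_0 + \alpha \sum_{i=1}^{n} \left( y_i - \theta_1 x_i - \theta_0 \right)

\theta_1 := \theta_1 + \alpha \sum_{i=1}^{n} \left( y_i - \theta_1 x_i - \theta_0 \right) x_i

where alpha, the learning rate, sets the size of each baby step.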
Repeating these update calculations for a number of iterations brings us to an acceptable set of values.
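Here is a minimal sketch of these updates in NumPy; the learning rate and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-3, iterations=10_000):
    """Fit f(x) = theta_1 * x + theta_0 by batch gradient descent on the cost J."""
    theta_0 = theta_1 = 0.0
    for _ in range(iterations):
        error = y - (theta_1 * x + theta_0)   # residuals y_i - f(x_i)
        theta_0 += alpha * error.sum()        # baby step along -dJ/dtheta_0
        theta_1 += alpha * (error * x).sum()  # baby step along -dJ/dtheta_1
    return theta_0, theta_1

# Same made-up data as before: points scattered around y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)
print(gradient_descent(x, y))  # approaches the closed-form theta_0, theta_1
```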
Newton’s Method
Another numerical alternative is Newton’s method. This method is used less often due to the slow, unwieldy Hessian matrix calculations it requires for high-dimensional data. However, it has proven to be a swift solver for lower-dimensional data.
For the sake of simplicity, let’s focus on the two-dimensional Cartesian case where both input and output are real numbers (f: R -> R). Originally, Newton’s method estimates the zeros (or roots) of a function via the update formula below:
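x := x - \frac{f(x)}{f'(x)}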
In effect, here is what this update rule does:
At each iteration, the function is approximated by its tangent line, and the root of the tangent line replaces the old value of x. This is repeated until we arrive at a point where the root of the tangent line coincides with the root of the function, i.e., our target.
In our scenario, we are looking for the maximum of the log-likelihood function l (equivalently, the minimum of the error term J). As previously mentioned, this is the point where l' (the derivative, or gradient, function) is equal to zero. Therefore, we can simply replace f in the above formula with l':
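\theta := \theta - \frac{l'(\theta)}{l''(\theta)}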
You can at least intuitively agree that convergence is achieved faster with this method. As a matter of fact, a single iteration is enough to find the least-squares minimum we derived, since the cost is quadratic (not that hard to prove for yourself). The generalization of this formula for a multi-dimensional input vector would be:
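\theta := \theta - H^{-1} \nabla_{\theta}\, l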
In which H represents the Hessian matrix of l, and ∇ (the nabla symbol) denotes the gradient vector of l with respect to the vector of parameters theta. You can imagine how calculating the inverse of a large matrix can go haywire. Therefore, it may be better to go back to gradient descent for high-dimensional input features.
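For our two-parameter line, though, the Hessian is a tiny 2x2 matrix. Here is a minimal NumPy sketch, applying the Newton update to the least-squares cost J (so the sign conventions match the minimization); it shows the one-step convergence claimed above:

```python
import numpy as np

# Same made-up data as before: points scattered around y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)

X = np.column_stack([np.ones_like(x), x])  # design matrix: a column of ones, a column of x
theta = np.zeros(2)                        # [theta_0, theta_1]

# One Newton step: theta := theta - H^{-1} * gradient.
gradient = -X.T @ (y - X @ theta)  # gradient of J with respect to theta
hessian = X.T @ X                  # Hessian of J (constant, since J is quadratic)
theta -= np.linalg.solve(hessian, gradient)

print(theta)  # ≈ [1, 2] after a single step, matching the closed-form solution
```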
Closing notes:
1. I hereby would like to express my gratitude to Stanford CS229 for open-sourcing a generous portion of their material.
2. The many loopholes and flaws in my understanding would never be resolved without your questions. So I’d like to thank you in advance for substantiating what I have learned.