### Rohan #6: Follow-up: statistical interpretation of logistic regression

This is the sixth entry in myjourney to extend my knowledge of Artificial Intelligence in the year of 2016. Learn more about my motives in thisintroduction post.

This blog post assumes sound knowledge of the Logistic Regression algorithm. You can learn more about said algorithm in myfirst blog post.

It’s been a while since my last article (work), but with summer approaching Lenny and I are going full speed, so expect articles on interesting machine learning algorithms (eg. recurrent and convolutional neural networks), our recent work with using machine learning to analyze fMRI scans, and new research papers from institution/companies like Google’s DeepMind we are going to learn about at our trip to ICML 2016 next month. Keep your eyes peeled for some more philosophical write-ups on AI, too!

Today I want to build on my very first article about logistic regression. In particular, I want to discuss the statistical/probabilistic interpretation of logistic regression, which I felt was **missing** from explanations and lectures provided by certain online courses like Andrew Ng’s Machine Learning one (which is still wonderful). I will discuss the intuition behind the logistic regression model formulated in the previousarticle.

Recall:

With the sigmoid function looking like the following:

That is, we model the probability of a dichotomous variable (ie. a cancer tumor’s status) having the state of **class = 1** using the sigmoid function — which asymptotes at 0 and 1, shown by limits — and an input of a weighted sum of inputs **x** using **learned** weights **w (w^Transpose⋅x)**

We will undergo some algebraic manipulation to arrive at an important step that can be interpreted qualitatively (and not just quantitively). First, let’s isolate our weighted sum:

In our final step, we showed that the conditional probability of the cancer tumor being malignant, divided by one subtracted that probability, is equal to the **e** to the power of our weighted sum. Now, **1 – P(y = 1|x;w)** can also be interpreted as **P(y = 0|x;w)** , since **P(y = 1|x;w) + P(y = 0|x;w) = 1** for a dichotomous variable with two possible classes. The probability of **y = 1** occurring divided by the probability of **y = 0** occurring is known as the “odds-ratio”, which is a ratio of how likely an event is to occur to how likely it will *not* occur. For example, if **P(y=1|x;w) = 0.8** , then **P(y=0|x;w) = 1 – 0.8 = 0.2** . Hence the odds-ratio of **y = 1** occurring is **0.8/0.2 = 4** . A log-odds ratio is the logarithm (or natural logarithm) of the odds-ratio value. Let’s take the natural logarithm of both sides to achieve such result:

Here we show our weighted sum is equal to the log-odds ratio. The sigmoid function is actually derived from the log-odds ratio. To show this, I will first derive the domain and range for the log-odds ratio.

Hence:

Thus, the sigmoid function is actually the inverse of the log-odds ratio, which works nicely as the domain and range switch over. As stated before, our linear regression/weighted sum of the input features is equal to the log-odds of a malignant tumor occurring. In classical linear regression, as the value of a weight/coefficient **w_n** increases by 1, the predicted output increases by **x_n** . In conditional probability modelling, as **w_n** increases by 1, the log-odds ratio of having a malignant tumor increases by **x_n.** Now, the interpretation of using weights is that each feature will have a different importance. If, based on the data patterns/correlations, it is found that the feature of a patient’s height (for example) has no bearing on or correlation with the malignancy of a tumor, then the optimal weight for that feature would be zero. The magnitude of a weight hence indicates the importance of the feature it corresponds to, since it will contribute greatly to the weighted sum and override the other contributions to lead to a logistic output closer to 1, increasing the log-odds ratio greatly.

Next, we will talk about the “decision boundary”. The decision boundary, a function that separates the two classes, was presented in the previous article. Here, we will make it more concrete. When **w^Transpose⋅x** **= 0** , then **P(y = 1|x;w) = 0.5** due to the sigmoid function’s y-intercept as mentioned before. Thus, when **w^Transpose⋅x** **= 0** we are at a “turning point” as we switch from a discrete classification of 0 to 1.

As shown in the first figure, our separating function/decision boundary is a hyperplane of the equation **w^Transpose⋅x = 0** as it discriminates between two different regions of classification. On the left hand side, where **w^Transpose⋅x < 0** , the output of **P(y = 1|x;w) < 0.5** thus we predict benign or false. At the line and to the right hand side of the line, we predict malignant or true. This also implies that the log-odds ratio must be at least 0 (where odds-ratio would be 1) for a classification of malignant, and negative for a classification of benign.

If the weights **w** are not optimal, then the separation would be inaccurate, and this could be seen visually in a 2 or 3 dimensional example. We need to “shape” it correctly to create a nice decision boundary. Hopefully, this provides (visual) intuition into why we need to optimize on **w** for our modelling task.

The way we will end up selecting/estimating our parameters **w** is by devising an “error function” (which describes the error created by any arbitrary values for **w** ) and then minimizing on that function. Obviously, the minimum point **(w, ε)** of an error function implies the lowest possible error **ε** (error will almost always exist for a practical, noisy dataset), and hence the most optimal **w** that can be used for future predictions. Alternatively, the function may describe *accuracy* , in which case we look for the *maximum* value. We will now derive the latter type of (accuracy) function.

The Binomial distribution is a discrete (not continuous like a normal distribution) probability distribution of exactly **k** successes (binary, ie. success could be interpreted as **y = 1** ) in **n** independent trials, with each trial having a success probability of **p** and hence failure probability of **1 – p** , and it is a case of the Poisson distribution . The Binomial distribution is then defined as **B(n, p) = nCr⋅p^k⋅(1-p)^(n-k)** . Here, **nCr** computes the number of arrangements of the different trials (a distribution that disregards order), **p^k** computes the probability of achieving the demanded **k** successes in all trials, and **(1-p)^(n-k)** computes the probability of the rest of the trials being failures. In our case, however, each trial has a separate probability of success, conditioned on our weights **w** and input **x** . Therefore, we can treat them as separate individual trials, known as Bernoulli trials, which are trials with two possible, mutually exclusive outcomes: success ( **y = 1** ) and failure ( **y = 0** ). As mentioned before, with a Bernoulli trial, the probability of success **P(y = 1|x;w)** and failure **P(y = 0|x;w)** sum to 1, and the probability of failure hence equals **1 – P(y = 1|x;w)** . The outputs **y** belong to a Bernoulli distribution, which is a special case of the Binomial distribution where **n = 1** (one trial) in **B(n, p)** . Hence, **y ~ B(1, p)** . The number of successes **k** is either 0 (a failure) or 1 (a success), **k ∈ {0, 1}** .

Since **1Ck, k ∈ {0, 1} = 1** , then **B(1, p) = p^k⋅(1-p)^(1-k)** . If **k = 1** , then the outputted probability is **p** since the second term gets raised to the power of 0. When **k = 0** , the outputted probability is **1 – p** (the probability of failure as discussed before). In this sense it can be interpreted as a piecewise function, like so:

This is the probability mass function for the Bernoulli distribution; it is the probability that our discrete variable **y** is equal to exactly **1** . The following figure illustrates the Bernoulli distribution.

These variables correspond to ones we have already formulated. The success indicator (number of successes = 1) **k** corresponds to targeted output **y** (either no success or a success), and the probability **p** is our prediction **P(y = 1|x;w)** . This function can hence also be interpreted as an accuracy function — when **y = 1** and we predict with the probability **P(y = 1|x;w) = 0.6** , then the distribution function will return an accuracy of **0.6** (or error of **0.4** ). If **y = 0** , the function would return an accuracy of **0.4** (or error of **0.6** ).

Now, the probability mass function highlighted above for each individual prediction (per feature vector **x** ) can be multiplied together to produce the joint probability of all these individual cases occurring together, of which there are **m** . Thus, we multiply all the individual accuracies together to create one composite accuracy over our entire dataset. We call this the likelihood function:

The likelihood function is in a sense the opposite of our traditional probability prediction function **sigmoid(w^Transpose⋅x)** , because the likelihood function is the probability that a given **w** occurs given our outcomes **y** , which is why our function is written as **L(w|y)** . A higher value for **L(w|y)** implies that the given **w** is more likely to occur given the outcomes (and input data) and the accuracy is greater.

Our goal, now, is to maximize **L(w|y)** to find the weights **w** and hence probabilities **P(y = 1|x;w)** that makes our data the most likely, since it is what we observed! Maximizing the likelihood function **L(w|y)** is known as the maximum likelihood estimation. To find the maximum point, we can apply basic differentiation: compute the derivative of **L(w|y)** , set the derivative to zero, and solve for the global maximum. However, an issue is presented:

It is unclear how we can compute the derivative of a product of terms of arbitrary length. Instead, what we will do is take the log of the likelihood function, denoted as the log-likelihood function. Since the log(x) function is monotonically increasing, the maximum coordinate for the likelihood function **(w, L(w|y))** and log-likelihood function **(w’, L’(w|y))** will have equal x-coordinates/optimal values for the weights such that **w = w’** even though **L(w|y) ≠ L’(w|y)** , **L’(w|y) = logL(w|y)** . Why we choose log-likelihood, though, becomes clear after this observation:

This is because the logarithm of products of terms is equal to the addition of separate logarithms of the separate terms. With this type of function, it’s very easy to perform differentiation. Let us now compute the log-likelihood function (we actually use the natural logarithm):

This is the final representation that we will use for the log-likelihood function, and we will denote it as **ℓ(w)** . **ℓ(w)** has interpretable properties; if **y = 1** , the accuracy will tend to negative infinity (not accurate at all!) as **P(y = 1|x;w)** tends to **0** , and similarly for when **y = 0** as **P(y = 0|x;w)** tends to **1** . Note that since the sigmoid function asymptotes horizontally at **y = 0** and **y = 1** , our accuracy value is never undefined. When **P(y = 1|x;w) = y** the added accuracy is ln(1) — zero — which is greater than the other accuracies which are negative.

Our job is to now compute the derivative so that we can perform optimization. It is important to note that, though this detail was skipped over before, we will be using partial derivatives with respect to **w** . If we used regular derivatives, it would imply that the variables **y** and **x** (among others including **m** ) would be non-constant and functions of **w** , and we would need to consider their derivatives with respect to **w** by applying the chain rule. However, in our context, our cancer tumor dataset stays constant over all procedures and are not functions of **w** , and hence we using partial derivatives to express and imply this. Our first obstruction is the use of the summation, but we can quickly eliminate this using differentiation rules:

Before we move ahead, let’s precompute the derivative of the probabilistic sigmoid function:

This will help us when we apply the chain rule. Now, let’s compute our main partial derivatives, we’ll express our teams back in terms of the probability prediction function to simplify the process. One matrix calculus rule is employed:

The simplified expression for the derivative of our log-likelihood function **ℓ(w)** is presented in the final step. With this first order derivative in hand, we are now equipped to perform numerical maximization to find the most optimal values for **w** .

Now we set our derivative to 0 to find a maximum point and hence solve for optimum **w** . A convenient property of the derivative of the log-likelihood function **ℓ(w)** is always convex, and hence will only have one turning point (the global maximum) and no other local extremas that we need to evaluate.

Unfortunately, we hit another roadblock; this function is a transcendental equation and there does not seem to be a way to solve for **w** in closed form. In fact, there is no known solution for this equation except in extremely trivial cases ie. when input features are categorical instead of real (eg. 0 and 1) and there are only two observations/data points. Said cases are so rare in real world datasets that we will not explore further.

So now, instead of solving this equation directly we will perform an iterative optimization (where we “move” towards the maximized value over time). One such example is gradient ascent, where we set **w** to some initial guess (usually just a vector of 0s is fine) and then slowly perturb/update it to a more optimal state, until we “converge” and are at an (almost) completely optimal state where the gradient tends to 0. We denote this optimal state of **w** as **w_φ** . Each weight can either be increased or decreased — and only one of these operations will decrease the gradient and make it closer to 0. Hence we want to, over time, progress **w** to become closer to the global maximum. The following figures illustrate this notion of an iterative procedure.

In each of these figures, I illustrate the procedure for a single arbitrary weight **w_j** , just because it will be easier (and possible!) to visualize in 2-dimensions rather than n-dimensions. Hopefully they provide intuition to the name “gradient ascent”, since we slowly but surely climb up the gradient to reach the maximum point. In the first figure, we are at some wholly suboptimal state. We infer that the maximum point is to the right of our current position (and is far ahead), so we take a fairly large step, which leads into the next figure. It’s important to note that in these two figures we have actually spanned over 99 iterations. This is not a fixed number by any means (it will depend on your problem and dataset), but it provides a general guideline for the magnitude of steps required by this procedure. In the second figure, we are closer to the maximum but not there yet. Notice that the gradient is much smaller at this point, so we take even smaller steps over a greater number of iterations, until at iteration 1000 (for example) we have “converged” at the global maximum and no longer move.

So, how do we ultimately infer both the direction and magnitude of each step in gradient ascent? Well, at each iteration, we can take the current gradient (using the derivative computed before) to calculate both variables. In particular, note that if **w <** **w_φ** then we are to the left of the global maximum and need to move right by increasing the magnitude of **w** . At this point, since **ℓ(w)** would be increasing, the gradient would be positive. If **w >** **w_φ** , we would need to decrease the magnitude of **w** . At this point, since **ℓ(w)** would be decreasing, the gradient would be negative. This is a nice property: when we need to increase **w** , the gradient is positive, when we need to decrease it, the gradient is negative. Hence, we can add the gradient vector (called a “Jacobian matrix”) to the current value for **w** at each update stage. However, these gradients may be too large, causing us to overshoot the maximum point and end up diverging — instead of converging — from the maximum. So we usually multiply the gradients by a small number **α** , where **|α| < 1** . **α** is a hyper-parameter, meaning that one can try different values for it until a suitable one is found. Thus, our update rule is:

When **w = w_φ** , the gradient would be zero and we would update by 0 (not update at all). Hence, once we reach the maximum point, we get stuck and converge. It’s important to note that, when actually computed, we don’t reach the maximum point, because at each step the gradient becomes smaller and smaller. Instead, we tend towards the maximum point so **w = w_φ – ε** or we slightly overshoot so **w = w_φ + ε** , where **ε** is an extremely small number. This is rarely an issue, however; we usually set a maximum number of iterations (another hyper-parameter) and then stop after said iterations are complete. *Theoretically* , however, we can assert the following:

And that wraps up gradient ascent. The question, however, remains — how can we compute, by hand, a Jacobian matrix on a proper dataset with complex operations, updating our weights thousands of times? Well, the reality is that we need to have a computer to perform gradient ascent. That’s okay — all of these applications (tumor cancer detection, spam detection, image recognition) exist digitally. However, even with this in mind, it can take up to weeks or months to train on a large-scale dataset used in industry, because of the size of the Jacobian matrix to compute at each iteration and the number of iterations usually required. For this reason, other optimization methods like Newton-Raphson are used. The Newton-Raphson method converges in fewer iterations, but at the same time each iteration requires the computation of a Hessian matrix (second-order derivatives), which is more computationally expensive and time taking. Thus, optimization for these non-linear models is actually a major bottleneck for applications being made with this technology.

H opefully this article introduced the statistical and probabilistic interpretations of logistic regression, as well as some of the derivations behind the mathematical content you’ve seen before. Recently, I’ve been thinking about how many machine learning algorithms can be quite black-box and have been exploring interpretations of their function (which I think is going to be important for us to achieve general intelligence). And that is why I’m especially excited to attend this talk at ICML 2016!