Logistic regression is arguably the most widely used machine learning algorithm in production systems when it comes to classify or predict the likelihood of some events to happen, often in the context of modelling online users behaviour like e.g. the likelihood of a user clicking (a.k.a CTR estimation) or buying something (well, factorization machines are getting some serious momentum as well, to be discussed in some future posts). There is a reason for that: logistic regression is incredibly powerful, scalable, simple to implement and blazing fast to apply online once the model was trained offline.

In this post, we’ll deep dive into the theory behind logistic regression, giving the intuition behind its core concepts and its multiple faces across various fields of statistics and computer science. This will involve some maths, but nothing too deep assuming you have some notions of calculus and core statistics.

In the second part of this series, we’ll be much more concrete and deep dive into the implementation details of logistic regression, and go over some tricks like the hashing trick and the per coordinate adaptive learning rate which are making logistic regression works very well in practice on real (big) data sets. In that second post we’ll also go over a beautifully simple and elegant implementation of online logistic regression including all those tricks. In the third part of this series we’ll demonstrate the usage of a very powerful and popular library implementing logistic regression (and more) at scale: Vowpal Wabbit.

For now, let’s start with the theory 🙂

## A classical derivation of logistic regression

We’ll start by introducing a standalone description of logistic regression, similar to what you can find in any classical introduction to machine learning course (e.g. that one to cite the most popular of them all).

So you have a training set of N examples \(\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}\) where \(x^{(i)} ∈ \{0, 1\}^d\) is a sparse binary feature vector in a d-dimensional space, a.k.a. the signals or features of the \(i\)th training example (more on that signals representation later, especially in part 2 of that series), and \(y_i \in \{0,1\}\) is the label associated to that example (which could represent a click/non click, spam/not spam, malignant/benign, …):

$$\left\{

\begin{array}{ll}

x_1^{(1)},…,x_d^{(1)} & y_1 \\

… \\

x_1^{(N)},…,x_d^{(N)} & y_N

\end{array}

\right. $$

To make a prediction for a given signal vector \(x ∈ \{0, 1\}^d\) , the logistic regression model proposes to take a linear combination \( \theta^Tx \) where \(\theta\) is a vector of parameters (weights) \(\theta_1, … , \theta_n\) , and to project it into the \([0..1]\) range by applying the logistic (or sigmoid) function directly to that linear product, giving the following model representation:

$$h_{\theta}(x) = logistic( \theta^Tx ) = \frac{1}{1+e^{-\theta^Tx}} $$

The usual interpretation of \(h_{\theta}(x) \) is that it represents the estimated probability that \(y=1\) on input \(x\), in other words: \(h_{\theta}(x) = p(y=1|x)\) . Then, if you have to use that number to predict weather \(y = 1\) or \(y =0\), some threshold is picked , either simply 0.5 (i.e. predicting \(y = 1\) when \(h_{\theta}(x) >= 0.5\) and 0 otherwise) or any other threshold empirically chosen using the classifier’s ROC curve (c.f. my older other post for more details on that).

Note that logistic regression is a linear classifier given that its decision boundary is a linear combination of the input. Indeed, if your threshold is e.g. 0.5, then you have \(y = 1\) when \(h_{\theta}(x) >= 0.5\) . If you draw the sigmoid function \(g(z)\), you can see that it is >= 0.5 when \(z>=0\) . Thus \(h_{\theta}(x) >= 0.5\) when \(\theta^Tx >= 0 \) , which is a linear decision boundary.

Let’s now talk about the cost function which is the most important part when building a model given that it is what need to be minimised on the training data to learn the optimal weight vector\( \theta\). Given the model representation, we cannot take a standard cost function based on MSE because it would make it non convex. All the power of logistic regression is in its cost function which looks as follow:

$$

Cost(h_{\theta}(x),y) = \left\{

\begin{array}{ll}

-log(h_{\theta}(x)) & \textrm{if} \quad y =1 \\

-log(1-h_{\theta}(x)) & \textrm{if} \quad y =0

\end{array}

\right.

$$

The beauty behind that cost function is first that it is very intuitive, because when you predict 0 instead of 1 (or 1 instead of 0), then your cost tends to infinity (and thus you penalize the learning algorithm by a very large cost), but most importantly, this cost function is convex (check here for a proof), thus allowing to use any standard gradient descent based optimization algorithm .

Note that this function can be written \(-[y\thinspace log(h_{\theta}(x)) +(1-y) log(1-h_{\theta}(x))] \) (just replace \( y\) by 0 or 1 to be convinced). We’ll denote \(Cost(\theta)\) the average cost on the whole training set\(\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}\) , which is defined as:

$$Cost(\theta) = -\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))] $$

This is also sometimes called the logarithmic loss. You can define a multi-class version of it (when your output can take more than 2 values) , see e.g. here or here for some intuitive explanations.

So, bottom line, we need to find the optimal weight vector \( \theta \) by solving \(\underset{\theta}{min} \thinspace Cost(\theta) \) . To do so, gradient descent is the natural tool. We simply need to compute the partial derivative of \( Cost(\theta) \) according to each weight \( \theta_j \) of \( \theta \), i.e. \( \frac{\partial }{\partial \theta_j} Cost(\theta) \) . We won’t go into the details of the actual derivative calculation (you can find it e.g. here ) but just remember the notations: the \(i\) training example \( x^{(i)} \) is a vector \((x_1^{(i)},…,x_d^{(i)}) \) , and \(\theta^Tx^{(i)} = \theta_0 + \theta_1 x_1^{(i)} + … + x_d^{(i)}\) and thus, for instance, \( \frac{\partial }{\partial \theta_j} \theta^Tx^{(i)} = x_j^{(i)} \) . The result of the calculation of the partial derivative gives:

$$ \frac{\partial }{\partial \theta_j} Cost(\theta) = \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) – y_i )x_j^{(i)} $$

This concludes all what is needed to solve \(\underset{\theta}{min} \thinspace Cost(\theta) \) to find the optimal weight vector \( \theta \) from our training data. Indeed, assuming some learning rate \( \alpha \) , we simply have to iterate enough times over updating all the weights \( \theta_j \) of \( \theta \) using the gradient step below, until we observe that the cost is not reducing anymore :

$$ \theta_j = \theta_j \thinspace – \alpha \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) – y_i )x_j^{(i)} $$

## How to interpret the learned weights?

At the end of your learning procedure via gradient descent as described above, you end with an “optimal” weight vector \( \theta = (\theta_0,…, \theta_d) \) , with \( \theta_j \) the weight associated with the input signal \( x_j \) . In a simple linear regression model, the interpretation for that weight would be that if the corresponding signal \( x_j \) increases by one unit, then the predicted output increases by \( \theta_j \) units. In logistic regression it cannot be really interpreted that way given that we’re dealing with the sigmoid function and probabilities.

To understand how to interpret the learned weights in logistic regression, we first need to define and understand the notion of odds ratio. Let’s say that the probability of some event to happen (e.g. a basketball team winning a game) is \(p=0.8 \) . The probability of them loosing is \(1-p = 0.2 \) . The odds ratio is simply defined as the ratio between probability of success and probability of failure, \( \frac{p}{1-p} \) i.e. 0.8 / 0.2 = 4 in our example. The interpretation is that the odds for the basketball team to win are 4 to 1.

How does that relate to logistic regression? To answer, you just need to know that the inverse function of the logistic function is the logit function . We thus have:

$$ logit(h_{\theta}(x)) = logit(logistic(\theta^Tx)) = \theta^Tx $$

Let’s remind that \(h_{\theta}(x) \) represents the probability of the outcome being 1 (given a signal vector \(x \) ). Let’s denote that probability p. We thus have:

$$ logit(p) = \theta^Tx $$

Now the interesting part is that \(logit(p) = log(\frac{p}{1-p}) \) . Noticed \(\frac{p}{1-p}\)? Yep, that’s the odds ratio defined above 🙂 . In other words, logistic regression is a model relating the log odds probability of the outcome as a linear combination of the input signals:

$$log(\frac{p}{1-p}) = \theta_0 + \theta_1x_1 + … + \theta_dx_d $$

We can now interpret the meaning of a weight \( \theta_j \) : if the signal \( x_j \) increases by one unit (or if it is present in case it is a boolean signal), then it increases by \( \theta_j \) the log odds of the outcome. Even more interpretable, if you take the exponent of both sides in the expression above you get:

$$\frac{p}{1-p} = e^{ \theta^T x} = \prod\limits_{j=0}^{d}e^{ \theta_j x_j} $$

which gives a direct relation with the odds and thus an even more simple interpretation of the weight \( \theta_j \) : the value \( e^{\theta_j} \) directly gives you the increase in the odds of the outcome if the value signal \( x_j \) increases by one unit (or if it is present in case it is a boolean signal) . Example: if one of your signal is a boolean “already won NBA finals” for your predicting probability of a basketball team to win, and that it gets a weight of say \(1.2\) , the interpretation would be: if the team already won an NBA finals, then it increases its odds of winning by \( e^{1.2} \approx 3.32 \) , meaning an increase of 232% (i.e. \((3.32-1)*100\) ) in the odds of winning.

**Bottom line**: If a signal \( x_j \) ends up with a weight \( \theta_j \) in logistic regression, it means that if the signal increases by one unit (or just if it is equal to 1 in case of boolean signal), then it increases the odds of the outcome to be 1 (e.g. a click happening) by \( (e^{\theta_j} -1)*100 \)%.

## Log Loss vs. Cross Entropy vs. Negative Log Likelihood??

The concept behind logistic regression is so remarkable and efficient that it arose from various different fields, including different branches of computer science and statistics, and often, you stumble upon different ways of deriving it, including various different names for the cost function or what needs to be maximised or minimised etc.., which might make the whole thing quiet confusing. For instance, in NLP, logistic regression (more precisely the multi-class version of it) is often called Maximum Entropy (or MaxEnt), first defined in that paper . In this section, i’ll just recall the probabilistic view of logistic regression and connect the dots between cross-entropy, MLE, negative log likelihood, and logLoss .

First, entropy is a powerful concept invented by Claude Shanon who basically set the ground for information theory (if you want to get the gist of it from scratch, check this very nice vulgarization video). Cross-entropy is often used as a way to measure the difference between two probability vectors in the context of multinomial classification (a generalisation of the binary classification problem we’re interested in ), c.f. e.g that short video . The “binary” version of cross entropy (i.e. its particular case when you have only two output classes like in our setting) is defined over the two vectors \( p = (y, 1-y)\) and \(q = (\hat{y} , 1-\hat{y}) \) where \( y\) is the observed true value and \( \hat{y} \) is the prediction:

$$ H(p,q) = -\sum_{i=1} p_i log q_i \\ = -ylog(\hat{y}) – (1-y)log(1-\hat{y})$$

This gives you a measure of “disorder” between the two vectors (the true one and the predicted one). In our case, \(\hat{y} = h_{\theta}(x) \) , so the average cross entropy on the whole training set is:

$$ -\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))] $$

Wait, did you notice? This is exactly the log loss cost function we had in the first section!!

And there is more.

Let’s move to another very popular concept in machine learning called Maximum Likelihood Estimation (MLE) . MLE is a simple yet very powerful tool to estimate a (set of) parameter(s) based on observed data (if you have never heard about it and need an explanation “for dummies” then you can check this video for the high level idea and that one for a specific example). When you want to use MLE, the first step is to write down the probability of observing the data (in our case the \( y_i, …, y_N \) ) given the input signals \( x^{(1)}, …, x^{(N)} \) and the vector of parameters \( \theta \) :

$$ Pr(y_1, …, y_N |x^{(1)}, …, x^{(N)} , \theta) = \prod\limits_{i=1}^{N}Pr(y_i| x^{(i)}, \theta) $$

Given that in our case \( y_i \) is either 0 or 1, a common trick is to write that:

$$Pr(y_i| x^{(i)}, \theta) = \\ Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i} $$

The actual likelihood function always inverse the parameters in the notation to make clear that we are looking for an optimal \( \theta \) given the fixed observations of the training set:

$$ L(\theta , x^{(1)}, …, x^{(N)} | y_1, …, y_N ) = \\ \prod\limits_{i=1}^{N} Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i} $$

Note that the same form could have been obtained without the need for the previous trick by simply noticing that in the case of binary classification, the proper likelihood function is Bernoulli . Now, we denote \(Pr(y_i=1 |x^{(i)}, \theta) \) as \(h_{\theta}(x)\) (exact same notation as in the first section). We’ll also denote \( L(\theta) \) the likelihood function for convenience. MLE thus suggest we find the \( \theta \) maximizing that likelihood function (hence the name maximum likelihood), in other words:

$$ \underset{\theta}{\arg\max} L(\theta) = \underset{\theta}{\arg\max} \prod\limits_{i=1}^{N}h_{\theta}(x^{(i)}) ^{y_i} \thinspace (1-h_{\theta}(x^{(i)}))^{1-y_i} $$

Since the next step is always to find a derivative of the likelihood, you almost always take the log of the likelihood since it transforms the product into a sum (on which it is much easier to apply derivatives), and that the logarithm function is monotonic (strictly increasing), and thus maximizing the log likelihood is equivalent to maximizing the likelihood, as well as minimizing the negative log likelihood. So applying a log on the above product gives:

$$ \underset{\theta}{\arg\max} \thinspace log \thinspace L(\theta) = \\ \underset{\theta}{\arg\max} \sum\limits_{i=1}^{N} y_i log(h_{\theta}(x^{(i)})) (1-y_i)log(1-h_{\theta}(x^{(i)})) $$

Instead of looking for the maximum of the log likelihood, you can equivalently look for the minimum of the negative log likelihood. If you take the average negative log likelihood on the training set, what do you obtain? you guessed it, once again, the exact same log loss cost function we found both in the first section and also via cross entropy!!!

As a final link between logistic regression and other well known concepts in ML or statistics, logistic regression is often compared with Naive Bayes, see here (wikipedia), here (more detailed book chapter) and here (high level Quora answer). But the point is that naive bayes can be seen as a generative version of logistic regression (which is a discriminative model, here is a nice Quora discussion if you want to understand the difference between generative and discriminative models ).

**Bottom line**: in the context of logistic regression, when you’ll hear about log loss or cross entropy or negative log likelihood, you’ll now know why and how they are so closely related.

I hope you enjoyed that post. If you want to get to the details allowing to make this work at scale and actually see an implementation connecting it all in 30 lines of python, continue to part 2 of this series 🙂 .

great work. very helpful and detailed explanation!

Thanks Wei for the feedback! Glad you liked it!