# Deep Dive Into Logistic Regression: Part 1

Logistic regression is arguably the most widely used machine learning algorithm in production systems when it comes to classify or predict the likelihood of some events to happen, often  in the context of modelling online users behaviour like e.g. the likelihood of a user clicking (a.k.a CTR estimation) or buying something (well, factorization machines are getting some serious momentum as well, to be discussed in some future posts). There is a reason for that: logistic regression is incredibly powerful, scalable, simple to implement and blazing fast to apply online once the model was trained offline.

In this post, we’ll deep dive into the theory behind logistic regression, giving the intuition behind its core concepts and its multiple faces across various fields of statistics and computer science. This will involve some maths, but nothing too deep assuming you have some notions of calculus and core statistics.

In a future post (yet to be written 🙂 ) we’ll be much more concrete and deep dive into the implementation details of logistic regression, and go over some tricks like the hashing trick and the per coordinate adaptive learning rate which are making logistic regression works very well in practice on real (big) data sets.  In this future post, we’ll go over a beautifully simple and elegant implementation of online logistic regression including all those tricks, and also use a very powerful and popular software to use it at scale (Vowpal Wabbit).

## A classical derivation of logistic regression

We’ll start by introducing a standalone description of logistic regression, similar to what you can find in any classical introduction to machine learning course (e.g. that one to cite the most popular of them all).

So you have a training set of N examples  $$\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}$$ where $$x^{(i)} ∈ \{0, 1\}^d$$ is a sparse binary feature vector in a d-dimensional space, a.k.a. the signals or features of the $$i$$th training example (more on that signals representation later), and $$y_i \in \{0,1\}$$ is the label associated to that example (which could represent a click/non click, spam/not spam, malignant/benign, …):

$$\left\{ \begin{array}{ll} x_1^{(1)},…,x_d^{(1)} & y_1 \\ … \\ x_1^{(N)},…,x_d^{(N)} & y_N \end{array} \right.$$

To make a prediction for a given signal vector $$x ∈ \{0, 1\}^d$$ ,  the logistic regression model proposes to take a linear combination $$\theta^Tx$$  where $$\theta$$ is a vector of parameters (weights) $$\theta_1, … , \theta_n$$ , and to project it into the $$[0..1]$$  range by applying the logistic (or sigmoid) function directly to that linear product,  giving the following model representation:

$$h_{\theta}(x) = logistic( \theta^Tx ) = \frac{1}{1+e^{-\theta^Tx}}$$

The usual interpretation of $$h_{\theta}(x)$$  is that it represents the estimated probability that $$y=1$$  on input $$x$$, in other words:  $$h_{\theta}(x) = p(y=1|x)$$ .  Then, if you have to use that number to predict weather $$y = 1$$ or $$y =0$$, some threshold is picked , either simply 0.5 (i.e. predicting $$y = 1$$ when  $$h_{\theta}(x) >= 0.5$$ and 0 otherwise) or some empirically chosen threshold using the classifier ROC curve (c.f. my other post for more details on that).

Note that logistic regression is a linear classifier given that its decision boundary is a linear combination of the input. Indeed, if your threshold is e.g. 0.5, then you have $$y = 1$$ when  $$h_{\theta}(x) >= 0.5$$ , i.e. when $$\theta^Tx >= 0$$ which is a linear decision boundary.

Let’s now talk about the cost function which is the most important part when building a model given that it is what need to be minimised on the training data to learn the optimal weight vector$$\theta$$. Given the model representation, we cannot take a standard cost function based on MSE because it would make it non convex. All the power  of logistic regression is in its cost function which looks as follow:

$$Cost(h_{\theta}(x),y) = \left\{ \begin{array}{ll} -log(h_{\theta}(x)) & \textrm{if} \quad y =1 \\ -log(1-h_{\theta}(x)) & \textrm{if} \quad y =0 \end{array} \right.$$

The beauty behind that cost function is first that it is very intuitive, because when you predict 0 instead of 1 (or 1 instead of 0), then your cost tends to infinity (and thus you penalize the learning algorithm by a very large cost), but most importantly, this cost function is convex (check here for a proof), thus allowing to use any standard gradient descent based optimization algorithm .

Note that this function can be written $$-[y\thinspace log(h_{\theta}(x)) +(1-y) log(1-h_{\theta}(x))]$$  (just replace $$y$$  by 0 or 1 to be convinced). We’ll denote $$Cost(\theta)$$ the average cost on the whole training set$$\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}$$ , which is defined as:

$$Cost(\theta) = -\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))]$$

This is also sometimes called the logarithmic loss. You can define a multi-class version of it (when your output can take more than 2 values) , see e.g. here or here for some intuitive explanations.

So, bottom line, we need to find the optimal weight vector $$\theta$$ by solving $$\underset{\theta}{min} \thinspace Cost(\theta)$$ . To do so, gradient descent is the natural tool. We simply need to compute the partial derivative of $$Cost(\theta)$$ according to each weight $$\theta_j$$ of  $$\theta$$,  i.e. $$\frac{\partial }{\partial \theta_j} Cost(\theta)$$ . We won’t go into the details of the actual derivative calculation (you can find it e.g. here ) but just remember the notations:  the $$i$$ training example $$x^{(i)}$$  is  a vector $$(x_1^{(i)},…,x_d^{(i)})$$ , and  $$\theta^Tx^{(i)} = \theta_0 + \theta_1 x_1^{(i)} + … + x_d^{(i)}$$ and thus, for instance, $$\frac{\partial }{\partial \theta_j} \theta^Tx^{(i)} = x_j^{(i)}$$ . The result of the calculation of the partial derivative gives:

$$\frac{\partial }{\partial \theta_j} Cost(\theta) = \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) – y_i )x_j^{(i)}$$

This concludes all what is needed to solve $$\underset{\theta}{min} \thinspace Cost(\theta)$$ to find  the optimal weight vector $$\theta$$ from our training data. Indeed,  assuming some learning rate $$\alpha$$ , we simply have to iterate enough times over updating all the weights $$\theta_j$$ of $$\theta$$  using the gradient step below, until we observe that the cost is not reducing anymore :

$$\theta_j = \theta_j \thinspace – \alpha \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) – y_i )x_j^{(i)}$$

## How to interpret the learned weights?

At the end of your learning procedure via gradient descent as described above, you end with an “optimal”  weight vector $$\theta = (\theta_0,…, \theta_d)$$ , with $$\theta_j$$   the weight associated with the input signal $$x_j$$ . In a simple linear regression model, the interpretation for that weight would be that if the corresponding signal  $$x_j$$ increases by one unit, then the predicted output increases by $$\theta_j$$  units. In logistic regression it cannot be really interpreted that way given that we’re dealing with the sigmoid function and probabilities.

To understand how to interpret the learned weights in logistic regression, we first need to define and understand the notion of odds ratio. Let’s say that the probability of some event to happen (e.g. a basketball team winning a game) is $$p=0.8$$ . The probability of them loosing is $$1-p = 0.2$$ . The odds ratio is simply defined as the ratio between probability of success  and probability of failure,  $$\frac{p}{1-p}$$ i.e. 0.8 / 0.2 = 4 in our example. The interpretation is that the odds for the basketball team to win are 4 to 1.

How does that relate to logistic regression? To answer, you just need to know that the inverse function of the logistic function is the logit function . We thus have:

$$logit(h_{\theta}(x)) = logit(logistic(\theta^Tx)) = \theta^Tx$$

Let’s remind that  $$h_{\theta}(x)$$ represents the probability of the outcome being 1 (given a signal vector  $$x$$ ). Let’s denote that probability p. We thus have:

$$logit(p) = \theta^Tx$$

Now the interesting part is that $$logit(p) = log(\frac{p}{1-p})$$ . Noticed $$\frac{p}{1-p}$$? Yep, that’s the odds ratio defined above 🙂 . In other words, logistic regression is a model relating the log odds probability of the outcome as a linear combination of the input signals:

$$log(\frac{p}{1-p}) = \theta_0 + \theta_1x_1 + … + \theta_dx_d$$

We can now interpret the meaning of a weight $$\theta_j$$ : if the signal $$x_j$$ increases by one unit (or if it is present in case it is a boolean signal), then it increases by $$\theta_j$$ the log odds of the outcome. Even more interpretable, if you take the exponent of both sides in the expression above you get:

$$\frac{p}{1-p} = e^{ \theta^T x} = \prod\limits_{j=0}^{d}e^{ \theta_j x_j}$$

which gives a direct relation with the odds and thus an even more simple interpretation of the weight $$\theta_j$$  : the value  $$e^{\theta_j}$$ directly gives you the increase in the odds of the outcome if the value signal $$x_j$$ increases by one unit (or if it is present in case it is a boolean signal) . Example: if one of your signal is a boolean “already won NBA finals” for your predicting probability of a basketball team to win, and that it gets a weight of say $$1.2$$ , the interpretation would be: if the team already won an NBA finals, then it increases its odds of winning by $$e^{1.2} \approx 3.32$$  , meaning an increase of 232% (i.e. $$(3.32-1)*100$$ ) in the odds of winning.

Bottom line: If a signal $$x_j$$ ends up with  a weight $$\theta_j$$ in logistic regression, it means that if the signal increases by one unit (or just if it is equal to 1 in case of boolean signal), then it increases the odds of the outcome to be 1 (e.g. a click happening) by $$(e^{\theta_j} -1)*100$$%.

## Log Loss vs. Cross Entropy vs. Negative Log Likelihood??

The concept behind logistic regression is so remarkable and efficient that it arose from  various different fields, including different branches of computer science and statistics, and often, you stumble upon different ways of deriving it, including various different names for the cost function or what needs to be maximised or minimised etc.., which might make the whole thing quiet confusing. For instance, in NLP, logistic regression (more precisely the multi-class version of it) is often called Maximum Entropy (or MaxEnt), first defined in that paper .  In this section, i’ll just recall the probabilistic view of logistic regression and connect the dots between cross-entropy, MLE, negative log likelihood, and logLoss .

First, entropy is a powerful concept invented by Claude Shanon who basically set the ground for information theory (if you want to get the gist of it from scratch, check this very nice vulgarization video). Cross-entropy is often used as a way to measure the difference between two probability vectors in the context of multinomial classification (a generalisation of the binary classification problem we’re interested in ), c.f. e.g that short video .  The “binary” version of cross entropy (i.e. its particular case when you have only two output classes like in our setting) is defined over the two vectors $$p = (y, 1-y)$$ and $$q = (\hat{y} , 1-\hat{y})$$ where $$y$$ is the observed true value and $$\hat{y}$$  is the prediction:

$$H(p,q) = -\sum_{i=1} p_i log q_i \\ = -ylog(\hat{y}) – (1-y)log(1-\hat{y})$$

This gives you a measure of “disorder” between the two vectors (the true one and the predicted one). In our case, $$\hat{y} = h_{\theta}(x)$$ , so the average cross entropy on the whole training set is:

$$-\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))]$$

Wait, did you notice? This is exactly the log loss cost function we had in the first section!!

And there is more.

Let’s move to another very popular concept in machine learning called Maximum Likelihood Estimation (MLE) . MLE is a simple yet very powerful tool to estimate a (set of) parameter(s) based on observed data (if you have never heard about it and need an explanation “for dummies” then you can check this video for the high level idea and that one for a specific example). When you want to use MLE, the first step is to write down the probability of observing the data (in our case the $$y_i, …, y_N$$ ) given the input signals $$x^{(1)}, …, x^{(N)}$$ and the vector of parameters $$\theta$$ :

$$Pr(y_1, …, y_N |x^{(1)}, …, x^{(N)} , \theta) = \prod\limits_{i=1}^{N}Pr(y_i| x^{(i)}, \theta)$$

Given that in our case $$y_i$$ is either 0 or 1, a common trick is to write that:

$$Pr(y_i| x^{(i)}, \theta) = \\ Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i}$$

The actual likelihood function always inverse the parameters in the notation to make clear that we are looking for an optimal $$\theta$$ given the fixed observations of the training set:

$$L(\theta , x^{(1)}, …, x^{(N)} | y_1, …, y_N ) = \\ \prod\limits_{i=1}^{N} Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i}$$

Note that the same form could have been obtained without the need for the previous trick by simply noticing that in the case of binary classification, the proper likelihood function is Bernoulli .   Now, we denote $$Pr(y_i=1 |x^{(i)}, \theta)$$  as $$h_{\theta}(x)$$ (exact same notation as in the first section). We’ll also denote $$L(\theta)$$ the likelihood function for convenience. MLE thus suggest we find the $$\theta$$  maximizing that likelihood function (hence the name maximum likelihood), in other words:

$$\underset{\theta}{\arg\max} L(\theta) = \underset{\theta}{\arg\max} \prod\limits_{i=1}^{N}h_{\theta}(x^{(i)}) ^{y_i} \thinspace (1-h_{\theta}(x^{(i)}))^{1-y_i}$$

Since the next step is always to find a derivative of the likelihood, you almost always take the log of the likelihood  since it transforms the product into a sum (on which it is much easier to apply derivatives), and that the logarithm function is monotonic (strictly increasing), and thus maximizing the log likelihood is equivalent to maximizing the likelihood, as well as minimizing the negative log likelihood. So applying a log on the above product gives:

$$\underset{\theta}{\arg\max} \thinspace log \thinspace L(\theta) = \\ \underset{\theta}{\arg\max} \sum\limits_{i=1}^{N} y_i log(h_{\theta}(x^{(i)})) (1-y_i)log(1-h_{\theta}(x^{(i)}))$$

Instead of looking for the maximum of the log likelihood, you can equivalently look for the minimum of the negative log likelihood. If you take the average negative log likelihood on the training set, what do you obtain? you guessed it, once again, the exact same log loss cost function we found both in the first section and also via cross entropy!!!

As a final link between logistic regression and other well known concepts in ML or statistics, logistic regression is often compared with Naive Bayes, see here (wikipedia), here (more detailed book chapter) and here (high level Quora answer). But the point is that naive bayes can be seen as a generative version of logistic regression (which is a discriminative model, here is a nice Quora discussion if you want to understand the difference between generative and discriminative models ).

Bottom line: in the context of logistic regression, when you’ll hear about log loss or cross entropy or negative log likelihood, you’ll now know why and how they are so closely related.

I hope you enjoyed  that post, and stay tuned for the future (much more concrete) post around how to implement and use logistic regression in practice!

# A Data Science Exploration From the Titanic in R

Kaggle offered this year a knowledge competition called “Titanic: Machine Learning from Disaster” exposing a popular “toy-yet-interesting” data set around the Titanic. The goal is to  predict as accurately as possible the survival of the titanic’s passengers based on their characteristics (age, sex, ticket fare etc…)

.

In that post, we’ll use that data set in order to:

1. Illustrate through a comprehensive example a set of useful tools/packages to do some predictive modelling from the R statistical framework.
2. Take the opportunity of the example to illustrate the process and kind of tricks that it takes to improve/tune a predictive model.

The whole code creating all the plots/stats and models exposed in that post and also building an output reaching a score 0.79426 on the leaderboard can be found on github here  or on Rpubs here (built with Knit HTML from R studio ).

Preliminaries

First, download the test and training set from the data page of the competition (here is a zip of the two small files in case the page from kaggle is removed in the future).

Once you loaded the dataset into a data frame, you can do some data analysis/explorations.  Even though that part is critical to start playing and feeling the data, I won’t go into details because there already were blog posts written about that, in particular that one is a very nice R version of the getting started with excel data exploration tutorial on Kaggle’s website.

However, i’ll just illustrate a nice simple and effective way of observing one important aspect of the data: missing values.

The Amelia R package is a toolbox around missing values, in particular for performing imputation of the missing data. Getting a visual and global insight about missing data in the test and train set is as simple as that:

library(Amelia)
#... code for loading test and train data in a data frame
missmap(rawdata, main = "Missingness Map Train")
missmap(test, main = "Missingness Map Test")

From those maps, you can immediately observe that only the age feature is badly suffering from missing data. Considering how small is the training set, you can hardly just ignore records having a missing age. We’ll see later in the post what kind of strategy we can use to deal with that issue.

## Building/Tuning models with Caret

The caret package is a kind of toolbox for homogenising the many existing R packages for classification and regression and also provide out of the box a standard way to perform common tasks like model parameters tuning and more. Also, the author (Max Khun) did an amazing job at documenting the package in the vignettes (here or here for a longer but older version) and on the package dedicated website.

Here is a snippet of code where i successively train a random forest and a gradient boosting machine (GBM) using the same train function from caret.

forest.model1 <- train(survived ~ pclass + sex + title + sibsp +parch ,
data.train,
importance=TRUE)

fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)

gbm.model2 <- train(survived ~ pclass + sex + title + sibsp +parch ,
data.train,
distribution = "gaussian",
method = "gbm",
trControl = fitControl,
verbose = FALSE)

We’ll discuss later the features used in the formula but note the fitControl parameter which is passed in the call for training the GBM. This parameter allows to completely define the way the model parameters will be tuned. In that example, the model parameters of the GBM (namely interaction.depth, n.trees and shrinkage, see output below) were compared using a repeated 10-fold cross validation with accuracy being the metric for comparison, but everything is tuneable for that purpose (you can even pass a grid of specific values to compare for each model parameter).

712 samples
13 predictors
2 classes: 'yes', 'no'

No pre-processing
Resampling: Cross-Validation (10 fold, repeated 10 times)

Summary of sample sizes: 642, 640, 642, 641, 640, 640, ...

Resampling results across tuning parameters:

interaction.depth  n.trees  Accuracy  Kappa  Accuracy SD  Kappa SD
1                  50       0.8       0.565  0.0436       0.0964
1                  100      0.801     0.567  0.0436       0.0965
1                  150      0.801     0.568  0.0434       0.096
2                  50       0.795     0.548  0.0426       0.097
2                  100      0.801     0.559  0.0437       0.0999
2                  150      0.804     0.565  0.0435       0.1
3                  50       0.805     0.568  0.0449       0.102
3                  100      0.807     0.573  0.0464       0.106
3                  150      0.809     0.576  0.0442       0.1

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were interaction.depth = 3, n.trees = 150 and shrinkage = 0.1.

Also, you can easily visualize variable importance (you need to specify importance=TRUE in the train function, as we did, for having it):

You can observe that the variable value with the most importance is the title Mr . The interesting part is that the feature “title” was not initially in the data set and was artificially created (we’ll detail a bit more about it later in the post). But overall, caret offers a very nice framework for easy models comparison and tuning with proper/uniform built-in cross-validation routines.

One thing though that is so true and said in perfect way in this must-watch killer talk: “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. We’ll see an example illustrating that later in the post.

Pick the best threshold for your classifier using ROC curves

Most classifiers usually output the probability of an example belonging to a specific class (here ‘survived’ or ‘died’). When the only matter is to optimise accuracy (as it is usually the case in competitions), it is useful to pick the optimal threshold/cutoff for assigning one class or the other.

ROC curves can be used for that and also to assess the robustness of your model. If you’ve never heard about ROC curves, this article gives the basic intuition and that paper goes much more into details while still being crystal clear (i warmly recommend the later if you’re interested in the subject). For a standalone very clear example in R, this post is what you need (the code below is inspired from it).

The pROC package allows to easily analyse and display ROC curves. Here, we’re interested in the threshold corresponding to the top left corner of the curve maximising sensitivity and specificity.

#code inspired from http://mkseo.pe.kr/stats/?p=790
result.predicted.prob.model1 <- predict(forest.model1, data.test, type="prob")
result.roc.model1 <-  roc(data.test$survived, result.predicted.prob.model1$yes)
plot(result.roc.model1, print.thres="best", print.thres.best.method="closest.topleft")

result.coords.model1 <- coords(  result.roc.model1, "best", best.method="closest.topleft",
ret=c("threshold", "accuracy"))
result.coords.model1

Which will output both a graph:

and high level information about the curve, e.g. :

Call:
roc.default(response = data.test$survived, predictor = result.predicted.prob.model1$yes)

Data: result.predicted.prob.model1$yes in 78 controls (data.test$survived yes) > 65 cases (data.test$survived no). Area under the curve: 0.931 Note in particular the Area under the curve (a.k.a AUC) data point which is sometimes used to assess the robustness/quality of your model, although it has been questioned a lot and often criticised to not be a precise/useful classification performance measure (a small discussion around it can be found here). In other words, you’re often better off relying on your K-fold cross validation measures to assess your out-of-sample performance (c.f. the previous section on caret). Tweak and tricks I’ve hinted earlier that the number of missing ages was too high and the training set too small to just ignore the records having a missing age. At least for me, any attempt to impute the missing ages (either in naive or more sophisticated ways) didn’t lead to any significant accuracy improvement on the 10-fold cross validation test. Turns out that extracting the title (i.e. Mr or Mrs. etc…) in the Name attribute of the data set did lead to an improvement (from the competition’s forums, i saw that few people used that feature as well). Let’s have a look at the age distributions per extracted title in the training set (some rare occurrences of titles were aggregated into larger titles, e.g. “Capt”, “Col”, “Major”,”Sir”, “Don”,”Dr” were mapped to “Mr”): This somehow matches the intuition (though I didn’t know that in apparently old/traditional english, “Master” denotes a young/unmarried man). And it also makes sense intuitively that Title is a good proxy for the too many missing ages, allowing for totally ignoring the age feature and thus keep all the data in the training set, without introducing any potential noise with an imputation method. When i’ve plugged in this new Title feature into the random forest, i saw an improvement from 0.785 to 0.801 on my 10-fold cross validation out-of-sample accuracy estimation, and it was reflected in my submission on the public leaderboard where i jumped to the top 5% best submissions at that time. Note that an improvement on your cross validation is not always reflected on the leaderboard, sometimes even the opposite (c.f “Lesson One” from this very cool blog post by @rouli, highly recommended). Note also that this particular competition lasts 1 year and was just for learning purpose, so there are thousands and thousands of participants, including not few people who obviously spent useless time to extract the answers from publicly available lists (e.g. here or here) to get a near perfect score (though you could use them to know you near real final score on the private leaderboard if you can’t wait the end of the competition, but still kind of pointless). Finally, more things can be done to try improve the accuracy even more, an obvious one being to combine multiple models together (majority vote is often used in binary/multi-class settings) but we won’t cover that in this post. Conclusion We explored on a comprehensive example how R can be used to build and tune quickly robust predictive models which are significantly outperforming the baseline. Of course, it is somehow a toy example but it was interesting enough to explore some important aspects needed when building predictive models. For much bigger data sets (both in terms of training set size and/or number of features in the data) you might need to introduce different/additional technical and theoretical tools that we might explore in future posts. Also, note that a competition settings might be very different than a real production settings. Not only talking about why Netflix never implemented the model that won the$1M challenge,  but also the whole infrastructure that you’d need to build in order to do big data science at scale on many different problems (Scala is quickly becoming a trend around that, check those killer slides and talk by my friend @BigDataSc from LinkedIn and @ccservers from eBay for more on that ).

I’ll conclude by citing again this awesome sentence from this must-watch talk by @nmkridler : “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. I really think that this is what data science is all about.