Deep Dive Into Logistic Regression: Part 3

In part 1 and part 2 of this series, we laid both the theoretical and practical foundations of logistic regression and saw how a state-of-the-art implementation can be written in roughly 30 lines of code. In this third (and last) post of the series, we'll demonstrate the use of a very effective and powerful library for building logistic regression models in practice: Vowpal Wabbit.

What is Vowpal Wabbit?

Vowpal Wabbit (VW) is a general purpose machine learning library which implements, among other things, logistic regression with the same ideas we presented in our previous post, like the hashing trick and per-coordinate adaptive learning rates (in fact, the hashing trick was made popular by that library). A big advantage of Vowpal Wabbit is that it is blazing fast. Not only because its underlying implementation is in C++, but also because it can use the L-BFGS optimization method. L-BFGS stands for "Limited-memory Broyden–Fletcher–Goldfarb–Shanno" and basically approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method using a limited amount of memory. This method is much more complex to implement than Stochastic Gradient Descent (which can be implemented in a few lines of code, as we saw in our previous post), but supposedly converges faster (in fewer iterations). If you want to read more about L-BFGS and/or understand how it differs from other optimisation methods, you can check this (doc from Vowpal Wabbit) or this (nice blog post). Note that L-BFGS was empirically observed to be superior to SGD in many cases, in particular in deep learning settings (check out that paper on that topic).

Input format, Namespaces and more

Many times, I've heard people give up on Vowpal Wabbit because of its input format, even after going quickly over its documentation. So let's try to present it through a toy (yet real) example that will be used throughout this post to illustrate the main concepts of Vowpal Wabbit. On top of that, I'll provide a helper tool (in the next section) to transform your tabular dataset into the VW input format easily.

The dataset we'll use can be found here; it represents a bank's attempt to predict whether a marketing phone call will end up in a bank term deposit by the customer, based on a bunch of signals such as socio-economic factors of the customer ("does he have a loan?", etc.).

The traditional way to represent such datasets is a TSV or CSV file, with the header giving the names of the signals and each line representing the values of a training example on each signal. Each line of the training set thus has a fixed number of fields, and missing values are just a blank cell or some specific value indicating that the field is missing. Typically, for that dataset, the header looks like this:

age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y

With y being the actual supervision (i.e. did the call end up in a bank term deposit?). And a typical training example looks like this:

58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no

In Vowpal Wabbit, there is no header, and each signal name is embedded in the training example itself. For example, the training example above can look like this in Vowpal Wabbit format:

-1 |i age:58 balance:2143 duration:261 campaign:1 pdays:-1 previous:0 |c job=management marital=married education=tertiary default=no housing=yes contact=unknown day=5 month=may poutcome=unknown

Let's discuss several important things here:

  • -1 says that this was a negative example.
  • The |i and |c are here to specify that the following features are part of the same feature namespace. Being part of a namespace simply means that all the features in the namespace will be hashed together into the same feature space (this relates to the hashing trick, c.f. the previous post of this series).
  • Here, I artificially created two namespaces: one for numerical features and another for categorical ones. But that was just to illustrate the idea of a namespace.
  • In practice, namespaces can be used for different reasons (check the doc here), but one particularly useful application is feature interactions:
  • For instance, in the command line, using --quadratic ic would combine all the features of the namespaces i and c in our example above to create 2-way interaction features on the fly. For instance, the values of age and job together would become a new signal (maybe if you are a certain age in a certain profession, you're more or less likely to make a bank term deposit).
  • Note as well that for the numerical features I used the colon ':' and for categorical ones I used '='.
  • Only the ':' will be interpreted by Vowpal Wabbit. Both in training and when applying the model, the weight of the corresponding numerical feature (let's say age) will be multiplied by the actual numerical value in the weighted linear product of the logistic hypothesis (more on that later).
  • The '=' is just cosmetic, for clarity. Technically, writing married instead of marital=married makes absolutely no difference for the training, except if the value married could show up in different contexts. E.g. if there were another signal childMarital indicating the marital status of the customer's children, then you'd have to differentiate whether the value married refers to the customer or to his children, in which case the feature-name prefix would be necessary. Note that if you put two such features in different namespaces, they could not be mixed together and the prefix would again not be necessary.
  • Note that for each signal, I've used the full name of the signal as a prefix (e.g. age or marital). We just saw that for categorical features this is not strictly required; for numerical signals though, it is (i.e. you cannot just throw in a number without context). Now, for huge training sets, you don't necessarily want a long string repeated millions (or more) of times. A good compromise is to have a mapping between signal names and very short strings (e.g. F1, F2, F3, …). In the following section, I provide some code that generates such a training set along with a signal-name mapping.
  • There is a nice answer on Quora here with a short cheat-sheet as a reminder of all this and of how to encode boolean, categorical, ordinal+monotonic or numerical variables in VW.
  • Last but not least, one thing I love about this format is that it is very well suited to sparse data. Say you have thousands of features, or maybe just a list of words: you don't care about the order of the features or about missing values, you just throw in the features with the right prefix and/or in the right namespace and you're done (see the small illustration right after this list). VW will then hash each of them into its proper bucket in its proper hashing namespace.
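For illustration, here is what a hypothetical sparse example could look like (the label, namespaces, tokens and feature names below are made up for the sake of the example, they are not part of the bank dataset): a bag of words in one namespace and a couple of metadata features in another, with no fixed-width schema and no placeholders for missing values:

1 |words wire transfer urgent account verify |meta country=fr device=mobile hour:23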

How to transform your TSV/CSV datasets into VW format

Most often, classification or regression training sets come in the form of TSV or CSV files, as mentioned previously. Transforming them into the VW input format is not difficult, but it does require a minimum of attention. Indeed, depending on the training set, the target variable (or label) might be a word like "yes/no" or a number like "1/0", while VW requires it to be -1/1 for logistic loss. Also, numerical and categorical signals need to be encoded differently (using e.g. ':' for numerical features, c.f. the remarks in the previous section).

I wrote a very simple Java (8) class that does this, find it here and feel free to use it. You'll just need to create (or edit an existing) method there to set up the characteristics of your dataset. It doesn't use any external library (other than pure Java 8 libraries), so if you don't have a Java IDE already installed, you can easily edit it from a text editor and compile/run it from the command line.

Then you simply specify:

  • the separator (e.g. ‘\t’ or ‘,’ or ‘;’)
  • the name of the target variable (as it appears in the header)
  • the value of the positive target variable in the dataset (e.g. ‘yes’ or ‘1’ or ‘click’)
  • two separate lists: the list of names of numerical variables and the list of names of categorical variables (the names must be in the header as well).

All those are specified as parameters inside a side method that you just invoke in the main (which then invokes the core method of that class, called tabular2VWGenerator). You have two examples of such side methods in the code: generateBankTraningSet, representing our dataset discussed above, and generateDonationTrainingSet, representing another, more complex dataset with a lot of sparse features (check the full list here). You invoke the appropriate method from the main.

The program then parses each line of the original training set and, based on the lists of numerical/categorical variable names, generates two files:

  • the corresponding training examples in VW format (this also takes care of missing values, which are assumed to be empty strings, though you can change that in the code). Feature names from the header are transformed into short names: F1, F2, … This keeps the training set file smaller (it does make a difference for huge datasets).
  • a small ".txt" file mapping the short signal names to the original signal names from the header (e.g. F0 corresponds to age).

Important note: as in the example in the previous section, the program separates the numerical and categorical features into two namespaces (named i and c respectively). You can also decide to put all the features in a single namespace (c.f. the last parameter of the tabular2VWGenerator method). For our previous example, a training example will look like this:

-1 |i F0:58 F5:2143 F11:261 F12:1 F13:-1 F14:0 |c F1=management F2=married F3=tertiary F4=no F6=yes F8=unknown F9=5 F10=may F15=unknown

Note that the program could easily be enhanced, e.g. to take multiple lists as input, each one representing a namespace, with the feature type encoded as a character in the list: one of the lists could look like {"age:n", "balance:n", "education:c"}, and the program would parse this, know that age is numerical and education is categorical, and encode them accordingly. Feel free to modify it!
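If you prefer to stay in Python rather than Java, here is a minimal sketch of the same idea for the bank dataset (this is not the Java tool described above; the input file name bank-full.csv is an assumption, and the short-name mapping and more careful missing-value handling are omitted for brevity):

import csv

# Minimal sketch: convert the bank CSV into the VW input format.
numerical = ["age", "balance", "duration", "campaign", "pdays", "previous"]
categorical = ["job", "marital", "education", "default", "housing", "loan",
               "contact", "day", "month", "poutcome"]

with open("bank-full.csv") as fin, open("train.vw", "w") as fout:
    reader = csv.DictReader(fin, delimiter=";")
    for row in reader:
        label = "1" if row["y"] == "yes" else "-1"
        num = " ".join(f"{k}:{row[k]}" for k in numerical if row[k] != "")
        cat = " ".join(f"{k}={row[k]}" for k in categorical if row[k] != "")
        fout.write(f"{label} |i {num} |c {cat}\n")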

The VW command line and its powerful options

Once you have your training set in the VW input format, you can start playing around with building some models from the command line. To illustrate this, we'll take the small dataset we mentioned before about predicting bank term deposits. You can find here the training set in the VW input format and its short-name signal mapping (which were created using the tool described in the previous section).

Let's start with a first command to train a logistic regression model:

vw train.vw -f model.vw --loss_function logistic

It is pretty much self-explanatory (-f specifies the filename of the output model and --loss_function specifies which loss function to use, logistic in our case).

The output will show you some useful information on the progress of the training, along with the final loss obtained (average loss = 0.253874 in that case).

Then, to actually use the model on a separate test set (more later on how to easily create one), you simply do:

vw test.vw -t -i model.vw -p preds.txt --link logistic

The -t option specifies that you're in test mode, so VW will ignore the labels of the examples. -i specifies the model to use (typically the one that was created by the previous training command). --link logistic says that the logistic (sigmoid) function is applied on top of the linear combination. Without it, the file preds.txt will contain only the result of \( \theta^Tx \) and not the sigmoid function applied on top of it.
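If you did forget --link logistic, you can always apply the sigmoid yourself afterwards. A tiny sketch (it only assumes that preds.txt contains one raw score per line, as produced by the command above):

import math

# Turn raw linear scores theta^T x (one per line of preds.txt) into probabilities.
with open("preds.txt") as f:
    probs = [1.0 / (1.0 + math.exp(-float(line.split()[0]))) for line in f if line.strip()]
print(probs[:5])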

Some options I found useful and interesting for the training part:

    • -c --passes N. This specifies to do N passes over the training set while learning the optimal weights. In deep learning, the term epoch is often used instead of pass, and basically represents a full pass over the whole training set to update the weights. Doing several passes often leads to stronger models, but the ideal number of passes can be tuned as a hyperparameter. Note that the -c option, which enables caching, is necessary when doing multiple passes because from the second pass on, VW uses pre-compiled information that it prepared/cached during the first pass.
    • -b N. The -b option allows you to control the number of bits of the hashing space (c.f. part 2 of this series to understand the hashing trick), setting its size to \(2^N\). The default value of N is 18, which might be more than enough (e.g. for the toy bank dataset) or not enough, depending on the cardinality of your feature values. If you need to encode features with a high cardinality, i.e. a lot of different values (e.g. a product id in a catalog of millions of products), or, more frequently, if you need to create interactions of features (i.e. the cartesian product of two features' values), which also often leads to high-cardinality features, then you'll probably need to increase N. Obviously, the higher it is, the fewer collisions you'll have in your hashing space, but the more memory you'll need.
    • --interactions arg. This is a very powerful one. Basically, arg is a list of letters, and each letter represents a namespace (assuming you organised your features around namespaces, as in our example in the previous section). Applying that option means that VW will automatically create interactions between all features in the corresponding namespaces. For instance, in our example above, adding --interactions ic will instantly create a whole bunch of new features in the model: all the interaction pairs between features in namespace i and features in namespace c. Note that in this case the option is equivalent to --quadratic ic, but the --interactions option is more general as it allows you to create not only quadratic interactions but higher-order ones as well (triplets, quadruplets, etc.). Such a feature somehow allows you to get closer to factorization machine models.

So here is an example of a training command using those parameters:

vw train.vw -c --passes 4 -f model.vw --loss_function logistic --interactions ci -b 26

Using VW from a python Jupyter Notebook

A lot of ML engineers/data scientists nowadays (including myself) use Jupyter notebooks to explore/play with/compare various models interactively right from the notebook, thanks to the huge ML ecosystem we have in Python (scikit-learn, keras, etc.). While the VW command line is nice, I still wanted to be able to play with it from a notebook, to easily control the train/test split, graph the results, switch between datasets, compare to other algos/libraries, etc.

There are some Python wrappers for VW (e.g. here), but they are either painful to install or slower. So I used a less clean yet very practical solution: calling VW as an external command from the notebook and loading the results of the training via the output file. See a sketch of a full example below. Feel free to run it in your own notebook, you'll only need to specify the right path and have a training set in the VW format. Here again I used the banking training set in VW format (re-sharing the link here) that was generated by the tool I presented previously.
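Here is a minimal sketch of that approach (a reconstruction rather than the original notebook: the input file name train_bank.vw, the 80/20 split and the use of scikit-learn for the AUC are illustrative assumptions, and vw is assumed to be on your PATH):

import random
import subprocess
from sklearn.metrics import roc_auc_score

# Shuffle the VW-format dataset and split it into train/test files.
with open("train_bank.vw") as f:
    lines = f.readlines()
random.seed(42)
random.shuffle(lines)
split = int(0.8 * len(lines))
with open("train.vw", "w") as f:
    f.writelines(lines[:split])
with open("test.vw", "w") as f:
    f.writelines(lines[split:])

# Train and predict by calling VW as an external command (same commands as above).
subprocess.run("vw train.vw -f model.vw --loss_function logistic", shell=True, check=True)
subprocess.run("vw test.vw -t -i model.vw -p preds.txt --link logistic", shell=True, check=True)

# Load the predicted probabilities and the true labels, then compute the AUC.
y_pred = [float(line.split()[0]) for line in open("preds.txt") if line.strip()]
y_true = [1 if line.split()[0] == "1" else 0 for line in open("test.vw") if line.strip()]
print("AUC:", roc_auc_score(y_true, y_pred))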


Once you're in the Python ecosystem, you can feel at home and use any of the libraries you're familiar with, e.g. calculating the AUC as we did above, or plotting the ROC curve (as a continuation of the previous notebook, see below). Bottom line: the sky (or maybe your Python skills) is the limit :).
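Again as a hedged sketch rather than the original notebook cells, plotting the ROC curve could look like this (it reuses the y_true/y_pred lists from the snippet above and assumes matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Plot the ROC curve from the labels and predicted probabilities computed above.
fpr, tpr, _ = roc_curve(y_true, y_pred)
plt.plot(fpr, tpr, label="VW logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="random")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()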

Note that an AUC of 0.91 on the test set is very respectable. Granted, we used a rather simple dataset here, mainly to illustrate the concepts more easily, but I've played with much less trivial datasets as well (hundreds of sparse features and hundreds of gigabytes of data), and in most cases VW eats them for breakfast and gives very strong results.

Auditing the weights of your model

When you want to debug your model and check whether something is wrong, VW offers a very nice auditing option: -a. It also lets you explore how VW represents the core info of your model behind the scenes. Let's use that option with the following command:

vw -d train.vw -f model.vw --loss_function logistic --link=logistic -p probs.txt -a > weights_details.txt

Note that the -a option does not work if you also have the --interactions option in the same command. If you open the weights_details.txt file, a typical line will look like this:

-2.546350

       i^F11:137987:380:-0.000686432@20293.9   c^F15=unknown:86264:1:-0.241524@0.414363        Constant:116060:1:-0.241524@0.414363    i^F12:217054:1:-0.241524@0.414363       i^F13:200603:-1:0.241524@0.414363       c^F10=may:104323:1:-0.241524@0.414363   c^F9=5:218926:1:-0.241524@0.414363      c^F8=unknown:86079:1:-0.241524@0.414363 c^F6=yes:6939:1:-0.220727@0.39844       i^F0:48942:42:-0.00340787@1112.67       c^F3=tertiary:235513:1:-0.121834@0.26117        c^F1=entrepreneur:69649:1:-0.10903@0.03325   i^F5:165402:2:-5.23114e-05@1.19111e+06  c^F4=yes:211075:1:0@0   c^F2=divorced:209622:1:0@0

In the auditing section of that page you'll find the details of each piece of this format, but let's analyze one piece of it together, e.g. c^F1=entrepreneur:69649:1:-0.10903@0.03325 :

  • c^ means that the signal is part of the c namespace (this is the categorical namespace, c.f. previous section)
  • F1=entrepreneur .  This is the actual feature value in the format we built, with F1 being the name of the feature (which corresponds to Job in our dataset, c.f. previous section)
  • 69649 is the actual index in the namespace c, i.e. the result of applying the hash function to the string "F1=entrepreneur". Note that we didn't use the -b option, so the default hashing space size of 2^18 = 262144 applies, and the weight of the feature F1=entrepreneur is thus stored at index 69649.
  • 1 is the value of the feature. For a numerical feature it will be a number, but for categorical values (like here) it is 1 by default.
  • -0.10903 is the actual weight of the feature
  • 0.03325 is the sum of squared gradients for that feature. This is used for the per-coordinate adaptive learning rate (see part 2 of this series for the intuition behind it).

You might now ask: where does the number -2.546350 at the beginning of the line come from? It is the weighted linear sum for that example, i.e. \(\theta^Tx \) (c.f. part 1 of this series). A bit tedious, but to convince yourself, you can reproduce the actual calculation from the example above:

380*-0.000686432 -0.241524 -0.241524 -0.241524 -1*0.241524 -0.241524 -0.241524 -0.241524 -0.220727 +42*-0.00340787 -0.121834 -0.10903 + 2*-0.0000523114

This gives the output -2.546. Now, this is not the actual final prediction. To get it, you just need to pass it through the logistic function, i.e. \( \frac{1}{1+e^{-\theta^Tx}} \) (again, c.f. part 1 of this series), and you obtain \( \frac{1}{1+e^{- (-2.546350)}} = 0.072672 \). You can find this number in the corresponding line of the probs.txt file (c.f. the -p option in the command line above). Btw, deciding whether 0.072672 should end up as a "yes" or "no" prediction depends on the threshold you picked (the optimal threshold could be chosen using the ROC curve above, c.f. this post I wrote some time ago for more details about the intuition behind this).
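You can quickly check this arithmetic in Python (the numbers are simply copied from the audit line above):

import math

# The value*weight products of the audit line above.
terms = [380 * -0.000686432, -0.241524, -0.241524, -0.241524, -1 * 0.241524,
         -0.241524, -0.241524, -0.241524, -0.220727, 42 * -0.00340787,
         -0.121834, -0.10903, 2 * -0.0000523114]
score = sum(terms)                     # ~ -2.5463, the raw theta^T x
prob = 1.0 / (1.0 + math.exp(-score))  # ~ 0.0727, after applying the sigmoid
print(score, prob)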

Explore the weights of your (hashed) signals

One of the first things I like to check after building a logistic regression model is the weight that each of the signals received. For a categorical feature, each value of the category gets its own weight. Note that with the hashing trick, this corresponds to the weights stored in the entries of the hash space. But knowing that e.g. entry 3235 of the hash space got a weight of 0.34 is not very useful. What would be useful is to be able to map this hashed entry back to an actual feature value of your dataset. Happily, VW makes that easy for you via another command line tool called vw-varinfo. Let's use it on the dataset of the previous section (here is the link to the VW version of it again). You can run, for instance, this command line:

vw-varinfo --loss_function logistic --link=logistic train.vw > weights_details.txt

This will output a file weights_details.txt whose first lines look like this:

FeatureName HashVal MinVal MaxVal Weight RelScore
c^F15=success 182344 0.00 1.00 +1.3503 100.00%
c^F8=cellular 52869 0.00 1.00 +0.1913 14.16%
c^F6=no 182486 0.00 1.00 +0.1777 13.16%
c^F4=no 88500 0.00 1.00 +0.1759 13.03%

This lists the weight of each feature, from the highest to the lowest. For instance, the feature that got the highest weight is c^F15=success, with a weight of ~1.35. F15 is the short name given by the dataset-creation tool presented in a previous section above. To know which feature it corresponds to, you can open the feature name mapping also created by the same tool (see the file featuresIndexes.txt in the zip file provided above). There you'll see that F15 corresponds to the feature poutcome. And as per the dataset description, poutcome corresponds to "outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')". So it makes sense that it would get a high weight. The second one is c^F8=cellular. Using the same process you can see that F8 corresponds to contact, which is described as "contact communication type (categorical: 'cellular','telephone')". Obviously, having the cellular phone number of the customer rather than his landline significantly increases the chances for the bank to contact him at all, so it makes sense as well that such a feature would get a higher weight.
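If you find the manual back-and-forth with the mapping file tedious, a small script can do the join for you. This is only a sketch: it assumes each line of featuresIndexes.txt holds the short name and the original name separated by whitespace (e.g. "F15 poutcome") and that weights_details.txt is exactly the vw-varinfo output shown above; adapt the parsing if your files differ:

# Map the short feature names in the vw-varinfo output back to their original names.
mapping = {}
with open("featuresIndexes.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]

with open("weights_details.txt") as f:
    next(f)  # skip the "FeatureName HashVal ..." header line
    for line in f:
        parts = line.split()
        if len(parts) < 6:
            continue
        feature, weight = parts[0], parts[4]          # e.g. "c^F15=success", "+1.3503"
        short = feature.split("^")[-1].split("=")[0]  # e.g. "F15"
        print(feature.replace(short, mapping.get(short, short)), weight)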

A very nice aspect of the vw-varinfo command is that it supports advanced options like --interactions. For example, you can run:

vw-varinfo -c --passes 4 --interactions ci -b 26 --loss_function logistic --link=logistic train.vw > weights_details.txt

In this case, you’ll be able to observe the weights of feature interactions, e.g.  c^F9=28*i^F14 .

Btw, to be able to give an intuitive interpretation of the weights created by the model, check again part 1 of this series ;-).

Conclusion

By now, if you made it through all the posts of this series, hopefully logistic regression doesn't hold many secrets for you anymore.

We've described the core theoretical foundation of the model and how to interpret the learned weights (in part 1), described techniques that make it work at scale in practice, like the hashing trick and the per-coordinate learning rate, and how it can all be implemented in 30 lines of code (in part 2), and, in this post, how to use a very powerful general purpose machine learning library (Vowpal Wabbit) to build state-of-the-art logistic regression models. We also introduced a simple helper tool to transform your standard tabular binary classification datasets into the Vowpal Wabbit format, to make using this powerful library even easier.

I hope you're now convinced of how simple yet powerful logistic regression is, and thus why it is so important to master it as part of the standard toolset of the modern data scientist/machine learning practitioner. See you in future posts!

2 thoughts on "Deep Dive Into Logistic Regression: Part 3"

  1. Hi Philippe,

    Thank you so much for a detailed series on logistic regression – VW combination. It is exactly what I was looking for.

    Can you please help me understand the following?
    “0.03325 is the is the sum of gradients squared for that feature. This is used for per coordinate adaptive learning rate (see part 2 of that series for the intuition behind it)”

    I did not see any sum of gradient squares in the 2nd part. The update to the per-coordinate learning rate is of the form (y_pred-y)*alpha/sqrt(count)+1. How does the square of the gradient fit into this scheme? Am I missing some detail here? Please explain.

    On another note, have you considered publishing this series on Medium.com? It would reach and be helpful to a much wider audience on that platform.

    Thank you, looking forward to your explanation.

    1. Hi Sameer. Happy that it helps.
      For the sum of gradients, I'm referring to that VW documentation page: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Audit (see ssgrad: the sum of squared gradients (plus 1) if adaptive updates are used).
      Regarding the intuition, I meant adaptive updates in general. Momentum itself does not use squared gradients specifically, but other adaptive methods do (you can read more about that in my other post http://www.philippeadjiman.com/blog/2018/11/03/visualising-sgd-with-momentum-adam-and-learning-rate-annealing/ or, even better, by going over the amazing post by Sebastian Ruder here: http://ruder.io/optimizing-gradient-descent/index.html ). Hope this helps.

      Regarding Medium, it is probably a good idea, but I just didn't get to it yet 🙂 as it requires quite a few conversions of the post into another format for it to look readable. Not a big deal, but I need to allocate some time for it 🙂 , thanks for suggesting!
