In the first post of this series, we set the theoretical foundation of logistic regression. In this post we’ll explore the different techniques that make it work at scale and we’ll even go over a remarkably simple yet massively effective and scalable implementation (in python) connecting all the theoretical and practical concepts in only 30 lines of code.

## Signals representation: the Hashing Trick

In the previous post, we presented a canonical way of representing the signals (a.k.a. features) of a given training example \( (x^{(i)} , y_i)\) . We said that \(x^{(i)} ∈ \{0, 1\}^d\) is a sparse binary feature vector in a d-dimensional space. How does it connect to reality?

Let’s say that you are trying to predict if a person will be willing to make a donation based on a list of previous donators (c.f. this dataset with which we’ll play in the next part of this series). You have a lot of signals around the previous donators like e.g. their location (State/Zipcode), their title (Mr. , Dr., Professor, etc), their socio-economic status and hundreds of others (c.f. the full list here).

Each signal has its own cardinality, e.g. the socio-economic status can take only three values (High/Average/Low), while the Zipcode can take up to ~43000 values (according to that page). The standard way to represent such categorical features for classification/regression problems is called one-hot encoding. For example, for the socio-economic status , you’d have a vector of three bits, one for each possible value. For “High” the encoding would give (1,0,0) , for “Average” (0,1,0) and for “Low” (0,0,1). For the US state you’d have 50 bits, and for the title you’d have 100 bits (their are 100 different titles in the fields description page).

For Zip Code? Hmm it could start to be overkilled/cumbersome to have a vector of ~43000 bits with only one 1 in it. What about a feature that is unbounded like e.g. an email? And when you connect all those “one-hot encoded” features together, you’re left with a pretty huge sparse binary vector of signals, potentially with much more entries (and thus weights to learn) than the number of training examples in your training set, which would be doomed to overfitting.

Meet the hashing trick.

The hashing trick was first introduced in that paper (by the creators of Vowpal Wabbit that we’ll cover extensively in our next blog post). Its principle is pretty simple: instead of having a huge sparse binary vector (of potentially unbounded size) for representing the signals, project each signal values into a much smaller fixed size binary vector. How? by hashing each signal values into that smaller vector. Let’s illustrate this on a simple toy example.

Suppose your training example has 3 features: Socio-economic level, US state, and Title, and that we’re trying to encode “social=High, State=WI (Wisconsin) and title=Mayor ” . With one hot encoding, it would look like this:

With the hashing trick, let’s suppose we pick a fixed size of 50 for the underlying vector, the idea would be to hash each feature and apply a modulo of the size of the underlying vector (here 50), and send it to the corresponding cell of the vector. On our example, the encoding of the signals would look like this:

Few remarks around this example and the hashing trick in general:

- In our example, we reduced the size of the vector by about a factor of 3 (from 153 to 50), but in real life examples it can go much beyond that and with much higher numbers (say from hundreds of of millions to tens of millions).
- This is mainly happening when you are doing feature interactions, i.e. when you take two features of very high cardinality and combine them together to have a new feature.
- Obviously, when reducing that much the feature space, collisions might (and will) happen (i.e. two different values hashing to the same bucket). Surprisingly, when the size of the feature space is still big enough, the impact of those collisions is negligible at best and acceptable at worst (some more details about it in that paper section 4.4, or in that very nice blog post).
- One could think there are obviously much smarter ways to reduce feature space or reducing/mitigating collisions, and there are, but most of them are involving keeping data structures and mappings in-memory, defeating the biggest added value of feature hashing which allows to process huge dataset with a tiny memory footprint (which is more or less the size of the hashed feature vector).
- You can do use some tricks to try minimizing collisions while still keeping the feature hashing trick, like e.g hashing features in different feature space (to e.g. make sure that two important values like for example two countries getting hashed to the same place and thus getting the same weight).
- Note that in Vowpal Wabbit, you can easily use one different hashed feature space (called namespace) for different subsets of the signals. Each namespace has its own hash index, and each signal belonging to it get hashed with a different hash function.
- Ideally, always treat the hashed feature space size as an hyper-parameter and tune/cross validate it by observing its impact on your loss.

## Per-coordinate Adaptive Learning Rate

In the first post of this series, we mentioned the notion of learning rate \( \alpha \) in the gradient descent step. This is controlling the rate at which we’re updating the weights of each parameter in the direction of the gradient. The lower is that rate, the slower will be the gradient descent (as it will only do baby steps in weights updates). The higher it is, the faster will be the gradient descent (as it will do big jumps in the direction of the gradient) but with a risk of missing the target and diverging.

One idea would be to adapt the learning rate along the way. Intuitively, we could have some high learning rate at the beginning of the gradient descend, and reduce it along the way while it is getting closer to its target (to not miss it). Even more than that, we could adapt the learning rate specifically for each specific parameter (weight) separately ! This is what per-parameter (or per-coordinate) adaptive learning rate is all about.

A lot of research was (and is still) done around that topic, specifically in deep learning, around different variation that would optimally adapt the learning rate per-coordinate during gradient descent (first popular proposed method is AdaGrad and some more recent tunings like RMSProp and Adam, more on them here or here ).

So, intuitively, the idea would be to make the learning rate smaller as a function of how much data was observed for the specific corresponding parameter. A simple example of such an idea (inspired from the code we’ll present in the next section and from that paper from google) would be to set the learning rate \( \alpha_c\) of coordinate \(c\) :

$$ \alpha_c = \frac{\alpha}{\sqrt{n_c}+1} $$

where \( \alpha \) is a general learning rate and \(n_c\) is simply a count of the number of times the coordinate \( c\) was observed in the data.

Mapping that to our example above, let’s assume a global learning rate \(\alpha = 0.1\) and that after 100k examples, the value “Title=Mayor” was observed only twice, while the value “Title=Mr” was obviously observed many more times (say 50k). Then the learning rate of the coordinate “Title=Mayor” will be \( 0.1/(\sqrt(2)+1) \approx 0.041 \) while the learning rate of coordinate “Title=Mr” will be \( 0.1/(\sqrt(50000)+1) \approx0.00045 \) .

observing a training example with the value “Title=Mayor” will obviously be much more rarely observed (than e.g. training examples with the value “Title=Mr”) and thus each time you observe it you want to do big jumps in the update of its weights (and for “Title=Mr”, it would be observed so often than the learning rate would decrease quickly). This is much more powerful than having a global learning rate for all learned parameters.

## Connecting it all in 30 lines of python

Probably the most popular optimization method in machine learning is (Stochastic) Gradient Descent (SGD). But we won’t cover the intuition/details behind it, there are already (too?) many ressources to learn that 🙂 (if you’re interested by the basics, maybe start here and follow some links there for more details).

SGD basically boils down to a simple formula updating weights, so it is not surprising that its implementation is compact. But here i want to illustrate how beautifully simple yet powerful it is to connect all the dots and obtain an highly scalable implementation of Stochastic Gradient Descent for Logistic regression able to learn millions of parameters using the hashing trick and per-coordinate adaptive learning rate with a tiny memory footprint, all that in a bunch of lines of code. To do so, i’ll use the very clear and elegant implementation by @tinrtgu that he shared as part of a kaggle competition. Here we expose only the core code that is useful for our purpose (and not the code that read/writes the training set and produce a prediction file etc..). For the full version of his code (which is not much longer and include more comments ), check it here.

### This code is from https://kaggle2.blob.core.windows.net/forum-message-attachments/53646/1539/fast_solution.py from math import exp, log, sqrt D = 2 ** 20 # number of weights use for learning alpha = .1 # learning rate for sgd optimization def logloss(p, y): p = max(min(p, 1. - 10e-12), 10e-12) return -log(p) if y == 1. else -log(1. - p) def get_x(csv_row, D): x = [0] # 0 is the index of the bias term for key, value in csv_row.items(): index = int(value + key[1:], 16) % D # weakest hash ever 😉 x.append(index) return x def get_p(x, w): wTx = 0. for i in x: # do wTx wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid def update_w(w, n, x, p, y): for i in x: w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) n[i] += 1. return w, n

Let’s go over it piece by piece.

D = 2 ** 20 # number of weights use for learning alpha = .1 # learning rate for sgd optimization

The variable **D** is the size of the feature vector we discussed above. Each cell of that (very sparse) vector will be either 0 or 1 for each training example (c.f. example in the first section). Note that \( 2^{20} \) is about 1 million, but the same code could easily work with say \( 2^{25} \) and learn 33+ millions of parameters. The only limit is your memory size and the size of your training set (you need to have much more training examples than parameters to learn otherwise you’re doomed to overfit ). The variable **alpha** is the global learning rate that we mentioned above.

def logloss(p, y): p = max(min(p, 1. - 10e-12), 10e-12) return -log(p) if y == 1. else -log(1. - p)

This is the log loss function (c.f. the beginning of Part 1 of this series) where **p** is the predicted value and **y** is the actual real value (in our case either 0 or 1) . Note the clipping of the predicted value **p** into the range \( [\epsilon , 1-\epsilon] \) with \( \epsilon = 10^{-12}\) . This is to avoid undefined values in case the prediction **p** would be either 0 or 1 (as it could lead to a loss of log(0) ).

def get_x(csv_row, D): x = [0] # 0 is the index of the bias term for key, value in csv_row.items(): index = int(value + key[1:], 16) % D # weakest hash ever 😉 x.append(index) return x

This function is the core of the hashing trick. Indeed, it transforms the original training example (variable **csv_row**) into its hashed version. The training example here is assumed to be a dictionary of key-value pairs. If we take our previous training example “social=High, State=WI and title=Mayor ” , then we would have **csv_row = {‘social’: ‘High’, ‘State’: ‘WI’, ‘title’: ‘Mayor’}** . Each key-value pair is then hashed. The hash function used here is very weak as @tinrtgu writes in his comment and is subject to an high collision rate, but it is easy to replace it by some stronger (collisions resistant) hash functions like e.g. murmurhash ( @tinrtgu goal here was to have a code not using even one external library).

Note that the function does not return a big vector of size D, but just the list of the indexes of the 1s in the vector (in our example at most 3 if no collisions). The is an efficient way to represent an highly sparse vector full of million(s) of 0s and having only a bunch of 1s.

def get_p(x, w): wTx = 0. for i in x: # do wTx wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid

Given a training example **x** (built with the previous function) and the current vector of weights **w** (of size D, see next function to see how it is updated), this function compute the prediction for this training example **x**.

If you remember from my previous post, the prediction in logistic regression is:

$$h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}} $$

\( \theta \) corresponds to the weight vector **w** and \( \theta^Tx \) is the linear combination calculated in the for loop. Note that it is of course not necessary to go compute the linear combination over the whole weights vector **w** but just for the weights of the corresponding features that are present in the training example. Note as well that in the feature hashing representation, the presence of a signal is simply represented by a 1 , and thus, the linear combination is nothing else than the sum of the weights (this was done to make the code simple to read, but technically, numerical features can also be represented and in which case it won’t a just a 0 or 1, but the actual numerical feature). In our running example (“social=High, State=WI and title=Mayor “) , the linear combination would simply correspond to the sum of the 3 weights associated with those 3 signal values.

The actual prediction then simply corresponds to plug that linear combination into the sigmoid function (see above formula). Note that in the code, the linear combination is clipped into the range [-20,20] to avoid numerical overflows (in computing \( e^{-\theta^Tx} \) ) and because sigmoid(-20) and sigmoid(20) are respectively so close from 0 and 1 that it doesn’t have any impact anyway.

def update_w(w, n, x, p, y): for i in x: w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) n[i] += 1. return w, n

Alright, now we reach the heart of the whole thing: the actual update of the weight vector **w** using the current training example **x**, the prediction on it (**p**) and the actual value **y** (the supervision in our training set, typically 0 or 1).

Again, if you remember from my previous post, the update rule for a given weight \( \theta_j \) is:

$$ \theta_j = \theta_j \thinspace – \alpha \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) – y_i )x_j^{(i)} $$

But this rule is defined for the whole training set in one shot (N is the number of training examples). In SGD, the weights gets updated for a bunch of examples rather than for the whole training set, and specifically in the implementation here, they get updated one training example at a time. In other words, given a training example **x** the update of weight \( \theta_j \) more corresponds to:

$$ \theta_j = \theta_j \thinspace – \alpha \thinspace (h_{\theta}(x ) – y_i ) x_j $$

Note that in our case, \(x_j \) is always 0 or 1 (depending if the corresponding signal is present or not in the training example), so when we update \(\theta_j \) , \(x_j \) will always be 1 and can thus be removed in the equation above.

Now, regarding the learning rate \( \alpha \) , we mentioned above that instead of a general global learning rate, we want a per-coordinate adaptive one, \( \alpha_c\) , for the coordinate (weight/signal) \(c\) . But replacing \( \alpha \) by the formula we mentioned previously, we obtain:

$$\theta_j = \theta_j – \alpha \frac{ h_{\theta}(x ) – y_i }{\sqrt{n_{\theta_j}}+1} $$

In our case, \( n_{\theta_j} \) is simply the number of times that this weight was updated so far (equivalently how many time its corresponding signal was observed in the training set) and this number is maintained in a simple array (the **n** variable in the code) of size D, with **n[i]** being that counter. And \( h_{\theta}(x ) \) is nothing else than the prediction (produced by the **get_p** function). So yes, the equation above is exactly the line of code performing the gradient update step:

w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)

This line of code is obviously the most important one, where all the magic happens. Beautiful isn’t it? 😀

Note that this implementation is what is commonly called “out-of-core”, meaning that it does not require to load the whole training set in memory, making it thus massively scalable.

I hope that at this stage you have a deep understanding of the theoretical foundation and practical aspects around logistic regression.

In the next post (yet to be written 🙂 ), we’ll go over a very popular and highly scalable library implementing all the tricks we saw above and more: Vowpal Wabbit. Stay tuned.

Beautiful pair of articles.

Kudos!

I will try to implement these things soon,

Do you know a reference for a full exercise\blog post containing use of the hashing method?

Thanks!

Thanks for the feedback Tomer! Glad you liked them 🙂 (and stay tuned for part 3 that will be out soon i hope).

About implementing the hashing trick, it is actually very simple, the code exposed in the post demonstrates it (i also gave a link to a more detailed page in the post about the orignal code by @tinrtgu , you have a bit more context and variations there). But again the core idea of the hashing trick is really simple (yet often very powerful in sparse settings).

Hope this helps.