Vowpal Wabbit (VW) is a general-purpose machine learning library which implements, among other things, logistic regression with the same ideas we presented in our previous post, like the hashing trick and per-coordinate adaptive learning rates (in fact, the hashing trick was made popular by that library). A big advantage of Vowpal Wabbit is that it is blazing fast. Not only because its underlying implementation is in C++, but also because it supports the L-BFGS optimization method. L-BFGS stands for “Limited-memory Broyden–Fletcher–Goldfarb–Shanno” and basically approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method using a limited amount of memory. This method is much more complex to implement than stochastic gradient descent (which can be implemented in a few lines of code, as we saw in our previous post), but supposedly converges faster (in fewer iterations). If you want to read more about L-BFGS and/or understand how it differs from other optimisation methods, you can check this (doc from Vowpal Wabbit) or this (nice blog post). Note that L-BFGS was empirically observed to be superior to SGD in many cases, in particular in deep learning settings (check out that paper on that topic).
Many times, I’ve heard people give up on Vowpal Wabbit because of its input format, even after going quickly over its documentation. So let’s try to present it through a toy (yet real) example that will be used throughout this post to illustrate the main concepts of Vowpal Wabbit. On top of it, I’ll provide a helper tool (in the next section) allowing you to transform your tabular dataset into the VW input format easily.
So, the dataset we’ll use can be found here and represents the attempt of a bank to predict if a marketing phone call will end up in a bank term deposit by the customer, based on a bunch of signals like socio-economic factors of the customer (“does he have a loan?”, etc.).
The traditional way to represent such datasets is to have a tsv or csv file, with the header being the names of the signals and each line representing the values of one training example for each signal. Each line of the training set thus has a fixed size, and missing values are just a blank cell or some specific value indicating that it’s missing. Typically, for that dataset, the header looks like this:
age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
With y being the actual supervision (i.e. did the call end up in a bank term deposit). And a typical training example looks like this:
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
In Vowpal Wabbit, there is no header, and each signal name is embedded in the training example itself. For example, the training example above can look like that in Vowpal Wabbit format:
-1 |i age:58 balance:2143 duration:261 campaign:1 pdays:-1 previous:0 |c job=management marital=married education=tertiary default=no housing=yes contact=unknown day=5 month=may poutcome=unknown
Let’s discuss multiple important things there:
--quadratic ic
would combine all the features of the namespaces i and c in our example above to create, on the fly, 2-way interaction features. For instance, the value of age and job together would be a new signal (maybe if you are a certain age in a certain profession, you’re more or less likely to make a bank term deposit).

Most often, classification or regression training sets come in the form of TSV or CSV files, as mentioned previously. Transforming them into the VW input format is not difficult, but it does require a minimum of attention. Indeed, depending on the training set, the target variable (or label) might be a word like “yes/no” or a number like “1/0”, while VW requires it to be -1/1. Also, depending on whether a signal is numerical or categorical, in VW you need to transform it into different things (using e.g. ‘:’ for numerical features, c.f. remarks in the previous section).
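To make that mapping concrete, here is a minimal Python sketch of such a conversion (the function name to_vw_line and the namespace letters i/c are illustrative choices mirroring the example above, not part of any library):

```python
def to_vw_line(label, numerical, categorical):
    # Map a "yes"/"no" (or 1/0) target to the -1/1 labels VW expects,
    # put numerical features in namespace 'i' (name:value) and
    # categorical ones in namespace 'c' (name=value).
    vw_label = 1 if str(label).lower() in ("yes", "1", "true") else -1
    i_part = " ".join(f"{k}:{v}" for k, v in numerical.items())
    c_part = " ".join(f"{k}={v}" for k, v in categorical.items())
    return f"{vw_label} |i {i_part} |c {c_part}"

to_vw_line("no", {"age": 58, "balance": 2143},
           {"job": "management", "marital": "married"})
# -> '-1 |i age:58 balance:2143 |c job=management marital=married'
```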
I wrote a very simple Java (8) class that does this, find it here and feel free to use it. You’ll just need to create (or edit an existing) method there to set up the characteristics of your data set. It doesn’t use any external library (other than pure Java 8 libraries), so if you don’t have a Java IDE already installed, you can easily edit it from a text editor and compile/run it from the command line.
Then you simply specify:
All those are specified as parameters inside a side method that you just invoke in the main (which then invokes the core method of that class called tabular2VWGenerator
). You have two examples of such side methods in the code: generateBankTraningSet
representing our dataset discussed above and generateDonationTrainingSet
representing another, more complex dataset with a lot of sparse features (check the full list here). You invoke the appropriate method from the main.
The program then parses each line of the original training set and, based on the list of numerical/categorical variable names, generates two files:
Important note: as in the example in the previous section, the program separates the numerical and categorical features into two namespaces (respectively named i and c). You can also decide to put all the features in the same namespace (c.f. the last parameter of the tabular2VWGenerator
method). For our previous example, a training example will look like this:
-1 |i F0:58 F5:2143 F11:261 F12:1 F13:-1 F14:0 |c F1=management F2=married F3=tertiary F4=no F6=yes F8=unknown F9=5 F10=may F15=unknown
Note that the program can easily be enhanced to e.g. support multiple lists as input, each one representing a namespace; in each list, you could represent the feature type as a character, e.g. one of the lists could look like {“age:n”, “balance:n”, “education:c”}, and the program would parse this, know that age is numerical and education is categorical, and encode them accordingly. Feel free to modify it!
Once you have your training set in the VW input format, you can start playing around with building some models from the command line. To illustrate it, we’ll take the small dataset we mentioned before about predicting bank term deposit. You can find here the training set in the VW input format and its short name signal mapping (which were created using the tool described in previous section).
Let’s start with a first command to train a logistic regression model:
vw train.vw -f model.vw --loss_function logistic
It is pretty much self-explanatory (-f specifies the filename of the output model and --loss_function specifies which loss function to use, logistic in our case).
The output will show you some useful information on the progress of the training, along with the final obtained loss (average loss = 0.253874 in that case).
Then, to actually use the model on a separate test set (more later on how to easily create one), you simply do:
vw test.vw -t -i model.vw -p preds.txt --link logistic
The -t
option specifies that you’re in test mode and VW will thus ignore the labels of the examples. -i
specifies the model to use (typically the one that was created by the previous training command). --link logistic
says that the logistic (sigmoid) function is applied on top of the linear combination. Without it, the file preds.txt will contain only the result of \( \theta^Tx \) and not the sigmoid function applied on top of it.
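If you ran the test command without --link logistic, you can recover the probabilities yourself by applying the sigmoid to the raw scores in preds.txt; a quick sketch:

```python
from math import exp

def sigmoid(z):
    # logistic function: maps a raw theta^T x score to a probability in (0, 1)
    return 1. / (1. + exp(-z))

sigmoid(0.0)  # 0.5: a zero linear score means a 50/50 prediction
```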
Some options I found useful and interesting for the training part:
-c --passes N
. This specifies doing N passes over the training set while learning the optimal weights. In deep learning, the term epoch is often used instead of pass, and basically represents a full pass over the whole training set to update the weights. Doing several passes often leads to stronger models, but the ideal number of passes can be tuned as a hyperparameter. Note that the -c option, which enables caching, is necessary when doing multiple passes because, from the second pass on, VW uses pre-compiled information that it prepared/cached during the first pass.
-b N
. The -b option allows you to control the number of bits in the hashing namespace (c.f. part 2 of this series to understand what the hashing trick is), setting its size to \(2^N\). The default value for N is 18, which might be more than ok (e.g. for the toy bank dataset) or not enough, depending on the cardinality of your feature values. If you need to encode features with a high cardinality, i.e. a lot of different values, like e.g. a product id in a catalog of millions of products, or, more frequently, if you need to create interactions of features (i.e. the cartesian product of two features’ values), which also often leads to high-cardinality features, then you’ll probably need to increase N. Obviously, the higher it is, the fewer collisions you’ll have in your namespace, but the more memory you’ll need.
--interactions arg
. This is a very powerful one. Basically, arg is a list of letters, and each letter represents a namespace (assuming you organised your features around namespaces, like e.g. in our example in the previous section). Applying that option means that VW will automatically create interactions between all features in the corresponding namespaces. For instance, in our example above, adding e.g. --interactions ic will instantly create a whole bunch of new features in the model: all the interaction pairs between features in namespace i and namespace c. Note that in this case the option is equivalent to --quadratic ic, but the --interactions option is more general as it allows you to create not only quadratic interactions but higher-order ones as well (triplets, quadruplets, etc.). Such a feature somehow allows you to get closer to factorization machine models.
So here is an example of a training command using those parameters:
vw train.vw -c --passes 4 -f model.vw --loss_function logistic --interactions ci -b 26
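To get an intuition for choosing N in -b N, you can estimate how many features would collide with the standard birthday approximation (this helper is just an illustration, not something VW provides):

```python
def expected_collisions(num_features, bits):
    # Expected number of features that collide when hashing
    # num_features distinct values uniformly into 2**bits buckets.
    m = 2 ** bits
    expected_distinct = m * (1 - (1 - 1 / m) ** num_features)
    return num_features - expected_distinct

# With 1M distinct feature values, 18 bits is far too small,
# while 26 bits keeps collisions comparatively rare.
small_space = expected_collisions(1_000_000, 18)
large_space = expected_collisions(1_000_000, 26)
```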
A lot of ML engineers/data scientists nowadays (including myself) use jupyter notebooks to explore/play with/compare various models interactively right from the notebook, thanks to the huge ML ecosystem we have in python (scikit-learn, keras, etc.). While the VW command line is nice, I still wanted to be able to play with it from a notebook, to easily control the train/test split, graph the results, switch between datasets, compare to other algos/libraries, etc.
There are some python wrappers for VW (e.g. here), but they are either painful to install or slow. So I used a less clean yet very practical solution: calling VW as an external command from the notebook and loading the results of the training via the output file. See a full example below. Feel free to run it in your own notebook; you’ll only need to specify the right path and have a training set in the VW format. Here again I used the banking training set in VW format (re-sharing the link here) that was generated by the tool I presented previously.
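The “call VW as an external command” approach can be as simple as a subprocess wrapper like this (a sketch; it assumes the vw binary is installed and on your PATH):

```python
import subprocess

def run_cmd(cmd):
    # Run an external command, returning (stdout, stderr).
    # Note: VW prints its progress/loss table to stderr.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout, result.stderr

# e.g., from a notebook cell:
# _, progress = run_cmd(["vw", "train.vw", "-f", "model.vw",
#                        "--loss_function", "logistic"])
```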
Once you’re in the python ecosystem, you can feel at home: use any of the libraries you’re familiar with, e.g. calculating the AUC as we did above, or plotting the ROC curve (as a continuation of the previous notebook, see below). Bottom line: the sky (or maybe your python skills) is the limit :).
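For instance, if you’d rather not pull in scikit-learn, the AUC itself is just the probability that a random positive outscores a random negative, which fits in a few lines:

```python
def auc(labels, scores):
    # Rank-based AUC: fraction of (positive, negative) pairs where the
    # positive example gets the higher score (ties count as half).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```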
Note that an AUC of 0.91 on the test set is very respectable. Well, we used a rather simple dataset here, mainly to illustrate the concepts more easily, but I’ve played with much less trivial datasets as well, having hundreds of sparse features and hundreds of gigabytes of data, and in most cases VW eats them for breakfast and gives very strong results.
When you want to debug your model and check if something is wrong, VW offers a very nice auditing option: -a
. It also allows you to explore how VW represents the core info of your model behind the scenes. Let’s use that option with the following command:
vw -d train.vw -f model.vw --loss_function logistic --link=logistic -p probs.txt -a > weights_details.txt
Note that the -a
option does not work if you also have the --interactions
option in the same command. If you open the weights_details.txt file, a typical line will look like this:
-2.546350
i^F11:137987:380:-0.000686432@20293.9 c^F15=unknown:86264:1:-0.241524@0.414363 Constant:116060:1:-0.241524@0.414363 i^F12:217054:1:-0.241524@0.414363 i^F13:200603:-1:0.241524@0.414363 c^F10=may:104323:1:-0.241524@0.414363 c^F9=5:218926:1:-0.241524@0.414363 c^F8=unknown:86079:1:-0.241524@0.414363 c^F6=yes:6939:1:-0.220727@0.39844 i^F0:48942:42:-0.00340787@1112.67 c^F3=tertiary:235513:1:-0.121834@0.26117 c^F1=entrepreneur:69649:1:-0.10903@0.03325 i^F5:165402:2:-5.23114e-05@1.19111e+06 c^F4=yes:211075:1:0@0 c^F2=divorced:209622:1:0@0
In the auditing section of that page, you have the details of each piece of this format, but let’s analyze one piece of it together, e.g. c^F1=entrepreneur:69649:1:-0.10903@0.03325 :
We didn’t use the -b
option, thus the default size of each namespace is \(2^{18}\), which is 262144, and thus the weight of the feature F1=entrepreneur is stored in that namespace at index 69649.

You now might ask: where does the number -2.546350 at the beginning of the line come from? It actually represents the linear sum of the weights for that example, i.e. \(\theta^Tx \) (c.f. part 1 of this series). A bit tedious, but to convince yourself, you can reproduce the actual calculation from the above example:
380*-0.000686432 -0.241524 -0.241524 -0.241524 -1*0.241524 -0.241524 -0.241524 -0.241524 -0.220727 +42*-0.00340787 -0.121834 -0.10903 + 2*-0.0000523114
This gives the output -2.546 . Now, this is not the actual final prediction. To get it, you just need to pass it through the logistic function, i.e. \( \frac{1}{1+e^{-\theta^Tx}} \) (again, c.f. part 1 of this series) , and you obtain \( \frac{1}{1+e^{- (-2.546350)}} = 0.072672 \) . You can find this number in the corresponding line in the probs.txt file (c.f. the -p
option in the command line above). Btw, deciding if 0.072672 should end up as a “yes” or “no” prediction depends on the threshold you picked (the optimal threshold could be picked using the ROC curve above; c.f. this post I wrote some time ago for more details about the intuition behind this).
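You can sanity-check that arithmetic in a couple of lines of Python (the values below are copied from the audit line above):

```python
from math import exp

# each term is feature_value * learned_weight from the audit line
terms = [380 * -0.000686432, -0.241524, -0.241524, -0.241524,
         -1 * 0.241524, -0.241524, -0.241524, -0.241524, -0.220727,
         42 * -0.00340787, -0.121834, -0.10903, 2 * -5.23114e-05]
z = sum(terms)               # the theta^T x score, ~ -2.546
prob = 1. / (1. + exp(-z))   # the final prediction, ~ 0.0727
```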
One of the first things I like to check after building a logistic regression model is the weight that each of the signals received. For a categorical feature, each value of the category gets its own weight. Note that with the hashing trick, this corresponds to the weights stored in an entry of the hash space. But knowing that e.g. entry 3235 of the hash space got a weight of 0.34 is not very useful. What would be useful is to be able to map this hashed entry to an actual real feature value of your dataset. Happily, VW makes that easy for you, via another command line tool called vw-varinfo
. Let’s use it on the dataset of the previous section (putting again the link of the VW version of it here). So you can run for instance this command line:
vw-varinfo --loss_function logistic --link=logistic train.vw > weights_details.txt
This will output a file weights_details.txt for which the first lines look like that:
FeatureName HashVal MinVal MaxVal Weight RelScore
c^F15=success 182344 0.00 1.00 +1.3503 100.00%
c^F8=cellular 52869 0.00 1.00 +0.1913 14.16%
c^F6=no 182486 0.00 1.00 +0.1777 13.16%
c^F4=no 88500 0.00 1.00 +0.1759 13.03%
This represents the weights of each feature, from the highest to the lowest. For instance, the feature that got the highest weight is c^F15=success, with a weight of ~1.35. F15 is the short name given by the dataset creator tool presented in a previous section above. To know which feature it corresponds to, you can open the feature name mapping also created by the same tool (see the file featuresIndexes.txt in the zip file provided above). There you’ll see that F15 corresponds to the feature poutcome. And as per the dataset description, poutcome corresponds to “outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)”. So it makes sense that it would get a high weight. The second one is c^F8=cellular. Using the same process, you can see that F8 corresponds to contact, which is described as “contact communication type (categorical: ‘cellular’,’telephone’)”. Obviously, having the cellular phone number of the customer rather than his landline significantly increases the chances for the bank to contact him at all, so it makes sense as well that such a feature would get a higher weight.
A very nice aspect of the vw-varinfo command is that it supports advanced options like e.g. --interactions
. I.e you can run this for example:
vw-varinfo -c --passes 4 --interactions ci -b 26 --loss_function logistic --link=logistic train.vw > weights_details.txt
In this case, you’ll be able to observe the weights of feature interactions, e.g. c^F9=28*i^F14 .
Btw, to be able to give an intuitive interpretation of the weights created by the model, check again part 1 of this series ;-).
By now, if you made it through all the posts of this series, hopefully logistic regression doesn’t hold many secrets for you anymore.
We’ve described the core theoretical foundation of the model and how to interpret the learned weights (in part 1), described techniques that make it work at scale in practice, like the hashing trick and the per-coordinate learning rate, and how it can all be implemented in 30 lines of code (in part 2), and, in this post, how to use a very powerful general-purpose machine learning library (Vowpal Wabbit) to build state-of-the-art logistic regression models. We also introduced a simple helper tool to transform your standard tabular binary classification datasets into the Vowpal Wabbit format, to be able to use this powerful library even more easily.
I hope you’re now convinced of how simple yet powerful logistic regression is, and thus why it is so important to master it as part of the standard toolset of the modern data scientist/machine learning practitioner. See you in future posts!
In the previous post, we presented a canonical way of representing the signals (a.k.a. features) of a given training example \( (x^{(i)} , y_i)\) . We said that \(x^{(i)} ∈ \{0, 1\}^d\) is a sparse binary feature vector in a d-dimensional space. How does it connect to reality?
Let’s say that you are trying to predict if a person will be willing to make a donation based on a list of previous donators (c.f. this dataset with which we’ll play in the next part of this series). You have a lot of signals around the previous donators like e.g. their location (State/Zipcode), their title (Mr. , Dr., Professor, etc), their socio-economic status and hundreds of others (c.f. the full list here).
Each signal has its own cardinality, e.g. the socio-economic status can take only three values (High/Average/Low), while the Zipcode can take up to ~43000 values (according to that page). The standard way to represent such categorical features for classification/regression problems is called one-hot encoding. For example, for the socio-economic status, you’d have a vector of three bits, one for each possible value. For “High” the encoding would give (1,0,0), for “Average” (0,1,0) and for “Low” (0,0,1). For the US state you’d have 50 bits, and for the title you’d have 100 bits (there are 100 different titles in the fields description page).
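As a tiny illustration, one-hot encoding a categorical value is just:

```python
def one_hot(value, categories):
    # One bit per possible category value; exactly one bit is set.
    return tuple(1 if c == value else 0 for c in categories)

one_hot("High", ["High", "Average", "Low"])     # (1, 0, 0)
one_hot("Average", ["High", "Average", "Low"])  # (0, 1, 0)
```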
For the Zip Code? Hmm, it could start to be overkill/cumbersome to have a vector of ~43000 bits with only one 1 in it. What about a feature that is unbounded, like e.g. an email address? And when you concatenate all those one-hot encoded features together, you’re left with a pretty huge sparse binary vector of signals, potentially with many more entries (and thus weights to learn) than the number of training examples in your training set, which would be doomed to overfitting.
Meet the hashing trick.
The hashing trick was first introduced in that paper (by the creators of Vowpal Wabbit, which we’ll cover extensively in our next blog post). Its principle is pretty simple: instead of having a huge sparse binary vector (of potentially unbounded size) for representing the signals, project each signal value into a much smaller fixed-size binary vector. How? By hashing each signal value into that smaller vector. Let’s illustrate this on a simple toy example.
Suppose your training example has 3 features: socio-economic level, US state, and title, and that we’re trying to encode “social=High, State=WI (Wisconsin) and title=Mayor”. With one-hot encoding, it would look like this:
With the hashing trick, let’s suppose we pick a fixed size of 50 for the underlying vector, the idea would be to hash each feature and apply a modulo of the size of the underlying vector (here 50), and send it to the corresponding cell of the vector. On our example, the encoding of the signals would look like this:
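A minimal sketch of that projection (md5 is used here only as a stand-in hash for illustration; real implementations use faster hashes like murmurhash):

```python
import hashlib

def hashed_indexes(features, size=50):
    # Hash each "name=value" signal into one of `size` buckets; the
    # resulting sparse vector has a 1 at each returned index.
    out = []
    for name, value in features.items():
        h = int(hashlib.md5(f"{name}={value}".encode()).hexdigest(), 16)
        out.append(h % size)
    return sorted(out)

hashed_indexes({"social": "High", "State": "WI", "title": "Mayor"})
```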
A few remarks around this example and the hashing trick in general:
In the first post of this series, we mentioned the notion of the learning rate \( \alpha \) in the gradient descent step. It controls the rate at which we update the weights of each parameter in the direction of the gradient. The lower that rate, the slower the gradient descent will be (as it will only take baby steps in the weight updates). The higher it is, the faster the gradient descent will be (as it will make big jumps in the direction of the gradient), but with a risk of missing the target and diverging.
One idea would be to adapt the learning rate along the way. Intuitively, we could have a high learning rate at the beginning of the gradient descent and reduce it along the way as it gets closer to its target (so as not to miss it). Even more than that, we could adapt the learning rate of each parameter (weight) separately! This is what per-parameter (or per-coordinate) adaptive learning rates are all about.
A lot of research was (and still is) done around that topic, specifically in deep learning, around different variations that optimally adapt the learning rate per coordinate during gradient descent (the first popular proposed method was AdaGrad, with more recent tunings like RMSProp and Adam; more on them here or here).
So, intuitively, the idea is to make the learning rate smaller as a function of how much data was observed for the specific corresponding parameter. A simple example of such an idea (inspired by the code we’ll present in the next section and by that paper from Google) would be to set the learning rate \( \alpha_c\) of coordinate \(c\) to:
$$ \alpha_c = \frac{\alpha}{\sqrt{n_c}+1} $$
where \( \alpha \) is a general learning rate and \(n_c\) is simply a count of the number of times the coordinate \( c\) was observed in the data.
Mapping that to our example above, let’s assume a global learning rate \(\alpha = 0.1\) and that after 100k examples, the value “Title=Mayor” was observed only twice, while the value “Title=Mr” was obviously observed many more times (say 50k). Then the learning rate of the coordinate “Title=Mayor” will be \( 0.1/(\sqrt{2}+1) \approx 0.041 \) while the learning rate of coordinate “Title=Mr” will be \( 0.1/(\sqrt{50000}+1) \approx 0.00045 \) .
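The two numbers above come straight out of the formula; in code:

```python
from math import sqrt

def per_coordinate_lr(alpha, n_c):
    # alpha_c = alpha / (sqrt(n_c) + 1): shrinks as coordinate c
    # accumulates observations
    return alpha / (sqrt(n_c) + 1)

per_coordinate_lr(0.1, 2)      # ~0.041 for the rare "Title=Mayor"
per_coordinate_lr(0.1, 50000)  # ~0.00045 for the frequent "Title=Mr"
```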
A training example with the value “Title=Mayor” will obviously be observed much more rarely than e.g. training examples with the value “Title=Mr”, and thus each time you observe it you want to make a big jump in the update of its weight (while “Title=Mr” would be observed so often that its learning rate decreases quickly). This is much more powerful than having a global learning rate for all learned parameters.
Probably the most popular optimization method in machine learning is (Stochastic) Gradient Descent (SGD). But we won’t cover the intuition/details behind it; there are already (too?) many resources to learn that (if you’re interested in the basics, maybe start here and follow some links there for more details).
SGD basically boils down to a simple weight-update formula, so it is not surprising that its implementation is compact. But here I want to illustrate how beautifully simple yet powerful it is to connect all the dots and obtain a highly scalable implementation of Stochastic Gradient Descent for logistic regression, able to learn millions of parameters using the hashing trick and per-coordinate adaptive learning rates with a tiny memory footprint, all that in a bunch of lines of code. To do so, I’ll use the very clear and elegant implementation by @tinrtgu that he shared as part of a kaggle competition. Here we expose only the core code that is useful for our purpose (and not the code that reads/writes the training set, produces a prediction file, etc.). For the full version of his code (which is not much longer and includes more comments), check it here.
### This code is from https://kaggle2.blob.core.windows.net/forum-message-attachments/53646/1539/fast_solution.py

from math import exp, log, sqrt

D = 2 ** 20  # number of weights use for learning
alpha = .1   # learning rate for sgd optimization

def logloss(p, y):
    p = max(min(p, 1. - 10e-12), 10e-12)
    return -log(p) if y == 1. else -log(1. - p)

def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # weakest hash ever
        x.append(index)
    return x

def get_p(x, w):
    wTx = 0.
    for i in x:  # do wTx
        wTx += w[i] * 1.  # w[i] * x[i], but if i in x we got x[i] = 1.
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid

def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    return w, n
Let’s go over it piece by piece.
D = 2 ** 20  # number of weights use for learning
alpha = .1   # learning rate for sgd optimization
The variable D is the size of the feature vector we discussed above. Each cell of that (very sparse) vector will be either 0 or 1 for each training example (c.f. the example in the first section). Note that \( 2^{20} \) is about 1 million, but the same code could easily work with, say, \( 2^{25} \) and learn 33+ million parameters. The only limit is your memory size and the size of your training set (you need to have many more training examples than parameters to learn, otherwise you’re doomed to overfit). The variable alpha is the global learning rate that we mentioned above.
def logloss(p, y):
    p = max(min(p, 1. - 10e-12), 10e-12)
    return -log(p) if y == 1. else -log(1. - p)
This is the log loss function (c.f. the beginning of part 1 of this series), where p is the predicted value and y is the actual real value (in our case either 0 or 1). Note the clipping of the predicted value p into the range \( [\epsilon , 1-\epsilon] \) with \( \epsilon = 10^{-11}\) (the 10e-12 literal in the code). This is to avoid undefined values in case the prediction p were exactly 0 or 1 (as it could lead to a loss of log(0)).
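A quick illustration of why that clipping matters (reusing the logloss function above verbatim):

```python
from math import log

def logloss(p, y):
    # Clip p away from 0 and 1 so the log never sees an exact 0.
    p = max(min(p, 1. - 10e-12), 10e-12)
    return -log(p) if y == 1. else -log(1. - p)

logloss(1.0, 0.)  # large but finite, instead of -log(0) = infinity
```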
def get_x(csv_row, D):
    x = [0]  # 0 is the index of the bias term
    for key, value in csv_row.items():
        index = int(value + key[1:], 16) % D  # weakest hash ever
        x.append(index)
    return x
This function is the core of the hashing trick. Indeed, it transforms the original training example (variable csv_row) into its hashed version. The training example here is assumed to be a dictionary of key-value pairs. If we take our previous training example “social=High, State=WI and title=Mayor”, then we would have csv_row = {‘social’: ‘High’, ‘State’: ‘WI’, ‘title’: ‘Mayor’}. Each key-value pair is then hashed. The hash function used here is very weak, as @tinrtgu writes in his comment, and is subject to a high collision rate, but it is easy to replace it with some stronger (collision-resistant) hash function like e.g. murmurhash (@tinrtgu’s goal here was to have code not using even one external library).
Note that the function does not return a big vector of size D, but just the list of the indexes of the 1s in the vector (in our example, the 3 feature indexes plus the bias term, if no collisions). This is an efficient way to represent a highly sparse vector full of millions of 0s and having only a bunch of 1s.
def get_p(x, w):
    wTx = 0.
    for i in x:  # do wTx
        wTx += w[i] * 1.  # w[i] * x[i], but if i in x we got x[i] = 1.
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid
Given a training example x (built with the previous function) and the current vector of weights w (of size D; see the next function to see how it is updated), this function computes the prediction for this training example x.
If you remember from my previous post, the prediction in logistic regression is:
$$h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}} $$
\( \theta \) corresponds to the weight vector w and \( \theta^Tx \) is the linear combination calculated in the for loop. Note that it is of course not necessary to compute the linear combination over the whole weight vector w, but just for the weights of the features that are present in the training example. Note as well that in the feature hashing representation, the presence of a signal is simply represented by a 1, and thus the linear combination is nothing else than the sum of the weights (this was done to make the code simple to read, but technically, numerical features can also be represented, in which case it won’t be just a 0 or 1 but the actual numerical value). In our running example (“social=High, State=WI and title=Mayor”), the linear combination would simply correspond to the sum of the 3 weights associated with those 3 signal values.
The actual prediction then simply corresponds to plugging that linear combination into the sigmoid function (see the formula above). Note that in the code, the linear combination is clipped into the range [-20, 20] to avoid numerical overflows (in computing \( e^{-\theta^Tx} \)) and because sigmoid(-20) and sigmoid(20) are respectively so close to 0 and 1 that it doesn’t have any impact anyway.
def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.
    return w, n
Alright, now we reach the heart of the whole thing: the actual update of the weight vector w using the current training example x, the prediction on it (p) and the actual value y (the supervision in our training set, typically 0 or 1).
Again, if you remember from my previous post, the update rule for a given weight \( \theta_j \) is:
$$ \theta_j = \theta_j - \alpha \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) - y_i )x_j^{(i)} $$
But this rule is defined for the whole training set in one shot (N is the number of training examples). In SGD, the weights get updated on a bunch of examples rather than on the whole training set, and specifically in the implementation here, they get updated one training example at a time. In other words, given a training example x, the update of weight \( \theta_j \) rather corresponds to:
$$ \theta_j = \theta_j - \alpha \thinspace (h_{\theta}(x) - y_i ) x_j $$
Note that in our case, \(x_j \) is always 0 or 1 (depending if the corresponding signal is present or not in the training example), so when we update \(\theta_j \) , \(x_j \) will always be 1 and can thus be removed in the equation above.
Now, regarding the learning rate \( \alpha \) , we mentioned above that instead of a general global learning rate, we want a per-coordinate adaptive one, \( \alpha_c\) , for the coordinate (weight/signal) \(c\) . But replacing \( \alpha \) by the formula we mentioned previously, we obtain:
$$\theta_j = \theta_j - \alpha \frac{ h_{\theta}(x) - y_i }{\sqrt{n_{\theta_j}}+1} $$
In our case, \( n_{\theta_j} \) is simply the number of times this weight was updated so far (equivalently, how many times its corresponding signal was observed in the training set), and this number is maintained in a simple array (the n variable in the code) of size D, with n[i] being that counter. And \( h_{\theta}(x) \) is nothing else than the prediction (produced by the get_p function). So yes, the equation above is exactly the line of code performing the gradient update step:
w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
This line of code is obviously the most important one, where all the magic happens. Beautiful isn’t it?
Note that this implementation is what is commonly called “out-of-core”: it does not require loading the whole training set into memory, making it massively scalable.
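The prediction, the per-coordinate counters and the update rule above fit together in a short out-of-core training loop. Below is a minimal, self-contained Python sketch of that loop (the array size `D`, the learning rate and the toy feature indices are illustrative placeholders; a real implementation hashes the raw signals into `[0, D)`):

```python
import math

D = 2 ** 20          # size of the weight/counter arrays (hashing trick)
alpha = 0.1          # base learning rate
w = [0.0] * D        # weights
n = [0.0] * D        # per-coordinate update counters

def get_p(x):
    """Predict p(y=1|x) for a list of hashed feature indices x."""
    wTx = sum(w[i] for i in x)                   # x_j is always 1 for present signals
    wTx = max(min(wTx, 20.0), -20.0)             # bound it to avoid overflow
    return 1.0 / (1.0 + math.exp(-wTx))          # sigmoid

def update(x, p, y):
    """One SGD step with a per-coordinate adaptive learning rate."""
    for i in x:
        w[i] -= (p - y) * alpha / (math.sqrt(n[i]) + 1.0)
        n[i] += 1.0

# one streaming pass over a toy set of (feature indices, label) examples
for x, y in [([123, 456], 1.0), ([123, 789], 0.0), ([456, 789], 1.0)]:
    p = get_p(x)
    update(x, p, y)
```

Each example is read, scored and used for an update, then discarded, which is exactly what makes the approach out-of-core.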
I hope that at this stage you have a deep understanding of the theoretical foundation and practical aspects around logistic regression.
In the next post (yet to be written), we’ll go over a very popular and highly scalable library implementing all the tricks we saw above and more: Vowpal Wabbit. Stay tuned.
In this post, we’ll deep dive into the theory behind logistic regression, giving the intuition behind its core concepts and its multiple faces across various fields of statistics and computer science. This will involve some maths, but nothing too deep assuming you have some notions of calculus and core statistics.
In the second part of this series, we’ll be much more concrete and deep dive into the implementation details of logistic regression, and go over some tricks like the hashing trick and the per-coordinate adaptive learning rate which make logistic regression work very well in practice on real (big) data sets. In that second post we’ll also go over a beautifully simple and elegant implementation of online logistic regression including all those tricks. In the third part of this series we’ll demonstrate the usage of a very powerful and popular library implementing logistic regression (and more) at scale: Vowpal Wabbit.
For now, let’s start with the theory
We’ll start by introducing a standalone description of logistic regression, similar to what you can find in any classical introduction to machine learning course (e.g. that one to cite the most popular of them all).
So you have a training set of N examples \(\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}\) where \(x^{(i)} ∈ \{0, 1\}^d\) is a sparse binary feature vector in a d-dimensional space, a.k.a. the signals or features of the \(i\)th training example (more on that signal representation later, especially in part 2 of this series), and \(y_i \in \{0,1\}\) is the label associated with that example (which could represent a click/non click, spam/not spam, malignant/benign, …):
$$\left\{
\begin{array}{ll}
x_1^{(1)},…,x_d^{(1)} & y_1 \\
… \\
x_1^{(N)},…,x_d^{(N)} & y_N
\end{array}
\right. $$
To make a prediction for a given signal vector \(x ∈ \{0, 1\}^d\) , the logistic regression model proposes to take a linear combination \( \theta^Tx \) where \(\theta\) is a vector of parameters (weights) \(\theta_1, … , \theta_d\) , and to project it into the \([0..1]\) range by applying the logistic (or sigmoid) function directly to that linear product, giving the following model representation:
$$h_{\theta}(x) = logistic( \theta^Tx ) = \frac{1}{1+e^{-\theta^Tx}} $$
The usual interpretation of \(h_{\theta}(x) \) is that it represents the estimated probability that \(y=1\) on input \(x\), in other words: \(h_{\theta}(x) = p(y=1|x)\) . Then, if you have to use that number to predict whether \(y = 1\) or \(y =0\), some threshold is picked, either simply 0.5 (i.e. predicting \(y = 1\) when \(h_{\theta}(x) >= 0.5\) and 0 otherwise) or any other threshold empirically chosen using the classifier’s ROC curve (cf. my older post for more details on that).
Note that logistic regression is a linear classifier given that its decision boundary is a linear combination of the input. Indeed, if your threshold is e.g. 0.5, then you have \(y = 1\) when \(h_{\theta}(x) >= 0.5\) . If you draw the sigmoid function \(g(z)\), you can see that it is >= 0.5 when \(z>=0\) . Thus \(h_{\theta}(x) >= 0.5\) when \(\theta^Tx >= 0 \) , which is a linear decision boundary.
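To make the model and the thresholding concrete, here is a tiny Python sketch (the weight and signal values are made up for illustration) computing \(h_{\theta}(x)\) and thresholding it at 0.5:

```python
import math

def h(theta, x):
    """Logistic regression hypothesis: p(y=1|x) = sigmoid(theta^T x)."""
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

theta = [-1.0, 2.0, 0.5]     # hypothetical learned weights (theta_0 is the intercept)
x = [1.0, 1.0, 0.0]          # x_0 = 1 for the intercept, then binary signals

p = h(theta, x)              # here theta^T x = 1.0 > 0, so p > 0.5
label = 1 if p >= 0.5 else 0
```

Note how the predicted label only depends on the sign of \(\theta^Tx\), which is the linear decision boundary discussed above.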
Let’s now talk about the cost function, which is the most important part when building a model given that it is what needs to be minimised on the training data to learn the optimal weight vector \( \theta\). Given the model representation, we cannot take a standard cost function based on MSE because it would be non-convex. All the power of logistic regression is in its cost function, which looks as follows:
$$
Cost(h_{\theta}(x),y) = \left\{
\begin{array}{ll}
-log(h_{\theta}(x)) & \textrm{if} \quad y =1 \\
-log(1-h_{\theta}(x)) & \textrm{if} \quad y =0
\end{array}
\right.
$$
The beauty behind that cost function is first that it is very intuitive, because when you predict 0 instead of 1 (or 1 instead of 0), your cost tends to infinity (and thus you penalize the learning algorithm by a very large cost), but most importantly, this cost function is convex (check here for a proof), thus allowing the use of any standard gradient descent based optimization algorithm.
Note that this function can be written \(-[y\thinspace log(h_{\theta}(x)) +(1-y) log(1-h_{\theta}(x))] \) (just replace \( y\) by 0 or 1 to be convinced). We’ll denote \(Cost(\theta)\) the average cost on the whole training set \(\{ (x^{(1)} , y_1) , …, (x^{(N)} , y_N) \}\) , which is defined as:
$$Cost(\theta) = -\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))] $$
This is also sometimes called the logarithmic loss. You can define a multi-class version of it (when your output can take more than 2 values) , see e.g. here or here for some intuitive explanations.
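For illustration, here is a small Python helper computing that average log loss (the clipping constant `eps` is my own numerical precaution, not part of the formula):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over a training set."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip predictions to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# two examples, both predicted with probability 0.9 on the correct class
loss = log_loss([1, 0], [0.9, 0.1])     # equals -log(0.9)
```

Confident correct predictions give a loss near 0, while confident wrong predictions blow the loss up, matching the intuition given above.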
So, bottom line, we need to find the optimal weight vector \( \theta \) by solving \(\underset{\theta}{min} \thinspace Cost(\theta) \) . To do so, gradient descent is the natural tool. We simply need to compute the partial derivative of \( Cost(\theta) \) according to each weight \( \theta_j \) of \( \theta \), i.e. \( \frac{\partial }{\partial \theta_j} Cost(\theta) \) . We won’t go into the details of the actual derivative calculation (you can find it e.g. here ) but just remember the notations: the \(i\)th training example \( x^{(i)} \) is a vector \((x_1^{(i)},…,x_d^{(i)}) \) , and \(\theta^Tx^{(i)} = \theta_0 + \theta_1 x_1^{(i)} + … + \theta_d x_d^{(i)}\) and thus, for instance, \( \frac{\partial }{\partial \theta_j} \theta^Tx^{(i)} = x_j^{(i)} \) . The result of the calculation of the partial derivative gives:
$$ \frac{\partial }{\partial \theta_j} Cost(\theta) = \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) - y_i )x_j^{(i)} $$
This gives us all that is needed to solve \(\underset{\theta}{min} \thinspace Cost(\theta) \) and find the optimal weight vector \( \theta \) from our training data. Indeed, assuming some learning rate \( \alpha \) , we simply have to iterate enough times over updating all the weights \( \theta_j \) of \( \theta \) using the gradient step below, until we observe that the cost is not decreasing anymore:
$$ \theta_j = \theta_j \thinspace - \alpha \sum\limits_{i=1}^{N} ( h_{\theta}(x^{(i)}) - y_i )x_j^{(i)} $$
At the end of your learning procedure via gradient descent as described above, you end up with an “optimal” weight vector \( \theta = (\theta_0,…, \theta_d) \) , with \( \theta_j \) the weight associated with the input signal \( x_j \) . In a simple linear regression model, the interpretation of that weight would be that if the corresponding signal \( x_j \) increases by one unit, then the predicted output increases by \( \theta_j \) units. In logistic regression it cannot really be interpreted that way given that we’re dealing with the sigmoid function and probabilities.
To understand how to interpret the learned weights in logistic regression, we first need to define and understand the notion of odds ratio. Let’s say that the probability of some event happening (e.g. a basketball team winning a game) is \(p=0.8 \) . The probability of them losing is \(1-p = 0.2 \) . The odds ratio is simply defined as the ratio between the probability of success and the probability of failure, \( \frac{p}{1-p} \) , i.e. 0.8 / 0.2 = 4 in our example. The interpretation is that the odds for the basketball team to win are 4 to 1.
How does that relate to logistic regression? To answer, you just need to know that the inverse function of the logistic function is the logit function . We thus have:
$$ logit(h_{\theta}(x)) = logit(logistic(\theta^Tx)) = \theta^Tx $$
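A quick numeric sanity check in Python that the logit really inverts the logistic function (the function names are mine):

```python
import math

def logistic(z):
    """Sigmoid: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: log odds of a probability p."""
    return math.log(p / (1.0 - p))

z = 0.7
round_trip = logit(logistic(z))   # recovers z (up to floating point error)
```

So applying logit to \(h_{\theta}(x)\) takes us straight back to the linear term \(\theta^Tx\).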
Let’s remind that \(h_{\theta}(x) \) represents the probability of the outcome being 1 (given a signal vector \(x \) ). Let’s denote that probability p. We thus have:
$$ logit(p) = \theta^Tx $$
Now the interesting part is that \(logit(p) = log(\frac{p}{1-p}) \) . Notice \(\frac{p}{1-p}\)? Yep, that’s the odds ratio defined above. In other words, logistic regression is a model relating the log odds of the outcome to a linear combination of the input signals:
$$log(\frac{p}{1-p}) = \theta_0 + \theta_1x_1 + … + \theta_dx_d $$
We can now interpret the meaning of a weight \( \theta_j \) : if the signal \( x_j \) increases by one unit (or if it is present, in the case of a boolean signal), then the log odds of the outcome increase by \( \theta_j \) . Even more interpretable, if you take the exponent of both sides in the expression above you get:
$$\frac{p}{1-p} = e^{ \theta^T x} = \prod\limits_{j=0}^{d}e^{ \theta_j x_j} $$
which gives a direct relation with the odds and thus an even simpler interpretation of the weight \( \theta_j \) : the value \( e^{\theta_j} \) directly gives you the multiplicative increase in the odds of the outcome when the signal \( x_j \) increases by one unit (or when it is present, in the case of a boolean signal). Example: if one of your signals is a boolean “already won an NBA final” for predicting the probability of a basketball team winning, and it gets a weight of, say, \(1.2\) , the interpretation is: if the team already won an NBA final, its odds of winning are multiplied by \( e^{1.2} \approx 3.32 \) , meaning an increase of 232% (i.e. \((3.32-1)*100\) ) in the odds of winning.
Bottom line: If a signal \( x_j \) ends up with a weight \( \theta_j \) in logistic regression, it means that if the signal increases by one unit (or just if it is equal to 1 in case of boolean signal), then it increases the odds of the outcome to be 1 (e.g. a click happening) by \( (e^{\theta_j} -1)*100 \)%.
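Sticking with the NBA-final example, this interpretation is a two-line computation in Python:

```python
import math

theta_j = 1.2                                # learned weight of the boolean signal
odds_multiplier = math.exp(theta_j)          # odds are multiplied by ~3.32
pct_increase = (odds_multiplier - 1) * 100   # ~232% increase in the odds
```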
The concept behind logistic regression is so remarkable and efficient that it arose from various different fields, including different branches of computer science and statistics, and often you stumble upon different ways of deriving it, including various names for the cost function or for what needs to be maximised or minimised, which might make the whole thing quite confusing. For instance, in NLP, logistic regression (more precisely the multi-class version of it) is often called Maximum Entropy (or MaxEnt), first defined in that paper. In this section, I’ll just recall the probabilistic view of logistic regression and connect the dots between cross-entropy, MLE, negative log likelihood, and logLoss.
First, entropy is a powerful concept invented by Claude Shannon, who basically set the ground for information theory (if you want to get the gist of it from scratch, check this very nice popularization video). Cross-entropy is often used as a way to measure the difference between two probability vectors in the context of multinomial classification (a generalisation of the binary classification problem we’re interested in), cf. e.g. that short video. The “binary” version of cross entropy (i.e. its particular case when you have only two output classes, like in our setting) is defined over the two vectors \( p = (y, 1-y)\) and \(q = (\hat{y} , 1-\hat{y}) \) where \( y\) is the observed true value and \( \hat{y} \) is the prediction:
$$ H(p,q) = -\sum_{i} p_i log (q_i) \\ = -ylog(\hat{y}) - (1-y)log(1-\hat{y})$$
This gives you a measure of “disorder” between the two vectors (the true one and the predicted one). In our case, \(\hat{y} = h_{\theta}(x) \) , so the average cross entropy on the whole training set is:
$$ -\frac{1}{N}\sum\limits_{i=1}^{N} [y_i\thinspace log(h_{\theta}(x^{(i)})) +(1-y_i) log(1-h_{\theta}(x^{(i)}))] $$
Wait, did you notice? This is exactly the log loss cost function we had in the first section!!
And there is more.
Let’s move to another very popular concept in machine learning called Maximum Likelihood Estimation (MLE). MLE is a simple yet very powerful tool to estimate a (set of) parameter(s) based on observed data (if you have never heard about it and need an explanation “for dummies” then you can check this video for the high level idea and that one for a specific example). When you want to use MLE, the first step is to write down the probability of observing the data (in our case the \( y_1, …, y_N \) ) given the input signals \( x^{(1)}, …, x^{(N)} \) and the vector of parameters \( \theta \) :
$$ Pr(y_1, …, y_N |x^{(1)}, …, x^{(N)} , \theta) = \prod\limits_{i=1}^{N}Pr(y_i| x^{(i)}, \theta) $$
Given that in our case \( y_i \) is either 0 or 1, a common trick is to write that:
$$Pr(y_i| x^{(i)}, \theta) = \\ Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i} $$
The actual likelihood function conventionally swaps the roles in the notation, to make clear that we are looking for an optimal \( \theta \) given the fixed observations of the training set:
$$ L(\theta , x^{(1)}, …, x^{(N)} | y_1, …, y_N ) = \\ \prod\limits_{i=1}^{N} Pr(y_i=1 |x^{(i)}, \theta)^{y_i} \thinspace Pr(y_i=0 |x^{(i)}, \theta)^{1-y_i} $$
Note that the same form could have been obtained without the previous trick by simply noticing that in the case of binary classification, the proper likelihood function is Bernoulli. Now, we denote \(Pr(y_i=1 |x^{(i)}, \theta) \) as \(h_{\theta}(x^{(i)})\) (exact same notation as in the first section). We’ll also denote \( L(\theta) \) the likelihood function for convenience. MLE thus suggests we find the \( \theta \) maximizing that likelihood function (hence the name maximum likelihood), in other words:
$$ \underset{\theta}{\arg\max} L(\theta) = \underset{\theta}{\arg\max} \prod\limits_{i=1}^{N}h_{\theta}(x^{(i)}) ^{y_i} \thinspace (1-h_{\theta}(x^{(i)}))^{1-y_i} $$
Since the next step is always to take a derivative of the likelihood, you almost always take the log of the likelihood first: it transforms the product into a sum (on which it is much easier to apply derivatives), and since the logarithm is monotonic (strictly increasing), maximizing the log likelihood is equivalent to maximizing the likelihood, as well as to minimizing the negative log likelihood. So applying a log on the above product gives:
$$ \underset{\theta}{\arg\max} \thinspace log \thinspace L(\theta) = \\ \underset{\theta}{\arg\max} \sum\limits_{i=1}^{N} y_i log(h_{\theta}(x^{(i)})) + (1-y_i)log(1-h_{\theta}(x^{(i)})) $$
Instead of looking for the maximum of the log likelihood, you can equivalently look for the minimum of the negative log likelihood. If you take the average negative log likelihood on the training set, what do you obtain? You guessed it: once again, the exact same log loss cost function we found both in the first section and via cross entropy!!!
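A quick numerical way to convince yourself of that equivalence: the Python sketch below (helper names are mine) computes the average negative log likelihood and the average log loss on the same predictions and gets the exact same number:

```python
import math

def avg_neg_log_likelihood(y_true, y_pred):
    """Average negative log of the Bernoulli likelihood p^y * (1-p)^(1-y)."""
    nll = 0.0
    for y, p in zip(y_true, y_pred):
        likelihood = (p ** y) * ((1 - p) ** (1 - y))
        nll += -math.log(likelihood)
    return nll / len(y_true)

def avg_log_loss(y_true, y_pred):
    """Average log loss, as written in the first section."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

y_true, y_pred = [1, 0, 1], [0.8, 0.3, 0.6]
# the two expressions are algebraically identical
```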
As a final link between logistic regression and other well known concepts in ML or statistics, logistic regression is often compared with Naive Bayes, see here (wikipedia), here (more detailed book chapter) and here (high level Quora answer). But the point is that Naive Bayes can be seen as a generative version of logistic regression (which is a discriminative model; here is a nice Quora discussion if you want to understand the difference between generative and discriminative models).
Bottom line: in the context of logistic regression, when you’ll hear about log loss or cross entropy or negative log likelihood, you’ll now know why and how they are so closely related.
I hope you enjoyed this post. If you want the details that make this work at scale, and to actually see an implementation connecting it all in 30 lines of Python, continue to part 2 of this series.
Kaggle offered this year a knowledge competition called “Titanic: Machine Learning from Disaster”, exposing a popular “toy-yet-interesting” data set around the Titanic. The goal is to predict as accurately as possible the survival of the Titanic’s passengers based on their characteristics (age, sex, ticket fare, etc.).
In that post, we’ll use that data set in order to:
The whole code creating all the plots/stats and models exposed in this post, and also building an output reaching a score of 0.79426 on the leaderboard, can be found on github here or on Rpubs here (built with Knit HTML from R studio).
Preliminaries
First, download the test and training set from the data page of the competition (here is a zip of the two small files in case the page from kaggle is removed in the future).
Once you loaded the dataset into a data frame, you can do some data analysis/explorations. Even though that part is critical to start playing and feeling the data, I won’t go into details because there already were blog posts written about that, in particular that one is a very nice R version of the getting started with excel data exploration tutorial on Kaggle’s website.
However, I’ll just illustrate a nice, simple and effective way of observing one important aspect of the data: missing values.
The Amelia R package is a toolbox around missing values, in particular for performing imputation of the missing data. Getting a visual and global insight about missing data in the test and train set is as simple as that:
library(Amelia)
#... code for loading test and train data in a data frame
missmap(rawdata, main = "Missingness Map Train")
missmap(test, main = "Missingness Map Test")
From those maps, you can immediately observe that only the age feature is badly suffering from missing data. Considering how small the training set is, you can hardly just ignore records having a missing age. We’ll see later in the post what kind of strategy we can use to deal with that issue.
The caret package is a kind of toolbox homogenising the many existing R packages for classification and regression, and it also provides, out of the box, a standard way to perform common tasks like model parameter tuning and more. Also, the author (Max Kuhn) did an amazing job documenting the package in the vignettes (here or here for a longer but older version) and on the package’s dedicated website.
Here is a snippet of code where I successively train a random forest and a gradient boosting machine (GBM) using the same train function from caret.
forest.model1 <- train(survived ~ pclass + sex + title + sibsp + parch,
                       data.train, importance = TRUE)

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

gbm.model2 <- train(survived ~ pclass + sex + title + sibsp + parch,
                    data.train,
                    distribution = "gaussian",
                    method = "gbm",
                    trControl = fitControl,
                    verbose = FALSE)
We’ll discuss the features used in the formula later, but note the fitControl parameter which is passed in the call for training the GBM. This parameter allows you to completely define the way the model parameters will be tuned. In this example, the model parameters of the GBM (namely interaction.depth, n.trees and shrinkage, see output below) were compared using a repeated 10-fold cross validation with accuracy being the metric for comparison, but everything is tuneable for that purpose (you can even pass a grid of specific values to compare for each model parameter).
712 samples
 13 predictors
  2 classes: 'yes', 'no'

No pre-processing
Resampling: Cross-Validation (10 fold, repeated 10 times)

Summary of sample sizes: 642, 640, 642, 641, 640, 640, ...

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy  Kappa  Accuracy SD  Kappa SD
  1                   50      0.8       0.565  0.0436       0.0964
  1                  100      0.801     0.567  0.0436       0.0965
  1                  150      0.801     0.568  0.0434       0.096
  2                   50      0.795     0.548  0.0426       0.097
  2                  100      0.801     0.559  0.0437       0.0999
  2                  150      0.804     0.565  0.0435       0.1
  3                   50      0.805     0.568  0.0449       0.102
  3                  100      0.807     0.573  0.0464       0.106
  3                  150      0.809     0.576  0.0442       0.1

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were interaction.depth = 3, n.trees = 150
 and shrinkage = 0.1.
Also, you can easily visualize variable importance (you need to specify importance=TRUE in the train function, as we did, for having it):
You can observe that the variable value with the most importance is the title “Mr”. The interesting part is that the feature “title” was not initially in the data set and was artificially created (we’ll detail it a bit more later in the post). But overall, caret offers a very nice framework for easy model comparison and tuning, with proper/uniform built-in cross-validation routines.
One thing though that is so true and said in perfect way in this must-watch killer talk: “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. We’ll see an example illustrating that later in the post.
Pick the best threshold for your classifier using ROC curves
Most classifiers usually output the probability of an example belonging to a specific class (here ‘survived’ or ‘died’). When the only matter is to optimise accuracy (as it is usually the case in competitions), it is useful to pick the optimal threshold/cutoff for assigning one class or the other.
ROC curves can be used for that, and also to assess the robustness of your model. If you’ve never heard about ROC curves, this article gives the basic intuition and that paper goes much more into details while still being crystal clear (I warmly recommend the latter if you’re interested in the subject). For a standalone, very clear example in R, this post is what you need (the code below is inspired from it).
The pROC package allows to easily analyse and display ROC curves. Here, we’re interested in the threshold corresponding to the top left corner of the curve maximising sensitivity and specificity.
# code inspired from http://mkseo.pe.kr/stats/?p=790
result.predicted.prob.model1 <- predict(forest.model1, data.test, type = "prob")
result.roc.model1 <- roc(data.test$survived, result.predicted.prob.model1$yes)
plot(result.roc.model1, print.thres = "best",
     print.thres.best.method = "closest.topleft")

result.coords.model1 <- coords(result.roc.model1, "best",
                               best.method = "closest.topleft",
                               ret = c("threshold", "accuracy"))
result.coords.model1
Which will output both a graph:
and high level information about the curve, e.g. :
Call:
roc.default(response = data.test$survived, predictor = result.predicted.prob.model1$yes)

Data: result.predicted.prob.model1$yes in 78 controls (data.test$survived yes) > 65 cases (data.test$survived no).
Area under the curve: 0.931
Note in particular the Area under the curve (a.k.a AUC) data point which is sometimes used to assess the robustness/quality of your model, although it has been questioned a lot and often criticised to not be a precise/useful classification performance measure (a small discussion around it can be found here). In other words, you’re often better off relying on your K-fold cross validation measures to assess your out-of-sample performance (c.f. the previous section on caret).
Tweak and tricks
I’ve hinted earlier that the number of missing ages was too high, and the training set too small, to just ignore the records having a missing age. At least for me, any attempt to impute the missing ages (either in naive or more sophisticated ways) didn’t lead to any significant accuracy improvement on the 10-fold cross-validation test.
It turns out that extracting the title (i.e. Mr., Mrs., etc.) from the Name attribute of the data set did lead to an improvement (from the competition’s forums, I saw that a few people used that feature as well). Let’s have a look at the age distributions per extracted title in the training set (some rare occurrences of titles were aggregated into larger titles, e.g. “Capt”, “Col”, “Major”, “Sir”, “Don”, “Dr” were mapped to “Mr”):
This somehow matches the intuition (though I didn’t know that in apparently old/traditional English, “Master” denotes a young/unmarried man). It also makes sense intuitively that the title is a good proxy for the too-numerous missing ages, allowing us to ignore the age feature entirely and thus keep all the data in the training set, without introducing any potential noise with an imputation method.
When I plugged this new title feature into the random forest, I saw an improvement from 0.785 to 0.801 in my 10-fold cross-validation out-of-sample accuracy estimate, and it was reflected in my submission on the public leaderboard, where I jumped to the top 5% of submissions at that time.
Note that an improvement on your cross-validation is not always reflected on the leaderboard, sometimes even the opposite (cf. “Lesson One” from this very cool blog post by @rouli, highly recommended). Note also that this particular competition lasts one year and was just for learning purposes, so there are thousands and thousands of participants, including more than a few people who obviously spent useless time extracting the answers from publicly available lists (e.g. here or here) to get a near-perfect score (though you could use those lists to estimate your real final score on the private leaderboard if you can’t wait for the end of the competition, it remains kind of pointless). Finally, more things can be done to try to improve the accuracy even further, an obvious one being to combine multiple models together (majority vote is often used in binary/multi-class settings), but we won’t cover that in this post.
Conclusion
We explored on a comprehensive example how R can be used to build and tune quickly robust predictive models which are significantly outperforming the baseline. Of course, it is somehow a toy example but it was interesting enough to explore some important aspects needed when building predictive models. For much bigger data sets (both in terms of training set size and/or number of features in the data) you might need to introduce different/additional technical and theoretical tools that we might explore in future posts.
Also, note that a competition setting might be very different from a real production setting. Not only thinking about why Netflix never implemented the model that won the $1M challenge, but also about the whole infrastructure that you’d need to build in order to do big data science at scale on many different problems (Scala is quickly becoming a trend around that, check those killer slides and talk by my friend @BigDataSc from LinkedIn and @ccservers from eBay for more on that).
I’ll conclude by citing again this awesome sentence from this must-watch talk by @nmkridler : “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. I really think that this is what data science is all about.
References / Useful Links
You would like to be able, for instance, to see which tokens have the biggest TF-IDF weights in any given document of the collection.
Lucene and Mahout can help you to do that almost in a snap.
If you don’t know how to build a Lucene index, check the links at the end of the post.
The two only important things in that step are to have in your index a field that can serve as a document id and to enable term vectors on the text field representing the content of your documents.
So your indexing code should contain at least two lines similar to:
doc.add(new Field("documentId", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED,TermVector.YES));
That step is well described here. It also explains how to generate the vectors from a directory of text documents. I used lucene because my documents were in a data store and building the lucene index out of it was just much more flexible and convenient.
You then should end up executing a command similar to:
./mahout lucene.vector --dir "myLuceneIndexDirectory" --output "outputVectorPathAndFilename" --dictOut "outputDictionaryPathAndFilename" -f content -i documentId -w TFIDF
Mahout will generate for you:
Now, let’s say that you want for a given document id, to see what are the tokens that received the biggest weights in order to feel what are the most significant tokens of that document (as the weighting scheme sees it).
To do so, you can for instance easily load the content of the generated dictionary file into a Map with token index as keys and the tokens as values. Let’s call that map dictionaryMap.
Then you’ll have to walk through the generated binary file containing the vectors. By playing a little with the sequence file and the Mahout source code, you quickly figure out the important objects you have to manipulate in order to access the vectors’ content in a structured way:
// imports needed by the snippet (Hadoop and Mahout math classes)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String vectorsPath = args[1];
Path path = new Path(vectorsPath);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
LongWritable key = new LongWritable();
VectorWritable value = new VectorWritable();
while (reader.next(key, value)) {
NamedVector namedVector = (NamedVector)value.get();
RandomAccessSparseVector vect = (RandomAccessSparseVector)namedVector.getDelegate();
for( Element e : vect ){
System.out.println("Token: "+dictionaryMap.get(e.index())+", TF-IDF weight: "+e.get()) ;
}
}
reader.close();
The important things to get in that code are the following:
From there you’ll be able easily to plug your code to do whatever you want with the tokens and their weights, like printing the token having the biggest weights in a given document.
It can be insightful for tuning your weighting model. E.g. you can quickly observe that typing errors often get a super high weight, which makes sense in the TF-IDF weighting scheme (unless the typo is very frequent in your document collection), and thus you might want to fix that.
It is also useful just to understand a little more about how Mahout represents the data internally.
Useful links:
I wrote a small program on top of this prototype to run an experiment: each hour, the program crawls the new list of hot queries from Google Hot Trends, then runs the prototype on each of those queries and keeps the hottest link found on Twitter for the corresponding hot query. I wanted to see which websites were most cited in those tweets about hot trends.
So I let the program run for a week, collected the links (more than a thousand), expanded them all into their long URL versions (using an improved version of my Java universal URL expander), extracted the domain names and compiled the whole thing into a top-10 list of the most cited websites. Here it is (click to enlarge):
I was surprised to see some websites that I’ve never heard about before (like wpparty.com or actionnewsblast.com).
To get a better idea of which kinds of hot queries/topics those websites are most cited for on Twitter, find below, for each of those top websites, a sample of 5 Google Hot Trends queries they covered last week.
Few remarks:
To finish, find below a Picasa collage that I built using the most cited twitpic pictures on Twitter for this past week of hot trends (not only the 5 cited in the table). You’ll easily identify some sarcastic pictures from before the iPad announcement or pics around the election of Miss USA. Click the picture to enlarge.
If you’re curious to map some of the pictures to their related hot topics, click the collage to enlarge it and try to guess which pics correspond to which Google hot query below :).
miss america 2010 winner, what celeb do i look like, miss america 2010, roberta flack, lady gaga and elton john, addicted to love, jim florentine, apple itablet, lost season 6 premiere, candy crowley, to make you feel my love, swagger crew, footprints in the sand, gasparilla, miss virginia, duke georgetown, celebrity look alike, katherine putnam, itablet, andrea bocelli, monster diesel, peta ad.
Indeed, to give you the intuition of why a combiner helps reduce the amount of data sent to the reducers, imagine the word count example on a text containing one million occurrences of the word “the”. Without a combiner, the mapper will send one million key/value pairs of the form <the,1>. With combiners, it will potentially send far fewer key/value pairs, of the form <the,N> with N a number potentially much bigger than 1. That’s just the intuition (see the references at the end of the post for more details).
Simply speaking, a combiner can be seen as a “mini reducer” that is potentially applied several times, still during the map phase, before sending the new (hopefully reduced) set of key/value pairs to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of hadoop 0.20).
In general you can even use the same reduce method as both your reducer and your combiner. This is the case for the word count example, where using a combiner amounts to adding a single line of code in your main method:
conf.setCombinerClass(Reduce.class);
where conf is your JobConf, or, if you use hadoop 0.20.1:
job.setCombinerClass(Reduce.class);
where job is your Job built with a customized Configuration.
That sounds pretty simple and useful, and at first glance you would be ready to use combiners all the time by adding this simple line, but there is a small catch. The first kind of reducer that naturally comes up as a counter-example is the “mean reducer”, which computes the mean of all the values associated with a given key.
Indeed, suppose 5 key/value pairs emitted from the mapper for a given key k: <k,40>, <k,30>, <k,20>, <k,2>, <k,8>. Without a combiner, the reducer receives the list <k,{40,30,20,2,8}> and outputs a mean of 20. But if a combiner were first applied to the two sets (<k,40>, <k,30>, <k,20>) and (<k,2>, <k,8>) separately, the reducer would receive the list <k,{30,5}> and output a different value (17.5), which is unexpected behavior.
More generally, combiners can be used when the function you want to apply is both commutative and associative (it is pretty intuitive to see why). That’s the case for the addition function, which is why the word count example can benefit from combiners, but not for the mean function (which is not associative, as shown in the counter-example above).
Note that for the mean function there is a workaround that lets you use combiners: use two separate reduce methods, a first one that acts as an addition function (and thus can be set as the combiner), emitting the intermediate sum along with the number of values it covers, and a second reduce method that computes the mean by taking that count into account (see the references for more details on that).
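To make that workaround concrete, here is a minimal plain-java sketch (no hadoop dependency; the class and method names are mine, for illustration only) showing why carrying a (sum, count) pair keeps the combine step associative, while combining intermediate means does not:

```java
// Plain-java sketch of the combiner-safe mean: carry (sum, count) pairs
// instead of intermediate means. Names are illustrative, not hadoop API.
public class CombinerMeanDemo {

    // Combine step: merging two (sum, count) pairs is associative
    // and commutative, so it is safe to run it any number of times.
    static long[] combine(long[] a, long[] b) {
        return new long[] { a[0] + b[0], a[1] + b[1] };
    }

    // Final reduce step: a single division at the very end.
    static double mean(long[] sumCount) {
        return (double) sumCount[0] / sumCount[1];
    }

    public static void main(String[] args) {
        // The 5 values from the example above: 40, 30, 20, 2, 8.
        long[] firstSplit = combine(
                combine(new long[]{40, 1}, new long[]{30, 1}),
                new long[]{20, 1});                       // {90, 3}
        long[] secondSplit =
                combine(new long[]{2, 1}, new long[]{8, 1}); // {10, 2}
        // 100 / 5 = 20.0: the correct mean, whatever the grouping was.
        System.out.println(mean(combine(firstSplit, secondSplit)));
    }
}
```

Whatever way the pairs are grouped by intermediate combine steps, the final division always sees the total sum and the total count.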
As usual in this series, let’s observe the lesson learned in action using our learning playground. For that you can use the original word count example (or its hadoop 0.20.1 version that we used in the previous issue), add to it the single combiner line specified earlier in the post and run it on our moby-dick mascot. Here is what we can see at the end of the execution:
Now that you understand what counters are, if you click to enlarge the picture, you’ll see the value of two counters: Combine input records=215137 and Combine output records=33783. That’s a pretty serious reduction of the number of key/value pairs sent to the reducers. You can easily imagine the impact for much larger jobs (see the references below for real numbers).
Enjoy combiners, whenever you can…
References
In this 3rd issue of the hadoop tutorial series, we’ll talk about a very simple but very useful feature of hadoop: counters.
Even if you have never defined any counters in hadoop, you can see some of them each time you run a hadoop job. Indeed, here is what you can see from the client console at the end of the execution of a job (it can also be seen from the web interface):
As you can see, 18 internal counters are presented inside different groups. For instance, you can see a section “Job Counters” with three different counters giving basic information about the job, like the number of mappers and reducers. In that example, “Job Counters” is the group of the counter and “Launched reduce tasks” (for instance) is the name of the counter itself.
It is very handy to define your own counters to track any kind of statistics about the records you are manipulating in the mapper and the reducer. The most natural use is tracking the number of malformed records.
If you are executing a job and you see an abnormally high number of malformed records, it can be a good hint that you have a bug in your code or some problem with your data (note that this is actually a much simpler way to spot issues than tracking error messages in a distributed set of log files). But you can use counters for any other kind of statistics on your records.
One easy way to define your own counters from your Java code is:
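The general pattern, which the example below uses, is to declare an enum (its declaring class name becomes the counter group, its values the counter names) and increment it through the Context object. In this sketch, MyCounters and its values are placeholder names:

```java
// The enum's name is used as the counter group; each enum value
// is a counter name within that group. Placeholder names here.
static enum MyCounters { MALFORMED_RECORDS, VALID_RECORDS }

// Then, from inside a map or reduce method (hadoop 0.20 API):
context.getCounter(MyCounters.MALFORMED_RECORDS).increment(1);
```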
So let’s see an example. We’ll take the word count example revised for version 0.20.1 to illustrate the use of counters. We will create a counter group called WordsNature that will count how many unique tokens there are in all, how many start with a digit and how many start with a letter.
So our enum declaration will look like that:
static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }
We will also need a very basic StringUtils class:
package com.philippeadjiman.hadooptraining;
public class StringUtils {
public static boolean startsWithDigit(String s){
if( s == null || s.length() == 0 )
return false;
return Character.isDigit(s.charAt(0));
}
public static boolean startsWithLetter(String s){
if( s == null || s.length() == 0 )
return false;
return Character.isLetter(s.charAt(0));
}
}
Since we are interested in unique tokens, we will put the code related to the counter into the reduce method. So here is how the reduce method will look:
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
String token = key.toString();
if( StringUtils.startsWithDigit(token) ){
context.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);
}
else if( StringUtils.startsWithLetter(token) ){
context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}
context.getCounter(WordsNature.ALL).increment(1);
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
Here is the code of the WordCountWithCounter class that includes this code.
If you want to run it inside our learning playground, you’ll just have to update the pom with the latest hadoop version:
<dependency>
<groupId>org.apache.mahout.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.1</version>
</dependency>
So here is the result after running the code with, as input, the whole text of moby dick:
So we can see that we now have 33783 unique tokens, 32511 starting with a letter and 263 starting with a digit. What about the 1009 others? Well, because the word count example uses a basic StringTokenizer that splits tokens at spaces, a lot of words simply start with a ‘(‘ or a ‘[‘, or even with ‘–‘. To solve that you can for instance use a lucene StandardAnalyzer.
You should now be able to easily implement your own counters for tracking bad records/missing values, debugging or gathering any kind of statistics around your job.
See you soon for another issue…
Basically, it should take as input a search query and return as output a ranked set of URLs corresponding to the latest hot news around that search query.
In some way it is similar to what you would expect to find on google news or in one of the dozens of real-time search engines that were released last year (let’s cite oneriot, crowdeye and collecta).
The goal of my prototype is to demonstrate how to leverage twitter and a simple ranking algorithm to obtain, most of the time, relevant URLs in response to hot queries, without having to crawl a single web page! As my primary target is relevancy, I won’t invest any effort in the performance or scalability of the prototype (retrieved results will be built at query time).
Basically what I did is use the twitter API, through a java library called twitter4j, to retrieve all the latest tweets containing the input query and a link. For very hot queries, you’re likely to get a lot of those (I put a limit at the last 150, but you’ll be able to change it). Once I got my “link farm”, I applied a basic ranking algorithm that ranks first the URLs that are the most referenced.
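In its simplest form, that ranking step boils down to counting how many tweets reference each (expanded) URL and sorting in decreasing order of that count. A minimal plain-java sketch (class and method names are mine, not the prototype’s actual code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count how often each URL appears in the "link farm" and rank the
// most referenced ones first. Illustrative sketch, not the actual code.
public class LinkRanker {

    public static List<String> rank(List<String> expandedUrls) {
        // Count occurrences of each URL.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String url : expandedUrls) {
            Integer c = counts.get(url);
            counts.put(url, c == null ? 1 : c + 1);
        }
        // Sort the distinct URLs by decreasing reference count.
        List<String> ranked = new ArrayList<String>(counts.keySet());
        Collections.sort(ranked, (a, b) -> counts.get(b) - counts.get(a));
        return ranked;
    }
}
```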
As most of the URLs in tweets are shortened URLs, the trick is to spot identical URLs that were shortened by different shortening services. For instance, both of the following shortened URLs point to the same page of my blog: http://tinyurl.com/yajkgeg and http://bit.ly/SmHw6. It can sound like a corner case but it actually happens all the time with hot queries. So the idea is to convert all the short URLs into their expanded versions. To see how to write a universal URL expander in java that works for the 90+ existing URL shortening services, check the post that is referenced by the two short URLs above.
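The expansion itself can be sketched with nothing but the standard library: issue a request without following redirects and read the Location header. This is a simplified, single-hop version (my actual expander linked above handles chains of redirects and the quirks of the different services):

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Simplified single-hop URL expander, standard library only.
public class SimpleExpander {

    // 3xx status codes are redirects whose Location header holds the target.
    static boolean isRedirect(int statusCode) {
        return statusCode >= 300 && statusCode < 400;
    }

    public static String expand(String shortUrl) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(shortUrl).openConnection();
        conn.setInstanceFollowRedirects(false); // we want the Location header
        conn.setRequestMethod("HEAD");          // no need for the body
        if (isRedirect(conn.getResponseCode())) {
            return conn.getHeaderField("Location");
        }
        return shortUrl; // not a shortened URL (or already expanded)
    }
}
```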
Note that you can improve the ranking algorithm in tons of ways, by exploiting the text of the tweets, who actually wrote the tweet (reputation), or other sources like digg, and much more. But as we’ll see, even in its simplest form, the ranking algorithm presented above works pretty well.
To find some hot queries to play with, you can for instance take one of the google hot trends queries (unfortunately down from 100 to 40 to 20). Let’s try with a very hot topic while I’m writing this post: the google Nexus One phone, which was about to be presented to the press two days after I started to write this post.
Below I have compiled the results obtained respectively by Google News, OneRiot and my toy prototype on the query “nexus one”. Click the picture to enlarge.
I hope you enjoyed my killer UI :). But let’s focus on the three URLs corresponding of the first result of each one:
Given that at the time I issued the query the Nexus One was not yet released, I would say that the article the prototype found is the best one, since it is the only one presenting an exclusive video demonstrating the not-yet-released phone. This is also why so many people were tweeting about this link: because it was the best at that precise time! We’ll see even more in the next section.
Before, let’s try with another hot query today (in the top 20 hottest queries of google hot trends): “byron de la beckwith”.
That time, it is not clear what story/news is hidden behind that hot query, but running it on the prototype gives as the first link the article below (click the picture if you want to see the full article).
Again this is a very relevant result (oneRiot and Google News gave the same one at that time).
What is interesting with hot queries is that you expect the results to change even within a short amount of time. Indeed, any story or breaking news generally evolves as new elements come in. As promised, let’s follow our “nexus one” query.
In the previous section, the prototype’s first result was a very relevant article from engadget. I relaunched the same query 12 hours later. The first ranked result returned by my prototype is now different: still another article from engadget (see picture below), but this time with a much more in-depth review of the phone, with more videos, including a very funny comparison between the android, iphone and nexus one.
Then I waited for Google to do its press conference one day later. I issued the query again. Can you guess what the first link given by my prototype was? You got it: the official Google Nexus One website.
Again this is not a corner case. This temporal aspect happens all the time, for any type of breaking news or event. As a last example of that phenomenon, let’s take the movie avatar. In the first days before and after the movie was released, all you got were links to the trailer or even the movie itself. Now, a few weeks later, what you get is a very fast changing list of links to funny parodies of the movie, with titles like “Do you want to date my avatar” (picture below), or a letter attempting to prove that avatar is actually Pocahontas in 3d :).
If you just want to run the prototype through the command line
You must have java 6 installed (you can check by opening a console and typing java -version). On recent macs, see these instructions to have java 6 ready to use in a snap.
Then just download this zip archive: jarsDependencies.zip.
Save it and extract it somewhere in your computer. It will create a directory named prototypeJars.
Open a command prompt. Go inside the directory prototypeJars.
If you are on windows, just type:
java -cp "*;" com.philippeadjiman.rtseproto.RealTimeSEPrototype "nexus one" 150 OFF
If you are on Linux or Mac just type:
java -cp "*" com.philippeadjiman.rtseproto.RealTimeSEPrototype "nexus one" 150 OFF
You’ll notice the last three arguments (all mandatory):
Only the top 20 results will be printed.
If you want to play with the code
As the title suggests, it’s just a few hundred lines of (java) code. As it is a toy project, and to keep things simple, I voluntarily didn’t use any DI framework like spring or guice and tried to use as few external libraries as possible unless necessary (even no log4j!). I did write a minimal amount of unit tests since I cannot code without them, and I did use the google-collections library for the same reason :).
Also, I tried to write at least a minimal amount of comments, in particular where I think the code should be improved a lot for better performance. But remember: the prototype is of course not scalable, as it does not rely on any indexing strategy (it computes its results at query time). Building a real search engine would first involve building an index offline (using lucene for instance).
You’ll find the source code here prototype_src.zip.
If you are using maven and eclipse (or other popular IDE), you should be ready to go in less than a minute by unpacking the zip, typing “mvn eclipse:eclipse” and importing the existing project.
What I wanted to prove here is mainly that, without crawling a single webpage, you can answer “hot queries” with a relevancy comparable to what you can find on google news or any “real time search engine”. This is made possible by judiciously using the tremendous power that twitter provides with its open API.
Of course, building a real “real time search engine” would require much more than a few hundred lines of code, and hundreds of features could be added to that prototype, but I would keep two core principles:
In that sense, among the dozens of real time search engines I have tested, my favorite one remains oneriot.
This is for the “pull” side of the things (when the user knows what to search for). I did not talk about the “push” side of the real time web here, probably in another post…
If you have issues running the prototype or any other question/remark, please shoot a comment.
What is it all about?
As you may know, a map/reduce job will most of the time contain more than one reducer. So basically, when a mapper emits a key/value pair, it has to be sent to one of the reducers. Which one? The mechanism sending specific key/value pairs to specific reducers is called partitioning (the key/value pair space is partitioned among the reducers). A Partitioner is responsible for performing the partitioning.
In Hadoop, the default partitioner is the HashPartitioner, which hashes a record’s key to determine which partition (and thus which reducer) the record belongs to. The number of partitions is then equal to the number of reduce tasks of the job.
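Concretely, the default behavior amounts to something like the following plain-java rendition (the & Integer.MAX_VALUE masks the sign bit so that the result is never negative, whatever the key’s hashCode):

```java
// Plain-java rendition of hadoop's default partitioning logic:
// hash the key, clear the sign bit, take it modulo the reducer count.
public class HashPartitionDemo {

    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 2 reducers, every key deterministically lands on 0 or 1,
        // and the same key always goes to the same reducer.
        System.out.println(getPartition("the", 2));
        System.out.println(getPartition("whale", 2));
    }
}
```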
Why is it important?
First, it has a direct impact on the overall performance of your job: a poorly designed partitioning function will not evenly distribute the load over the reducers, potentially losing most of the benefit of the map/reduce distributed infrastructure.
Second, it may sometimes be necessary to control the key/value pair partitioning over the reducers. Let’s illustrate it with a simple example. Suppose that your job’s input is a (huge) set of tokens and their numbers of occurrences (for instance the output of the canonical word count hadoop example) and that you want to sort them by number of occurrences. Let’s also suppose that your job will be handled by 2 reducers. If you run the job without any customized partitioner, you’ll get something like this:
As you can see, the tokens are correctly ordered by number of occurrences on each reducer (which is what hadoop guarantees by default) but this is not what you need! You’d rather expect something like:
where tokens are totally ordered across the reducers, from 1 to 30 occurrences on the first reducer and from 31 to 14620 on the second. This would happen as a result of a correct partitioning function: all the tokens having a number of occurrences inferior to N (here 30) are sent to reducer 1 and the others to reducer 2, resulting in two partitions. Since the tokens are sorted within each partition, you get the expected total order on the number of occurrences.
Below, we’ll use our playground to observe the issue happening on real data and see how we solve it using customized partitioners.
Also, as a second example of use of customized partitioning functions, let’s cite the original map/reduce google paper: “sometimes the output keys are URLs, and we want all entries for a single host to end up in the same output. To support situations like this, the user of the MapReduce library can provide a special partitioning function. For example, using “hash(Hostname(urlkey)) mod R” as the partitioning function causes all URLs from the same host to end up in the same output”.
Feeling the partitions in our playground
If your playground is not yet set up, check the Issue #1 of this series. As an input for our job, we’ll use a tsv file containing the list of tokens and their numbers of occurrences extracted from (once again) the full moby dick text. Click here to download this input. You’ll notice that the pairs (token, #occurrences) are alphanumerically sorted on the token value.
First, we’ll use a very simple pre-processing job to transform the input data into a more convenient format to use within hadoop: the Sequence File Output Format. Sequence files are a basic file-based data structure persisting the key/value pairs in a binary format and allowing you to interact more easily with basic hadoop data types (e.g. IntWritable, LongWritable, etc.). Here is the simple pre-processing job:
package com.philippeadjiman.hadooptraining;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class SortDataPreprocessor {
static class PreprocessorMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
String line = value.toString();
String[] tokens = line.split("\t");
if( tokens == null || tokens.length != 2 ){
System.err.print("Problem with input line: "+line+"\n");
return;
}
int nbOccurences = Integer.parseInt(tokens[1]);
word.set(tokens[0]);
output.collect(new IntWritable(nbOccurences),word );
}
}
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(SortDataPreprocessor.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(PreprocessorMapper.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputFormat(SequenceFileOutputFormat.class);
JobClient.runJob(conf);
}
}
You’ll notice that:
To run it, package the job using maven (see Issue #1), put the input file on hdfs in an input directory (let’s call it input) and execute:
hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortDataPreprocessor /user/training/input /user/training/pre_process
This will create a directory called “pre_process” on hdfs containing a set of pairs (#occurrences,token), respectively of format IntWritable and Text, in a SequenceFileOutputFormat.
Now we can perform the sort based on this new input. Writing a job for such a task is actually trivial, since this is primarily what hadoop does by default, so here it is:
package com.philippeadjiman.hadooptraining;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
public class SortExample {
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(SortExample.class);
conf.setJobName("sortexample");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(2);
JobClient.runJob(conf);
}
}
You’ll notice that:
Just run it using:
hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortExample /user/training/pre_process /user/training/output
Once completed, an output directory will be created on hdfs with two files, one for each of the reducers that were used. You can observe the content of the output using commands like:
hadoop fs -cat output/part-00000 | less
hadoop fs -cat output/part-00001 | less
As you’ll see, the two outputs are sorted but do not represent a total order, as explained above. Let’s fix it.
How to implement your own partitioning function
So how do we create a total order out of those two reducers?
A first solution is to create our own partitioner, which is as simple as implementing the Partitioner<K,V> interface:
package com.philippeadjiman.hadooptraining;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
public class MyPartitioner implements Partitioner<IntWritable,Text> {
@Override
public int getPartition(IntWritable key, Text value, int numPartitions) {
/* Pretty ugly hard coded partitioning function. Don't do that in practice, it is just for the sake of understanding. */
int nbOccurences = key.get();
if( nbOccurences < 3 )
return 0;
else
return 1;
}
@Override
public void configure(JobConf arg0) {
}
}
This implementation of getPartition puts all the pairs whose key (here the number of occurrences) is less than 3 into the first partition and the others into the second one. Hard-coding a partitioning function like this is of course pretty bad practice, but it is done here simply for the sake of understanding.
To use this custom partitioner, just add the following line to the main method of the previous SortExample class:
conf.setPartitionerClass(MyPartitioner.class);
Why did I choose 3? Because, as a side effect of the Zipf law, the number of tokens having 1 or 2 occurrences will be about as big as all the others together! (To see a Zipf law in action, check this post.) So 3 was chosen just to balance the partitions a little.
Re-package the code with the customized partitioner, remove the old output, run it again and check that our problem is solved: there is a total order over the partitions.
How to automatically find “good” partitioning function using sampling
Now, as I mentioned above, it is pretty bad practice to hard-code how to partition the keys. But on the other hand, how can you know automatically, and in advance, how to divide the partitions in the general case?
Hadoop provides a nice way to approximate a good partitioning function a priori using an InputSampler. For instance, a RandomSampler will sample the input at random to estimate the best way to partition. The sampler writes to a file called, by default, _partition.lst, describing the partitions that the job will automatically use to decide which key/value pairs to send to which reducers. This mechanism has to be used in combination with a TotalOrderPartitioner.
Here is a code sample using such a sampler with a total order partitioner:
InputSampler.Sampler<IntWritable, Text> sampler =
new InputSampler.RandomSampler<IntWritable, Text>(0.1, 100);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);
Sometimes there are issues with the _partition.lst file not being found. It always worked for me when I specified explicitly where to find the file using the TotalOrderPartitioner.setPartitionFile() method. Also pay attention to invoke this method before the call to writePartitionFile. Finally, note that the sampling mechanism is necessary, since considering the whole input to compute the partitions would be inefficient for large files.
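Putting those remarks together, the calls can be ordered as in the sketch below, a hypothetical fragment of the SortExample main method (the partition file path is an assumption, adapt it to your setup):

```java
// Hypothetical job-setup fragment: explicit partition file + sampling.
// Pin the partition file location so TotalOrderPartitioner finds it.
Path partitionFile = new Path("/user/training/_partition.lst");
TotalOrderPartitioner.setPartitionFile(conf, partitionFile);

// Sample 10% of the input (at most 100 samples) to estimate split points.
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 100);

// Called AFTER setPartitionFile, so the samples are written there.
InputSampler.writePartitionFile(conf, sampler);

conf.setPartitionerClass(TotalOrderPartitioner.class);
```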
Some remarks
Conclusion
Partitioning in map/reduce is a fairly simple concept, but one that is important to get right. Most of the time, the default partitioning based on a hash function can be sufficient. But as we illustrated in this Issue, you’ll sometimes need to modify the default behavior and to implement your own partitioning suited to your needs.
If you have some questions, or if you have experimented with other use cases of customized partitioning in your applications, please comment/share. See you soon for another Issue.