This is a common misunderstanding of vowpal wabbit.
One cannot compare batch learning with online learning.
vowpal wabbit is not a batch learner. It is an online learner. Online learners learn by looking at examples one at a time and slightly adjusting the weights of the model as they go.
There are advantages and disadvantages to online learning. The downside is that convergence to the final model is slow/gradual. The learner doesn't do a "perfect" job at extracting information from each example, because the process is iterative. Convergence on a final result is deliberately restrained/slow. This can make online learners appear weak on tiny data-sets like the above.
There are several upsides though:
- Online learners don't need to load the full data into memory (they work by examining one example at a time and adjusting the model based on the real-time observed per-example loss) so they can scale easily to billions of examples. A 2011 paper by 4 Yahoo! researchers describes how vowpal wabbit was used to learn from a tera (10^12) feature data-set in 1 hour on 1k nodes. Users regularly use
vw
to learn from billions of examples data-sets on their desktops and laptops.
- Online learning is adaptive and can track changes in conditions over time, so it can learn from non-stationary data, like learning against an adaptive adversary.
- Learning introspection: one can observe loss convergence rates while training and identify specific issues, and even gain significant insights from specific data-set examples or features.
- Online learners can learn in an incremental fashion so users can intermix labeled and unlabeled examples to keep learning while predicting at the same time.
- The estimated error, even during training, is always "out-of-sample" which is a good estimate of the test error. There's no need to split the data into train and test subsets or perform N-way cross-validation. The next (yet unseen) example is always used as a hold-out. This is a tremendous advantage over batch methods from the operational aspect. It greatly simplifies the typical machine-learning process. In addition, as long as you don't run multiple-passes over the data, it serves as a great over-fitting avoidance mechanism.
Online learners are very sensitive to example order. The worst possible order for an online learner is when classes are clustered together (all, or almost all, -1
s appear first, followed by all 1
s) like the example above does. So the first thing to do to get better results from an online learner like vowpal wabbit, is to uniformly shuffle the 1
s and -1
s (or simply order by time, as the examples typically appear in real-life).
OK now what?
Q: Is there any way to produce a reasonable model in the sense that it gives reasonable predictions on small data when using an online learner?
A: Yes, there is!
You can emulate what a batch learner does more closely, by taking two simple steps:
- Uniformly shuffle
1
and -1
examples.
- Run multiple passes over the data to give the learner a chance to converge
Caveat: if you run multiple passes until error goes to 0, there's a danger of over-fitting. The online learner has perfectly learned your examples, but it may not generalize well to unseen data.
The second issue here is that the predictions vw
gives are not logistic-function transformed (this is unfortunate). They are akin to standard deviations from the middle point (truncated at [-50, 50]). You need to pipe the predictions via utl/logistic
(in the source tree) to get signed probabilities. Note that these signed probabilities are in the range [-1, +1] rather than [0, 1]. You may use logistic -0
instead of logistic
to map them to a [0, 1] range.
So given the above, here's a recipe that should give you more expected results:
# Train:
vw train.vw -c --passes 1000 -f model.vw --loss_function logistic --holdout_off
# Predict on train set (just as a sanity check) using the just generated model:
vw -t -i model.vw train.vw -p /dev/stdout | logistic | sort -tP -n -k 2
Giving this more expected result on your data-set:
-0.95674145247658 P1
-0.930208359811439 P2
-0.888329575506748 P3
-0.823617739247262 P4
-0.726830630992614 P5
-0.405323815830325 P6
0.0618902961794472 P7
0.298575998150221 P8
0.503468453150847 P9
0.663996516371277 P10
0.715480084449868 P11
0.780212725426778 P12
You could make the results more/less polarized (closer to 1
on the older ages and closer to -1
on the younger) by increasing/decreasing the number of passes. You may also be interested in the following options for training:
--max_prediction <arg> sets the max prediction to <arg>
--min_prediction <arg> sets the min prediction to <arg>
-l <arg> set learning rate to <arg>
For example, by increasing the learning rate from the default 0.5
to a large number (e.g. 10
) you can force vw
to converge much faster when training on small data-sets thus requiring less passes to get there.
Update
As of mid 2014, vw
no longer requires the external logistic
utility to map predictions back to [0,1] range. A new --link logistic
option maps predictions to the logistic function [0, 1] range. Similarly --link glf1
maps predictions to a generalized logistic function [-1, 1] range.