A similar question was posted on the vw mailing list. I'll try to summarize the main points in all responses here for the benefit of future users.
Unbalanced training sets best practices:
Your training set is highly unbalanced (200,000 to 100). This means that only about 0.0005 (0.05%) of the examples have a label of 1. By always predicting -1, the classifier achieves a remarkable accuracy of 99.95%. In other words, if the cost of a false positive is equal to the cost of a false negative, this is actually an excellent classifier. If you are looking for an equal-weighted result, you need to do two things:
- Reweight your examples so that the smaller group has the same total weight as the larger one.
- Reorder/shuffle the examples so positives and negatives are intermixed.
The second point is especially important in online learning, where the learning rate decays with time. It follows that, assuming you are free to reorder the examples (e.g. there is no time dependence between them), the ideal order for online learning is a completely uniform shuffle, with the two classes evenly interleaved (1, -1, 1, -1, ...).
Also note that the syntax for the example-weights (assuming a 2000:1 prevalence ratio) needs to be something like the following:
1 2000 optional-tag| features ...
-1 1 optional-tag| features ...
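As a rough illustration of where a weight like 2000 comes from, here is a minimal Python sketch. The helper names `balance_weights` and `to_vw_line` are made up for this example (they are not part of vw), and the tag is omitted for brevity; it simply gives each class the same total weight and prints lines in the `label weight | features` shape shown above.

```python
from collections import Counter


def balance_weights(labels):
    """Return a per-label weight so that every class has the same total weight.

    The most common class gets weight 1; rarer classes get proportionally more.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def to_vw_line(label, features, weight=1.0):
    """Format one example as 'label weight | features' (hypothetical helper, no tag)."""
    return f"{label} {weight:g} | {features}"


# 100 positives vs. 200,000 negatives, as in the question.
labels = [1] * 100 + [-1] * 200_000
weights = balance_weights(labels)

print(to_vw_line(1, "f1 f2", weights[1]))
print(to_vw_line(-1, "f1 f3", weights[-1]))
```

With the counts from the question, the rare class ends up with a weight of 2000 and the common class with a weight of 1.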
And as mentioned above, breaking the single example with a weight of 2000 down into 2000 repetitions with a weight of 1 each, interleaved with the 2000 common examples (those with the -1 label) instead:
1 | ...
-1 | ...
1 | ... # repeated, very rare, example
-1 | ...
1 | ... # repeated, very rare, example
Should lead to even better results in terms of smoother convergence and lower training loss. Caveat: as a general rule, repeating any example too much, as in the case of a 1:2000 ratio, is very likely to lead to over-fitting the repeated class. You may want to counter that by slower learning (using --learning_rate ...) and/or randomized resampling (using --bootstrap ...).
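A minimal Python sketch of the repeat-and-interleave idea, under a few assumptions: the examples fit in memory, there is no time dependence between them, and a seeded random shuffle is an acceptable stand-in for a strict 1, -1, 1, -1 alternation. The function name `oversample_and_shuffle` and the feature strings are hypothetical.

```python
import random


def oversample_and_shuffle(examples, rare_label=1, factor=2000, seed=42):
    """Repeat every rare-class example `factor` times, then shuffle all lines.

    `examples` is a list of (label, features) pairs; the result is a list of
    vw-style 'label | features' lines with an implicit weight of 1 each.
    """
    lines = []
    for label, features in examples:
        copies = factor if label == rare_label else 1
        lines.extend([f"{label} | {features}"] * copies)
    # Seeded shuffle so the positives and negatives end up intermixed reproducibly.
    random.Random(seed).shuffle(lines)
    return lines


# One rare positive example and 2000 common negative ones (made-up features).
examples = [(1, "rare_feature")] + [(-1, f"common_{i}") for i in range(2000)]
lines = oversample_and_shuffle(examples)
print(len(lines))
```

The over-fitting caveat above still applies: all 2000 positive lines here are copies of the same example.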
Consider downsampling the prevalent class
To avoid over-fitting: rather than overweighting the rare class by 2000x, consider going the opposite way and "underweighting" the more common class by throwing away most of its examples. While this may sound surprising (how can throwing away perfectly good data be beneficial?), it avoids over-fitting the repeated class as described above, and may actually lead to better generalization. Depending on the case, and on the costs of a false classification, the optimal down-sampling factor may vary (it is not necessarily 1/2000 in this case, but may be anywhere between 1 and 1/2000).
Another approach, which requires some programming, is to use active learning: train on a very small part of the data, then continue to predict the class without learning (-t or zero weight); if the class is the prevalent class and the online classifier is very certain of the result (the predicted value is extreme, or very close to -1 when using --link glf1), throw the redundant example away. In other words: focus your training on the boundary cases only.
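Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `keep_fraction` and `threshold` parameters, and their values are made up for the example, and the predictions are assumed to have been produced beforehand (e.g. by a test-only run) and to lie in [-1, 1].

```python
import random


def downsample(examples, common_label=-1, keep_fraction=0.01, seed=0):
    """Keep every rare example, and each common example with probability keep_fraction."""
    rng = random.Random(seed)
    return [
        (label, features)
        for label, features in examples
        if label != common_label or rng.random() < keep_fraction
    ]


def keep_boundary_cases(examples, predictions, common_label=-1, threshold=0.9):
    """Drop common-class examples that the model already predicts confidently.

    A common-class example is considered redundant when its prediction has the
    sign of `common_label` and a magnitude of at least `threshold`.
    """
    kept = []
    for (label, features), pred in zip(examples, predictions):
        confident = pred * common_label >= threshold
        if label == common_label and confident:
            continue  # redundant: far from the decision boundary
        kept.append((label, features))
    return kept


examples = [(-1, "a"), (-1, "b"), (1, "c")]
predictions = [-0.99, -0.2, -0.95]
print(keep_boundary_cases(examples, predictions))
```

In the small usage example, the first negative is dropped because the model is already sure about it, while the uncertain negative and the (misclassified) positive are kept.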
Use of --binary (depends on your need)
--binary outputs the sign of the prediction (and calculates progressive loss accordingly). If you want probabilities, do not use --binary, and pipe the vw prediction output into utl/logistic (in the source tree). utl/logistic will map the raw prediction into signed probabilities in the range [-1, +1].
One effect of --binary is a misleading (optimistic) loss. Clamping predictions to {-1, +1} can dramatically increase the apparent accuracy, as every correct prediction has a loss of 0.0. This might be misleading, as just adding --binary often makes it look as if the model is much more accurate (sometimes perfectly accurate) than without --binary.
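To see why the clamped loss looks so optimistic, here is a small Python sketch with made-up numbers. It assumes squared loss for the raw predictions and a 0/1 loss for the clamped ones; the function names and the three predictions are purely illustrative and do not come from an actual vw run.

```python
def squared_loss(pred, label):
    """Squared error between a raw prediction and a -1/+1 label."""
    return (pred - label) ** 2


def zero_one_loss(pred, label):
    """Loss after clamping the prediction to its sign: 0 if correct, 1 otherwise."""
    sign = 1 if pred > 0 else -1
    return 0.0 if sign == label else 1.0


# Weak predictions that are all on the correct side of zero.
raw = [0.1, -0.2, 0.3]
labels = [1, -1, 1]

avg_squared = sum(squared_loss(p, y) for p, y in zip(raw, labels)) / len(raw)
avg_clamped = sum(zero_one_loss(p, y) for p, y in zip(raw, labels)) / len(raw)
print(avg_squared, avg_clamped)
```

The raw predictions are far from the labels, so their average squared loss is roughly 0.65, yet the clamped version reports a perfect 0.0 because every sign happens to be right.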
Update (Sep 2014): a new option was recently added to vw: --link logistic, which implements the [0,1] mapping, while predicting, inside vw. Similarly, --link glf1 implements the more commonly needed [-1, 1] mapping. Mnemonic: glf1 stands for "generalized logistic function with a [-1, 1] range".
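For reference, the two mappings can be sketched in Python as below. This uses the standard logistic function and its rescaling to [-1, 1]; it is my understanding that this is what the two link options compute, but treat the exact correspondence to vw's internals as an assumption rather than a quote from its source.

```python
import math


def logistic(x):
    """Standard logistic function, mapping a raw score into (0, 1).

    Written in a numerically stable form so large negative inputs do not overflow.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def glf1(x):
    """Logistic function rescaled to (-1, 1): 2 * logistic(x) - 1."""
    return 2.0 * logistic(x) - 1.0


for raw in (-3.0, 0.0, 3.0):
    print(raw, logistic(raw), glf1(raw))
```

A raw prediction of 0 maps to 0.5 under the first mapping and to 0 under the second, which is why the [-1, 1] form is convenient when the labels are -1 and 1.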
Go easy on --l1 and --l2
It is a common mistake to use high --l1 and/or --l2 values. The values are used directly per example, rather than, say, relative to 1.0. More precisely: in vw, l1 and l2 apply directly to the sum of gradients (or the "norm") in each example. Try to use much lower values, like --l1 1e-8. utl/vw-hypersearch can help you find optimal values for various hyper-parameters.
Be careful with multiple passes
It is a common mistake to use --passes 20 in order to minimize training error. Remember that the goal is to minimize generalization error rather than training error. Even with the cool addition of holdout (thanks to Zhen Qin), where vw automatically terminates early when the error stops going down on automatically held-out data (by default every 10th example is held out), multiple passes will eventually start to over-fit the held-out data (the "no free lunch" principle).