I’ve been playing around with convergence for my SGD implementation for the upcoming LingPipe 3.5, in the context of the 2008 i2b2 Obesity Challenge, the full title of which is "Second i2b2 Shared-Task and Workshop; Challenges in Natural Language Processing for Clinical Data; Obesity Challenge (A Shared-Task on Obesity): Who’s obese and what co-morbidities do they (definitely/likely) have?". Participants have until April 15, 2008 to register to participate.

Slide 37 of Léon Bottou’s NIPS tutorial Learning with Large Datasets reveals the "secret ingredient" behind (his) successful stochastic gradient search (where η is the learning rate, which he calls a "gain"):

At any moment during training, we can:

- Select a small subsample of examples.
- Try various gains η on the subsample.
- Pick the gain η that most reduces the cost.
- Use it for the next 100000 iterations on the full dataset.
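A minimal Python sketch of that recipe (the `cost` interface is a stand-in of my own — it represents running a few SGD steps at a given gain on the subsample and returning the resulting objective):

```python
import random

def pick_gain(cost, examples, gains, subsample_size=100):
    """Bottou-style gain selection: score each candidate gain on a
    small random subsample and keep the one with the lowest cost.
    `cost(gain, subsample)` stands in for running a few SGD steps
    at that gain and reporting the resulting objective value."""
    subsample = random.sample(examples, min(subsample_size, len(examples)))
    return min(gains, key=lambda g: cost(g, subsample))
```

The chosen gain would then be held fixed for the next 100,000 full-dataset iterations before re-probing.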

This is a kind of meta-descent algorithm, the best known of which is Nicol Schraudolph’s Stochastic Meta-Descent. I’m going to try Bottou’s approach tomorrow. But I’m skeptical it’ll work very well, because the optimal learning rate changes pretty quickly through the corpus, so balancing batch size and number of rates to probe may prove tricky. What I need is a control stick, sort of like David MacKay et al.’s Dasher interface for convergence.

Until today, I’ve been using a simple annealing schedule to set the learning rate. The good news is that the rumors were true — sparse SGD is very fast. The bad news is that driving the SGD implementation is like bobsledding down a steep, bumpy hill in the dark (that reference was for you, Jeff). Too high a rate, and the first bump’ll blow you off the hill; too heavy a hand on the brakes, and it’ll take months to get to the bottom. Just right, and you’re there in a few hundred iterations.

Today, I played around with the so-called "reckless driver" heuristic, which increases the learning rate for the next epoch if the current epoch lowered the error, and decreases it if the overall error went up. It helps. Most of the time. I’m undecided about whether to include it in the release, and if so, whether to parameterize the acceleration/deceleration terms.
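In Python, the heuristic amounts to a one-line rate update per epoch (the acceleration and braking factors here are illustrative defaults, not LingPipe’s actual values):

```python
def reckless_driver(rate, prev_error, curr_error,
                    accelerate=1.1, brake=0.5):
    """'Reckless driver' update: speed up when the last epoch
    lowered the error, brake hard when it rose."""
    if curr_error < prev_error:
        return rate * accelerate  # error fell: push the rate up
    return rate * brake           # error rose: back off sharply
```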

If you can get close to a solution early on with a fairly low learning rate, turning the rate up later on seems to help. Yes, I know that eventually the learning rate has to decrease to get within epsilon of a solution. So much for asymptotics and unbounded number of steps in the theoretical analysis.

I’m reporting total error and breaking out the log likelihood component. What’s interesting to me is that the log likelihood goes way down (almost to zero, indicating the data’s pretty much separable), then starts climbing up as regularization kicks in to shrink all the parameters back down. I’m wondering if separate learning rates for regularization and likelihood gradient contributions make sense.

I’m also realizing why the "early stopping" heuristic was introduced as a poor man’s regularization. Even with proper Bayesian regularization, the optimal solution under cross-validation for many choices of prior is not the maximum a posteriori solution. For instance, if I take a prior Gaussian with a variance of 2.0 for every dimension, the MAP solution is inferior to "early stopping". Hello, empirical Bayes. And perhaps dimension-specific priors, as in Genkin, Madigan and Lewis’s Bayesian Multinomial Regression Software, which lets you get the effects of variance normalizing the input dimensions.

The other good news is that (at least under most parameterizations) regularized logistic regression definitely works better than naive Bayes or our character language model or TF/IDF classifiers. As I suspected all along, logistic regression, even with just bag-of-word or bag-of-character-ngrams features, is incredibly sensitive to priors, and the task has incredibly high variance under cross-validation no matter which classifiers we use.

On the very next page of his tutorial (38), Bottou describes the batch version of regularized SGD to maintain sparsity in the per-input gradient updates. I independently came up with the same algorithm, and then went on to derive my fully lazy implementation which preserves stochasticity.
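A sketch of the lazy idea in Python, assuming an L2 (Gaussian) penalty — the shrinkage a dimension would have received on every step is applied in one catch-up multiplication when one of its features next fires (names and interface are mine, not LingPipe’s implementation):

```python
def lazy_regularize(weights, last_step, step, features, rate, lam):
    """Catch up on skipped L2 shrinkage for just the dimensions
    active in the current instance, preserving sparse updates."""
    for j in features:
        skipped = step - last_step.get(j, 0)
        # each skipped step would have multiplied w_j by (1 - rate*lam)
        weights[j] = weights.get(j, 0.0) * (1.0 - rate * lam) ** skipped
        last_step[j] = step
    # ...the likelihood-gradient update for `features` follows here
```

Because only the active dimensions are touched, each update stays proportional to the number of features in the instance rather than the full dimensionality.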

April 10, 2008 at 9:22 am |

Cross-validation is not a word of the Truth of God. It’s just another prior, a non-parametric one. Say, if you use leave-one-out, a form of CV, you will overfit bigtime for most practical applications. Now, once you have a relatively robust prior, such as the Gaussian with variance of 2 (which is maybe a bit too informative for my taste) – the improvement you’d get with CV-based hyperparameter selection is often quite negligible in any kind of meaningful utility.

April 10, 2008 at 11:06 am |

Your comment is like deja vu, Aleks. Breck and I were just having this conversation yesterday. I agree that there’re far too many features and not nearly enough data to do feature selection by cross-validation.

I’m just using cross-validation the way you did in that paper of yours I cited in the previous blog entry in this thread, How Fat is Your (Prior’s) Tail?.

In some sense, the only thing I care about is performance on some held out test set. I don’t really care about minimizing the error function per se, except insofar as it leads to better held out performance. Cross-validation’s the best way I know to estimate that, but I’d love to hear other ideas.

Along those lines, I’m finding the Laplace prior clearly superior, as suggested by Goodman and Genkin et al. (the same prior blog post has the refs). And I’m seeing about the same magnitude of effect as reported by Genkin et al. — 70% vs. 82% accuracy on the i2b2 Obesity Challenge data using the same bag-of-word features selected by a count threshold.

What’s interesting is that Goodman did the posterior histogram of parameters and saw they looked like a Laplace distribution. So using the Laplace as a prior is a truly Bayesian maneuver in this setting in that it encodes prior knowledge about likely parameter values.

April 13, 2008 at 2:53 am |

Yes, we did a posterior histogram too on a different corpus, and what we saw was more like a Cauchy, but with a distinguished spike at 0.

One thing we want to do is to tackle the prior inference in a proper Bayesian way instead of going through with CV. But there is some dependence on the parameters of CV: a 2-fold CV will favor a more informative prior than a leave-one-out.

Oh yes, one more thing. I stabilize CV by replicating it many times. For example, I do 5 runs of 5-fold CV, amounting to 25 test/train splits.
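That replication scheme is easy to sketch in Python (the function name and interface are my own):

```python
import random

def replicated_cv_splits(n_items, folds=5, runs=5, seed=0):
    """Replicated k-fold CV: re-shuffle and re-fold `runs` times,
    yielding runs * folds train/test splits (here 5 * 5 = 25)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(runs):
        idx = list(range(n_items))
        rng.shuffle(idx)
        for f in range(folds):
            test = idx[f::folds]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            splits.append((train, test))
    return splits
```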

April 14, 2008 at 2:29 pm |

The only fitting I’m doing (so far) with cross-validation is of one parameter — the prior variance/scale. Oh, and which sets of features to use at a coarse scale, like tokens vs. case-normalized tokens vs. stemmed case-normalized tokens vs. character n-grams. And one more thing, now that I think about it — minimum counts for feature selection.

The problems we’re working on are pretty unstable in the sense of being nearly separable on some folds, so replication is really critical to get any predictive accuracy of held-out performance.

Luckily, with the Laplace prior being used the cross-validation accuracies have much less variance across random folds than with maximum likelihood or even Gaussian priors of the same variance.

The relation of parameters to training sample sizes is a big concern with cross-validation, especially since the usual advice is to fix hyperparameters with cross-validation and then train on all the data (not just the size of a fold).

April 30, 2008 at 12:29 pm |

Bob, I am very interested to hear any future experiments you have about scheduling the SGD learning rate. It has also turned out to be a nuisance in my experiments.

April 30, 2008 at 2:13 pm |

I found the exponential annealing schedule to converge faster and more accurately than the inverse schedule.
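For concreteness, the two schedule families being compared look like this in Python (the constants are illustrative, not the values from the experiments):

```python
def inverse_rate(eta0, t, tau=100.0):
    """Inverse (1/t) schedule: eta_t = eta0 * tau / (tau + t)."""
    return eta0 * tau / (tau + t)

def exponential_rate(eta0, t, decay=0.999):
    """Exponential schedule: eta_t = eta0 * decay ** t."""
    return eta0 * decay ** t
```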

I found Bottou’s suggestion of searching for the best learning rate and then going with it to be unworkable, because the line search for the best rate is highly non-convex, with candidate rates at all sorts of scales that work well for the next 100K examples but badly for the 100K after that.

I went to the extreme of developing a GUI-based control where I could drive the rate, but I really couldn’t do any better than the exponential (and who wants to spend a couple of minutes doing this?). It is informative to see what’s going on with the GUI, though. If you leave the rate constant, improvements gradually dwindle to nothing, then pick up again as the rate is lowered.

I found classifier accuracy wasn’t very sensitive to how many decimal places of accuracy were in the coefficients, so in practice, this may not matter much.