Cross-validation for model selection

One of the most frequently cited papers in model selection is “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion” by M. Stone, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 44-47.
(Akaike’s 1974 paper, which introduced the Akaike Information Criterion (AIC), is the most often cited paper on the subject of model selection.)

The popularity of AIC comes from its simplicity. By penalizing the maximized log likelihood with the number of model parameters (p), one can choose the model that best describes/generates the data. Nonetheless, we know that AIC has a shortcoming: it assumes that all candidate models are nested within each other and come from the same parametric family. For an exponential family, the trace of the product of the score function term and the inverse Fisher information becomes equivalent to the number of parameters, which naturally raises the question, “what happens when the trace cannot be obtained analytically?”

The general form of AIC is called TIC (Takeuchi’s information criterion, Takeuchi, 1976), where the penalty term is written as the trace of the product of the score function term and the inverse Fisher information. Still, I haven’t answered the question above.
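
For reference, the two criteria can be written side by side. This is a sketch in commonly used notation rather than the notation of the original papers: the symbols \ell, \hat{J} and \hat{I} are assumptions introduced here, with \hat{J} the estimated expected outer product of the score and \hat{I} the estimated Fisher information (negative expected Hessian).

```latex
\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p,
\qquad
\mathrm{TIC} = -2\,\ell(\hat\theta) + 2\,\mathrm{tr}\!\left(\hat{J}\,\hat{I}^{-1}\right),

\text{where}\quad
J = \mathrm{E}\!\left[\nabla_\theta \log f(X\mid\theta)\,\nabla_\theta \log f(X\mid\theta)^{\top}\right],
\quad
I = -\,\mathrm{E}\!\left[\nabla_\theta^{2} \log f(X\mid\theta)\right].
```

When the model is correctly specified, J = I, so the trace equals p and TIC reduces to AIC.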

I personally think that a trick to avoid this dilemma is the key content of Stone (1977): using cross-validation. Stone proved that computing the log likelihood by cross-validation is asymptotically equivalent to AIC, without computing the score function and Fisher information or getting an exact count of the number of parameters. Cross-validation makes it possible to obtain penalized maximum log likelihoods across models (penalizing is necessary because the parameters are estimated from the same data), so that comparing models for selection becomes feasible, while it alleviates the worry of getting the proper number of parameters (the penalty).
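
Stated as a formula, the equivalence looks roughly as follows. This is a sketch of the result under Stone's regularity conditions and a correctly specified model; the notation \hat\theta_{(-i)} for the MLE computed without the i-th observation is an assumption introduced here.

```latex
\sum_{i=1}^{n} \log f\!\left(x_i \mid \hat\theta_{(-i)}\right)
  = \ell(\hat\theta) - p + o_p(1),
\qquad\text{so}\qquad
-2 \sum_{i=1}^{n} \log f\!\left(x_i \mid \hat\theta_{(-i)}\right) \approx \mathrm{AIC}.
```

The left-hand side needs only repeated model fits and likelihood evaluations; no derivatives of the log likelihood appear.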

Numerous tactics are available for the purpose of model selection. Although variable selection (where candidate models are generally nested) is a very hot topic in statistics these days and tons of publications can be found, there are not many works that apply resampling methods to model selection. As Stone proved, cross-validation removes the difficulty of calculating the score function and Fisher information of a model. Until last year I was working on non-nested model selection (selecting the best model from different parametric families) with the jackknife, together with Prof. Babu and Prof. Rao at Penn State (the paper hasn’t been submitted yet), based on the finding that the jackknife makes it possible to get an unbiased maximum likelihood. Despite its high computational cost compared to cross-validation and the jackknife, the bootstrap has also occasionally appeared in model selection.

I’m not sure whether cross-validation or the jackknife is a feasible approach to implement in astronomical software when it computes statistics. Certainly it has advantages when it comes to calculating likelihoods, as with the Cash statistic.

1. vlk:

How exactly does one apply cross-validation to AIC? Also, what is a “score function” in this context?

08-20-2007, 12:41 am
2. hlee:

Did I say “apply cross-validation to AIC?” Hmm… To clarify, apply cross-validation to model selection!!!

LOO (Leave One Out) is an expression that Prof. Rao often used. For a maximum likelihood calculation, leave one observation out and compute the maximum likelihood (ML) and the ML estimator (MLE) with the rest. With the one left-out observation and that MLE, the likelihood of that observation is obtained; repeat this process for all observations. Asymptotically, calculating the likelihood by LOO is equivalent to AIC.
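
The LOO recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian model, the simulated data, and all function names (`gaussian_mle`, `gaussian_logpdf`, `loo_log_likelihood`) are made up for this example and do not come from Stone's paper or any particular package.

```python
import math
import random


def gaussian_mle(xs):
    """MLE of (mean, variance) for a Gaussian model (variance divides by n)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def gaussian_logpdf(x, mu, var):
    """Log density of a Gaussian with mean mu and variance var at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def loo_log_likelihood(xs):
    """Sum over i of log f(x_i | MLE computed without x_i)."""
    total = 0.0
    for i, x_i in enumerate(xs):
        rest = xs[:i] + xs[i + 1:]          # leave the i-th datum out
        mu, var = gaussian_mle(rest)        # refit on the remaining data
        total += gaussian_logpdf(x_i, mu, var)
    return total


# Simulated data (an assumption for the demo): 500 Gaussian draws.
random.seed(0)
data = [random.gauss(1.0, 2.0) for _ in range(500)]

mu_hat, var_hat = gaussian_mle(data)
max_loglik = sum(gaussian_logpdf(x, mu_hat, var_hat) for x in data)
cv_loglik = loo_log_likelihood(data)

# The gap plays the role of the AIC penalty; here p = 2 (mean and variance).
print("maximized log likelihood:", max_loglik)
print("LOO log likelihood:      ", cv_loglik)
print("difference:              ", max_loglik - cv_loglik)
```

For a correctly specified model the printed difference should be close to the number of parameters, which is the sense in which LOO and AIC agree asymptotically.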

Instead of “score function,” I’d rather say J function, but this single letter creates more ambiguity. Here, the score function means the expectation of the first derivative of the log likelihood at the true parameter. Fisher information involves the second-order derivative, and there are cases where the analytic forms of such derivatives are not available; in those cases cross-validation could replace AIC or TIC.

One drawback would be the computation time: O(n) if AIC is O(1). For binned/clipped data this increase may be negligible, but if we happen to keep all 1078 channels and adopt a complicated model for the MLEs, we’d better not use resampling methods without smart optimization tools.

08-20-2007, 3:44 am
3. vlk:

When you Leave One Out, the thing you are leaving out is a datum, correct? So then you have a series of MLEs of the parameter, one for each datum left out. What next? How do they get combined and how do you then go from parameter estimation to model selection?

08-21-2007, 5:01 pm
4. hlee:

Inference on parameters is not the main subject of model selection. Once a model is chosen, we can move on to estimating parameters or hypothesis testing. However, to my understanding, there are not many works combining model selection and inference for general applications. Bayesians may think differently because they adopt models when choosing priors and likelihoods to build chains. My aim in model selection is to decide whether I should choose a thermal or a non-thermal model (I hope the choice of words is correct).

To answer your questions: yes, one leaves one datum out and gets a series of MLEs. One computes the log likelihood of each datum with the MLE obtained without that datum; adding up all of these single-datum log likelihoods gives the log likelihood computed via cross-validation (CV). Next, one chooses one model among the candidates based on the likelihoods of the different models (if CV is chosen), or on information criteria (many of which are maximum likelihood + penalty). Once the model is chosen, we can move to the inference step; however, this needs some care.
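
As a sketch of how the combined CV likelihoods lead to a choice between models, the snippet below compares two non-nested families on the same data. Everything here is an assumption for illustration: the exponential and log-normal candidates merely stand in for competing physical models, the data are simulated, and the helper names are hypothetical.

```python
import math
import random


def fit_exponential(xs):
    """MLE of the exponential rate."""
    return (len(xs) / sum(xs),)


def logpdf_exponential(x, params):
    (rate,) = params
    return math.log(rate) - rate * x


def fit_lognormal(xs):
    """MLE of (mu, var) of log(x) for a log-normal model."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, var


def logpdf_lognormal(x, params):
    mu, var = params
    z = math.log(x) - mu
    return -math.log(x) - 0.5 * (math.log(2 * math.pi * var) + z * z / var)


def cv_log_likelihood(xs, fit, logpdf):
    """Leave-one-out log likelihood: refit without x_i, then score x_i."""
    total = 0.0
    for i, x_i in enumerate(xs):
        params = fit(xs[:i] + xs[i + 1:])
        total += logpdf(x_i, params)
    return total


# Simulated positive data (an assumption for the demo): log-normal draws.
random.seed(1)
data = [random.lognormvariate(0.0, 0.5) for _ in range(200)]

candidates = {
    "exponential": (fit_exponential, logpdf_exponential),
    "lognormal": (fit_lognormal, logpdf_lognormal),
}

# One CV log likelihood per candidate model; the largest one wins.
scores = {
    name: cv_log_likelihood(data, fit, logpdf)
    for name, (fit, logpdf) in candidates.items()
}
best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```

The individual leave-one-out MLEs are discarded after scoring; only the summed log likelihood per model is kept, and parameter inference for the selected model is a separate step.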

To prevent confusion: when model selection is the goal of the study, estimating parameters (getting MLEs) is only a by-product of getting the maximum likelihoods. Andrew Liddle and his colleagues have been writing papers on model selection applied to cosmology. Their papers may help to understand how statistical model selection is applied to astronomy, although their model selection methods are limited to BIC and DIC. I had a feeling that Protassov et al (2001) just scratched the surface of model selection and didn’t let people taste the fruit. Yet it’s a good reference, at least because of its appendix.

08-22-2007, 11:57 am
5. hlee:

In addition, the maximum likelihood is not the only statistic for model selection. Nonetheless, its popularity seems to originate from the fact that Boltzmann’s maximum entropy, Shannon’s information theory, and Fisher’s maximum likelihood principle are equivalent.

08-22-2007, 12:02 pm