Analysis of Variance of Cross-Validation Estimators of the Generalization Error

Marianthi Markatou; Hong Tian; Shameek Biswas; George Hripcsak

This paper brings together methods from two disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error, treating it as a problem of approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluating the various methods used to obtain training and test sets, and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y = Card(S_j ∩ S_j') and Y* = Card(S_j^c ∩ S_j'^c), where S_j, S_j' are two training sets and S_j^c, S_j'^c are the corresponding test sets. We prove that the distributions of Y and Y* are hypergeometric, and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results to the regression case and to the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation.
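The overlap variables Y and Y* can be illustrated with a small simulation. The sketch below is not the paper's estimator. It assumes each training set is drawn uniformly at random without replacement from n indices, with the test set as its complement. The names `overlap_counts`, `hypergeom_pmf` and `simulate`, and the sizes n = 20 and n_train = 12, are hypothetical choices made only for this illustration.

```python
import random
from math import comb


def overlap_counts(n, n_train, rng):
    """Draw two training sets independently and return (Y, Y*).

    Assumption: each training set is a uniform random subset of size
    n_train from {0, ..., n-1}; the test set is its complement.
    """
    s_j = set(rng.sample(range(n), n_train))
    s_jp = set(rng.sample(range(n), n_train))
    y = len(s_j & s_jp)            # Y  = Card(S_j ∩ S_j')
    y_star = n - len(s_j | s_jp)   # Y* = Card(S_j^c ∩ S_j'^c)
    return y, y_star


def hypergeom_pmf(k, n, marked, draws):
    """P(k marked items among `draws` items drawn from n, `marked` of them marked)."""
    return comb(marked, k) * comb(n - marked, draws - k) / comb(n, draws)


def simulate(n=20, n_train=12, reps=20000, seed=0):
    """Return the empirical means of Y and Y* over `reps` pairs of splits."""
    rng = random.Random(seed)
    pairs = [overlap_counts(n, n_train, rng) for _ in range(reps)]
    mean_y = sum(y for y, _ in pairs) / reps
    mean_y_star = sum(ys for _, ys in pairs) / reps
    return mean_y, mean_y_star


if __name__ == "__main__":
    n, n_train = 20, 12
    n_test = n - n_train
    mean_y, mean_y_star = simulate(n, n_train)
    # Hypergeometric means under the sampling assumption above.
    print("empirical E[Y]  =", mean_y, " theoretical =", n_train**2 / n)
    print("empirical E[Y*] =", mean_y_star, " theoretical =", n_test**2 / n)
```

Under this sampling assumption, Y follows a hypergeometric distribution with population size n, n_train marked items and n_train draws, so its mean is n_train^2 / n. Y* has the analogous distribution with the test-set size in place of n_train. The empirical means printed by the script should be close to these values.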


Abstract