V-statistics and Variance Estimation

Zhengze Zhou; Lucas Mentch; Giles Hooker

As machine learning procedures become an increasingly popular modeling option among applied researchers, there has been a corresponding interest in developing valid tools for understanding their statistical properties and uncertainty. Tree-based ensembles like random forests remain one such popular option, for which several important theoretical advances have been made in recent years by drawing upon a connection between their natural subsampled structure and the classical theory of $U$-statistics. Unfortunately, the procedures for estimating predictive variance resulting from these studies are plagued by severe bias and extreme computational overhead. Here, we argue that the root of these problems lies in the use of subsampling without replacement and that subsampling with replacement, which results in $V$-statistics, substantially alleviates them. We develop a general framework for analyzing the asymptotic behavior of $V$-statistics, demonstrating asymptotic normality under precise regularity conditions and establishing previously unreported connections to $U$-statistics. Importantly, these findings allow us to produce a natural and efficient means of estimating the variance of a conditional expectation, a problem of wide interest across multiple scientific domains that also lies at the heart of uncertainty quantification for supervised learning ensembles.
