## On Semi-Supervised Linear Regression in Covariate Shift Problems

*Kenneth Joseph Ryan, Mark Vere Culp*; 16(Dec):3183–3217, 2015.

### Abstract

Semi-supervised learning approaches are trained using the full
training (labeled) data and available testing (unlabeled) data.
Demonstrations of the value of training with unlabeled data
typically depend on a smoothness assumption relating the
conditional expectation to high density regions of the marginal
distribution and an inherent missing completely at random
assumption for the labeling. So-called covariate shift poses a
challenge for many existing semi-supervised or supervised
learning techniques. Covariate shift models allow the marginal
distributions of the labeled and unlabeled feature data to
differ, while the conditional distribution of the response given
the feature data remains the same. This occurs, for example, when
a complete labeled sample and then an unlabeled sample are
obtained sequentially, since the feature distributions are then
likely to differ substantially between the two
samples. The value of using unlabeled data during training for
the elastic net is justified geometrically in such practical
covariate shift problems. The approach obtains adjusted
coefficients for unlabeled prediction that recalibrate the
supervised elastic net to compromise between (i) maintaining
elastic net predictions on the labeled data and (ii) shrinking
unlabeled predictions toward zero. Our approach is shown to dominate
linear supervised alternatives on unlabeled response predictions
when the unlabeled feature data are concentrated on a low
dimensional manifold away from the labeled data and the true
coefficient vector emphasizes directions away from this
manifold. When the unlabeled responses are expected to be small,
the large variance of the supervised predictions on the unlabeled
set is reduced by more than the squared bias increases, so the
performance gain reflects an improved compromise within the
bias-variance tradeoff. Performance is
validated on simulated and real data.
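
The compromise between (i) and (ii) can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes a ridge fit (an elastic net with the lasso penalty set to zero) as the supervised model, and a hypothetical quadratic objective `||X_lab b - X_lab beta_sup||^2 + gamma * ||X_unl b||^2` for the adjustment. The names `ridge_fit`, `adjusted_coefficients`, `make_demo_data`, and the tuning parameter `gamma` are illustrative assumptions, as is the simulated data.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Supervised ridge fit (elastic net with the lasso penalty set to zero)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def adjusted_coefficients(X_lab, X_unl, beta_sup, gamma):
    """Hypothetical adjustment (an assumption, not the paper's formula).

    Minimizes over b:
        ||X_lab @ b - X_lab @ beta_sup||^2 + gamma * ||X_unl @ b||^2
    Setting the gradient to zero gives the linear system solved below.
    Assumes X_lab has full column rank so the system is nonsingular.
    """
    G_lab = X_lab.T @ X_lab
    G_unl = X_unl.T @ X_unl
    return np.linalg.solve(G_lab + gamma * G_unl, G_lab @ beta_sup)


def make_demo_data(seed=0):
    """Simulated covariate shift: same response model, different feature law."""
    rng = np.random.default_rng(seed)
    X_lab = rng.normal(size=(50, 3))
    # Unlabeled features concentrated near a line along the first coordinate.
    t = rng.normal(size=(30, 1))
    X_unl = 3.0 * t * np.array([[1.0, 0.0, 0.0]]) + 0.05 * rng.normal(size=(30, 3))
    # True coefficients emphasize directions off that line.
    beta_true = np.array([0.0, 1.0, -1.0])
    y = X_lab @ beta_true + 0.5 * rng.normal(size=50)
    return X_lab, y, X_unl


X_lab, y, X_unl = make_demo_data()
beta_sup = ridge_fit(X_lab, y, lam=1.0)
for gamma in (0.0, 1.0, 10.0):
    b = adjusted_coefficients(X_lab, X_unl, beta_sup, gamma)
    unl_size = np.sum((X_unl @ b) ** 2)            # (ii) size of unlabeled predictions
    lab_change = np.sum((X_lab @ (b - beta_sup)) ** 2)  # (i) change on labeled data
    print(gamma, unl_size, lab_change)
```

Under this assumed objective, `gamma = 0` returns the supervised coefficients unchanged, and larger `gamma` trades some fidelity to the supervised predictions on the labeled data for smaller predictions on the unlabeled data.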
