Benefitting from the Variables that Variable Selection Discards

Rich Caruana, Virginia R. de Sa; 3(Mar):1245-1264, 2003.

Abstract

In supervised learning, variable selection is used to find a subset of the available inputs that accurately predicts the output. This paper shows that some of the variables that variable selection discards can beneficially be used as extra outputs for inductive transfer. Using discarded input variables as extra outputs forces the model to learn mappings from the variables that were selected as inputs to these extra outputs. Inductive transfer makes what is learned by these mappings available to the model being trained on the main output, often improving performance on that main output.
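
The sketch below is not the authors' code; it is a minimal illustration of the idea under simple assumptions: a synthetic regression task, a crude correlation-based variable selection step, and a scikit-learn MLPRegressor trained with the main target plus some of the discarded inputs appended as extra outputs. Only the main-output column is used at test time.

    # Hypothetical sketch of "discarded inputs as extra outputs" (multi-task learning).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, d = 2000, 10
    X = rng.normal(size=(n, d))
    # Main target depends on features 0-3; features 4-6 are very noisy copies of it,
    # so they carry relevant signal but are poor inputs.
    y = X[:, :4].sum(axis=1) + 0.1 * rng.normal(size=n)
    X[:, 4:7] = y[:, None] + 5.0 * rng.normal(size=(n, 3))

    # Crude variable selection: keep the 4 inputs most correlated with y.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
    selected = np.argsort(corr)[-4:]
    discarded = np.setdiff1d(np.arange(d), selected)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Baseline: selected inputs only, single output.
    base = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    base.fit(X_tr[:, selected], y_tr)

    # Inductive transfer: same inputs, but discarded variables become extra outputs
    # during training; at test time only the first output (the main target) is used.
    Y_multi = np.column_stack([y_tr, X_tr[:, discarded]])
    multi = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    multi.fit(X_tr[:, selected], Y_multi)

    mse_base = np.mean((base.predict(X_te[:, selected]) - y_te) ** 2)
    mse_multi = np.mean((multi.predict(X_te[:, selected])[:, 0] - y_te) ** 2)
    print(f"selected-inputs-only MSE: {mse_base:.3f}")
    print(f"with extra outputs   MSE: {mse_multi:.3f}")

Because the extra outputs share hidden-layer representations with the main output, the network is encouraged to learn internal features that also explain the discarded variables, which is the mechanism the paper exploits.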

We present three synthetic problems (two regression problems and one classification problem) where performance improves when some variables discarded by variable selection are used as extra outputs. We then apply variable selection to two real problems (DNA splice-junction and pneumonia risk prediction) and demonstrate the same effect: using some of the discarded input variables as extra outputs yields somewhat better performance on both problems than variable selection alone. This approach thus enhances variable selection by letting the learner exploit variables that would otherwise have been discarded, without suffering the loss in performance that occurs when those variables are used as inputs.
