Sparse Semiparametric Discriminant Analysis for High-dimensional Zero-inflated Data
Hee Cheol Chung, Yang Ni, Irina Gaynanova.
Year: 2025, Volume: 26, Issue: 210, Pages: 1-54
Abstract
Sequencing-based technologies provide an abundance of high-dimensional biological data sets with highly skewed and zero-inflated measurements. Despite the computational efficiency and high interpretability offered by linear classification methods, the violation of underlying distribution assumptions, driven by high skewness and zero inflation, results in invalid classification rules and interpretations. Furthermore, existing data transformation methods addressing these violations introduce ambiguity, rendering the final model and classification performance contingent on the specific transformation employed. To tackle these challenges, we propose a novel semiparametric framework for discriminant analysis based on the truncated latent Gaussian copula model. This model accommodates skewness and zero inflation, and its estimation procedure ensures robustness against data transformations. To facilitate model interpretability, we incorporate $\ell_1$ sparsity regularization and establish the consistency of the classification directions in high-dimensional settings. We validate our approach using human gut microbiome, breast cancer microRNA, and single-cell RNA sequencing data, highlighting its superior classification accuracy and robustness to data transformations.
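The robustness to data transformations claimed above can be illustrated with a small sketch. The abstract does not name the estimator, so this example assumes a rank-based statistic (Kendall's tau) as a stand-in; it is not the authors' implementation. The function names, cutoff, and correlation value are hypothetical choices made for illustration. The sketch simulates two zero-inflated, skewed variables by truncating a latent bivariate Gaussian, then shows that Kendall's tau is unchanged by strictly increasing transformations such as log1p or a square root.

```python
import numpy as np
from scipy.stats import kendalltau


def simulate_zero_inflated(n=500, rho=0.6, cutoff=0.3, seed=0):
    """Simulate two skewed, zero-inflated variables (illustrative only).

    A latent bivariate Gaussian is drawn; values at or below `cutoff`
    are observed as zero, the rest are observed on a skewed (exp) scale.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # exp(z) > exp(cutoff) > 0 above the cutoff, so zeros stay the smallest values
    return np.where(z > cutoff, np.exp(z), 0.0)


def tau_under_transforms(x):
    """Kendall's tau-b between the two columns under monotone transforms."""
    transforms = {
        "identity": lambda v: v,
        "log1p": np.log1p,
        "sqrt": np.sqrt,
    }
    taus = {}
    for name, f in transforms.items():
        tau, _ = kendalltau(f(x[:, 0]), f(x[:, 1]))
        taus[name] = tau
    return taus


if __name__ == "__main__":
    x = simulate_zero_inflated()
    print("fraction of zeros per column:", (x == 0).mean(axis=0))
    for name, tau in tau_under_transforms(x).items():
        print(f"{name:>8}: tau = {tau:.6f}")
```

Because Kendall's tau depends only on the ordering of the observations and the pattern of ties, the three reported values coincide. A moment-based statistic such as the Pearson correlation would generally differ across the same transformations.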