## CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

*Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, Joshua B. Tenenbaum*; 17(138):1−49, 2016.

### Abstract

There is a widespread need for statistical methods that can
analyze high-dimensional datasets without imposing restrictive
or opaque modeling assumptions. This paper describes a domain-
general data analysis method called CrossCat. CrossCat infers
multiple non-overlapping views of the data, each consisting of a
subset of the variables, and uses a separate nonparametric
mixture to model each view. CrossCat is based on approximately
Bayesian inference in a hierarchical, nonparametric model for
data tables. This model consists of a Dirichlet process mixture
over the columns of a data table in which each mixture component
is itself an independent Dirichlet process mixture over the
rows; the inner mixture components are simple parametric models
whose form depends on the types of data in the table. CrossCat
combines strengths of mixture modeling and Bayesian network
structure learning. Like mixture modeling, CrossCat can model a
broad class of distributions by positing latent variables, and
produces representations that can be efficiently conditioned and
sampled from for prediction. Like Bayesian networks, CrossCat
represents the dependencies and independencies between
variables, and thus remains accurate when there are multiple
statistical signals. Inference is done via a scalable Gibbs
sampling scheme; this paper shows that it works well in
practice. This paper also includes empirical results on
heterogeneous tabular data of up to 10 million cells, such as
hospital cost and quality measures, voting records, unemployment
rates, gene expression measurements, and images of handwritten
digits. CrossCat infers structure that is consistent with
accepted findings and common-sense knowledge in multiple domains
and yields predictive accuracy competitive with generative,
discriminative, and model-free alternatives.

[abs][pdf][bib]