Home Page

Papers

Submissions

News

Scope

Editorial Board

Announcements

Proceedings

Open Source Software

Search

Login



RSS Feed

Scalable Simple Random Sampling and Stratified Sampling

Xiangrui Meng
;
JMLR W&CP 28 (3) : 531–539, 2013

Abstract

Analyzing data sets of billions of records has now become a regular task in many companies and institutions. In the statistical analysis of those massive data sets, sampling generally plays a very important role. In this work, we describe a scalable simple random sampling algorithm, named ScaSRS, which uses probabilistic thresholds to decide on the fly whether to accept, reject, or wait-list an item independently of others. We prove, with high probability, it succeeds and needs only \(O(\sqrt{k})\) storage, where \(k\) is the sample size. ScaSRS extends naturally to a scalable stratified sampling algorithm, which is favorable for heterogeneous data sets. The proposed algorithms, when implemented in MapReduce, can effectively reduce the size of intermediate output and greatly improve load balancing. Empirical evaluation on large-scale data sets clearly demonstrates their superiority.

Related Material