Single versus Multiple Sorting in All Pairs Similarity Search
Yasuo Tabei (JST Minato ERATO Project), Takeaki Uno (National
Institute of Informatics of Japan), Masashi Sugiyama (Tokyo Institute
of Technology), and Koji Tsuda (National Institute of Advanced
Industrial Science and Technology);
JMLR W&P 13:145-160, 2010.
Abstract
To save memory and improve speed, vectorial data such as images
and signals are often represented as strings of discrete symbols (i.e.,
sketches). Chariker (2002) proposed a fast approximate method for
finding neighbor pairs of strings by sorting and scanning with a small
window. This method, which we shall callgsingle sortingh, is applied
to locality sensitive codes and prevalently used in speed-demanding
web-related applications. To improve on single sorting, we propose a
novel method that employs blockwise masked sorting. Our method
can dramatically reduce the number of candidate pairs which have
to be verified by distance calculation in exchange with an increased
amount of sorting operations. So it is especially attractive for high
dimensional dense data, where distance calculation is expensive. Empirical
results show the efficiency of our method in comparison to
single sorting and recent fast nearest neighbor methods.