RPHash: Scalable Big Data Clustering by Random Projection Hashing

RPHash is a streaming algorithm for data clustering based on frequent itemset counting of locality sensitive hash (LSH) collisions generated by multi-probe, randomly projected data vectors. The proposed algorithm called Random Projection Hashing or RPHash, trades sub-optimal computationally performance for improved memory efficiency over other clustering algorithms. Memory efficiency, a premium in streaming and distributed algorithms allows RPHash to be run on a variety of streaming data environments. Two additional features, temporal weighting and privacy are also provided intrinsically by RPHash.

The RPHash solution operates on static data as well as on streaming. An embedding of the code for Spark (map-reduce) processing is available. The essential components in RPHash approach are:

  1. Random projection: supported either by Johnson Lindenstrauss or Database-Friendly projections
  2. Locality Sensitive Hashing (LSH): multiple LSH algorithms implemented including: Spherical, Leech, E8, TreeWalk
  3. Gaussian blurring
  4. Candidate cluster counting: count-min sketch
  5. An offline candidate cluster merge step: kmeans, agglomerative

The main repositories for this project are maintained here: