## Algorithms for Modern Data Models

There are two threads in this research.#### Algorithms for Ternary Content Addressable Memories

A Content Addressable Memory (CAM) is an associative lookup memory containing a set of fixed-width cells that can hold arbitrary data bits. A CAM takes a search key as a query, and returns the address of the entry that contains the key, if any. In a Ternary CAM (TCAM), each data bit is capable of storing one of three states: 0, 1, or *, where a * matches both 0 and 1. This is a powerful primitive, and has found much use in high performance network routers and switches. This project will develop general techniques for using TCAMs to implement sophisticated randomized data structures, and use them in multiple data intensive applications.#### Distributed Algorithms at Scale

Map-Reduce and other modern processing paradigms have made it very easy to develop short pieces of software that solve substantial and complex problem from offline data. We believe that similar paradigms can be developed for real-time processing. We have developed algorithms for locality sensitive hashing, for keyword similarity given a background model, for social search, and for UNION-FIND which work well for either Map-Reduce or distributed real-time systems.This project is funded primarily by NSF grant 0915040.

### Project Participants

**PI**: Ashish Goel (Stanford)

**Senior Personnel:**Pankaj Gupta

**Graduate Students:**Rajendra Shinde, Bahman Bahmani.

### Results

Similarity Search and Locality Sensitive Hashing using TCAMs

Authors: Rajendra Shinde, Ashish Goel, Pankaj Gupta, Debojyoti Dutta

*Proceedings of ACM SIGMOD, 2010.*

*single*lookup into a Ternary Content Addressable Memory (TCAM) to solve the subset query problem for small sets, i.e., to check whether a given set (the query) contains (or alternately, is contained in) any one of a large collection of sets in a database. We use each TCAM entry as a small Ternary Bloom Filter (each `bit' of which is one of {0,1,*}) to store one of the sets in the collection. Like Bloom filters, our architecture is susceptible to false positives. Since each TCAM entry is quite small, asymptotic analyses of Bloom filters do not directly apply. Surprisingly, we are able to show that the asymptotic false positive probability formula can be safely used if we penalize the small Bloom filter by taking away just one bit of storage and adding just half an extra set element before applying the formula. We believe that this analysis is independently interesting. The subset query problem has applications in databases, network intrusion detection, packet classification in Internet routers, and Information Retrieval. We demonstrate our architecture on one illustrative streaming application -- intrusion detection in network traffic. By shingling the strings in the database, we can perform a single subset query, and hence a single TCAM search, to skip many bytes in the stream. We evaluate our scheme on the open source CLAM anti-virus database, for

*worst-case*as well as random streams. Our architecture appears to be at least one order of magnitude faster than previous approaches. Since the individual Bloom filters must fit in a single TCAM entry (currently 72 to 576 bits), our solution applies only when each set is of a small cardinality. However, this is sufficient for many typical applications. Also, recent algorithms for the subset-query problem use a small-set version as a subroutine.

Small Subset Queries and Bloom Filters Using Ternary Associative Memories, with Applications

Authors: Ashish Goel and Pankaj Gupta

*Proc. ACM SIGMETRICS, 2010.*

Our solution adapts to distributed, real-time implementations on Active Distributed Hash Tables (basic distributed stream processing systems) with O(1) network calls per search or update.

Partitioned Multi-Indexing: Bringing Order to Social Search

Authors: Bahman Bahmani and Ashish Goel

*Proc. WWW, 2012.*

Fast Incremental and Personalized PageRank

Authors: Bahman Bahmani, Abdur Chowdhury, and Ashish Goel

*Proc. VLDB, 2011.*