Design
Cleo is designed to support typeahead search for members' social network connections and activities. Cleo relies on Adjacency List, Bloom Filter and Forward Index to enable partial out-of-order string prefix match over a vary large number of elements by means of in-memory scanning and filtering. The design of Cleo can be summarized as:
- Adjacency List is for storing members' network connections.
- Bloom Filter provides a fast means to filter prefix mismatches.
- Forward Index is used to reject false positives from filtering.
During indexing, a tiny bloom filter (e.g., 4 bytes) is calculated against the pre-processed terms (e.g., {"jeff", "weiner"}) of an element using a predefined hash function h.
During search, the same function h is used to calculate the query bloom filter against the terms in a typeahead query (e.g., {"j", "wein"}).
The connections of the searcher are retrieved in the form of adjacency list in memory.
The connections are scanned to filter out prefix mismatch using the query bloom filter and the pre-computed bloom filter values for every member connected to the searcher.
Upon each bloom filter hit, the forward index is visited to reject any potential false positive prefix matches.
The design simplicity of Cleo makes it easy to combine and/or extend the three key components for varying application purposes. The Cleo library has provided several different typeahead search implementations which we will discuss shortly.
Basic API
Similar to general-purpose search libraries, Cleo provides separate interfaces for searching and indexing. The following classes define the basic API of Cleo.
- Element
- Searcher
- Indexer
- Connection
- ConnectionIndexer
An Element represents an entity that is characterized by a set of terms (words) and
is retrievable by the prefixes of each known term. The terms of an element may be pre-processed
(e.g., lowercase) depending on application needs. Elements are retrieved and included in
typeahead search results.
The Searcher API defines a number of methods for retrieving elements (i.e., search results) from
the underlying indexes that are populated by an Indexer in real-time.
The Indexer is for indexing various types of elements. In general, elements need to be serializable
so that they can be stored persistently.
The Connection is for representing different social relationships (e.g., friends, network connections, group associations).
The ConnectionIndexer is for indexing various types of connections.
Core Typeahead Classes
There are different implementations for typeahead search in Cleo.
The core Typeahead classes listed below all implement the Searcher and Indexer interfaces.
Generic typeahead search is typically implemented as the subclasses of AbstractTypeahead.
In contrast, NetworkTypeahead need to maintain various social network connections to support typeahead search over a person's social network.
Therefore, the subclasses of NetworkTypeahead implement the ConnectionIndexer interface.
The following figure shows the class inheritance hierarchy of Searcher.
- GenericTypeahead
- ScannerTypeahead
- VanillaNetworkTypeahead
- WeightedNetworkTypeahead
The GenericTypeahead is designed for large data sets, which may contain millions of elements,
in the absence of social network connections. It is built on Inverted Index
(similar to the adjacency list in the sense that a term is associated with a list of element IDs),
Bloom Filter and Forward Index.
When evaluating a query, the typeahead selects the shortest list associated with the prefixes of query terms
from the inverted index. The selected list is scanned. The GenericTypeahead applies bloom filter to reject prefix mismatches, and
visits the forward index to reject false positive matches left by the bloom filter.
The typeahead search results are collected and then ranked according to their popularity
such as the number of clicks or page views.
The ScannerTypeahead is suitable for small data sets that contain no more than 500,000 elements.
It uses Bloom Filter and Forward Index but not Adjacency List.
When evaluating a query, the ScannerTypeahead scans all the elements,
applies bloom filter to reject prefix mismatches, and
visits the forward index to reject false positive matches left by the bloom filter.
The VanillaNetworkTypeahead provides a basic implementation for Network Typeahead using the three key techniques
Adjacency List, Bloom Filter and Forward Index in the presence of social network
such as professional network connections, friendship and followership. A person's network is
represented as an adjacency list (i.e. a mapping from a source ID to a list of target IDs).
There is no further information about how strong a relationship between two members can be.
So the concept of closeness is not available.
When evaluating a query from a searcher, the VanillaNetworkTypeahead retrieves and scans the searcher's
social network connections from the adjacency list, applies bloom filter to reject prefix mismatches, and
visits the forward index to reject false positive matches left by the bloom filter.
The WeightedNetworkTypeahead improves upon VanillaNetworkTypeahead by
extending Adjacency List with a integer weight defined for each social relationship in the list.
The closeness of a social relationship can be modeled and then used to rank typeahead search results for better
relevance. In fact, the LinkedIn 1st and 2nd degree network connections typeahead leverages the connection scores
from the PYMK (People You May Know) to make typeahead search results relevant.
When evaluating a query from a searcher, the VanillaNetworkTypeahead retrieves and scans the searcher's
social network connections from the adjacency list, applies bloom filter to reject prefix mismatches, and
visits the forward index to reject false positive matches left by the bloom filter. The typeahead search results
are collected and then ranked based on their corresponding connection scores from the adjacency list.