[SOLR-7082] Streaming Aggregation for SolrCloud - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.1, 6.0
Component/s: SolrCloud
Labels: None

Description

This issue provides a general-purpose streaming aggregation framework for SolrCloud. An overview of how it works can be found at this link:

http://heliosearch.org/streaming-aggregation-for-solrcloud/

This functionality allows SolrCloud users to perform operations that were typically done using map/reduce or a parallel computing platform.

Here is a brief explanation of how the framework works:

There is a new SolrJ io package: org.apache.solr.client.solrj.io

Key classes:

Tuple: Abstracts a document in a search result as a Map of key/value pairs.
TupleStream: the base class for all of the streams. Abstracts search results as a stream of Tuples.
SolrStream: connects to a single Solr instance. You call the read() method to iterate over the Tuples.
CloudSolrStream: connects to a SolrCloud collection and merges the results based on the sort param. The merge takes place in CloudSolrStream itself.
Decorator Streams: wrap other streams to perform operations on the streams. Some examples are the UniqueStream, MergeStream and ReducerStream.
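To illustrate the kind of merge that CloudSolrStream is described as performing on results that are already sorted, here is a minimal sketch of a k-way merge using only the JDK. The names MergeSketch and mergeSorted are hypothetical and are not part of the patch; the actual implementation may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: merges several already-sorted inputs (think: one per
// shard) into a single sorted list. Not the CloudSolrStream implementation.
class MergeSketch {

    // Pairs the current head element of one input with the rest of that input.
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    static <T> List<T> mergeSorted(List<List<T>> sortedInputs, Comparator<T> order) {
        // The heap always exposes the smallest head across all inputs.
        PriorityQueue<Head<T>> heap =
            new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));

        for (List<T> input : sortedInputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head<T> head = heap.poll();
            merged.add(head.value);
            // Refill the heap from the input that just produced a value.
            if (head.rest.hasNext()) {
                heap.add(new Head<>(head.rest.next(), head.rest));
            }
        }
        return merged;
    }
}
```

The merge only ever compares the current head of each input, which is why each input must already be sorted by the same criteria as the merged output.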

Going parallel with the ParallelStream and "Worker Collections"

The io package also contains the ParallelStream, which wraps a TupleStream and sends it to N worker nodes. The workers are chosen from a SolrCloud collection. These "Worker Collections" don't have to hold any data; they can just be used to execute TupleStreams.

The StreamHandler

The Worker nodes have a new RequestHandler called the StreamHandler. The ParallelStream serializes a TupleStream before it is opened and sends it to the StreamHandler on the Worker nodes.

The StreamHandler on each Worker node deserializes the TupleStream, opens the stream, iterates the tuples and streams them back to the ParallelStream. The ParallelStream performs the final merge of Metrics and can be wrapped by other Streams to handle the final merged TupleStream.

Sorting and Partitioning search results (Shuffling)

Each of the N Worker nodes is shuffled 1/N of the document results. There is a "partitionKeys" parameter that can be included with each TupleStream to ensure that Tuples with the same partitionKeys are shuffled to the same Worker. The actual partitioning is done with a filter query using the HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in the filter cache, which provides extremely high performance hash partitioning.
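The idea behind partition keys can be sketched in a few lines of plain Java: hash the values of the partition keys and take the result modulo the number of workers, so that tuples with equal key values always land on the same worker. The hash function below is an assumption chosen for illustration, and PartitionSketch and workerFor are hypothetical names; this is not how the HashQParserPlugin is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Illustrative only: assigns a tuple to one of N workers from the values of
// its partition keys. The hashing scheme here is assumed, not Solr's.
class PartitionSketch {

    // Returns a worker index in the range [0, workers).
    static int workerFor(Map<String, Object> tuple, List<String> partitionKeys, int workers) {
        int hash = 1;
        for (String key : partitionKeys) {
            // Objects.hashCode returns 0 for a missing (null) value.
            hash = 31 * hash + Objects.hashCode(tuple.get(key));
        }
        // floorMod keeps the index non-negative even when hash is negative.
        return Math.floorMod(hash, workers);
    }
}
```

Only the fields named in the partition keys feed the hash, so two tuples that differ in other fields but agree on the keys are routed to the same worker.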

Many of the stream transformations rely on the sort order of the TupleStreams (GroupByStream, MergeJoinStream, UniqueStream, FilterStream, etc.). To accommodate this, the search results can be sorted by specific keys. The "/export" handler can be used to sort entire result sets efficiently.
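A small sketch shows why sort order matters for such transformations: when the input is sorted by a field, duplicates are adjacent, so a unique-style decorator only has to remember the previous value. The types below are hypothetical stand-ins written against the JDK only; they are not the SolrJ classes, and returning null at end of stream is a simplification made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a tuple stream; NOT the SolrJ TupleStream class.
// Simplification: read() returns null once the stream is exhausted.
interface SortedSketchStream {
    Map<String, Object> read();
}

// An in-memory source standing in for a sorted search result.
class SortedListStream implements SortedSketchStream {
    private final Iterator<Map<String, Object>> it;

    SortedListStream(List<Map<String, Object>> tuples) {
        this.it = tuples.iterator();
    }

    @Override
    public Map<String, Object> read() {
        return it.hasNext() ? it.next() : null;
    }
}

// A decorator that emits the first tuple of each run of equal field values.
// It is only correct when the wrapped stream is sorted by that field.
class UniqueSketch implements SortedSketchStream {
    private final SortedSketchStream inner;
    private final String field;
    private boolean started = false;
    private Object previous;

    UniqueSketch(SortedSketchStream inner, String field) {
        this.inner = inner;
        this.field = field;
    }

    @Override
    public Map<String, Object> read() {
        Map<String, Object> tuple;
        while ((tuple = inner.read()) != null) {
            Object value = tuple.get(field);
            if (!started || !Objects.equals(value, previous)) {
                started = true;
                previous = value;
                return tuple;
            }
            // Same value as the previous tuple: skip it and keep reading.
        }
        return null;
    }

    // Reads a stream to the end and collects the values of one field.
    static List<Object> drain(SortedSketchStream stream, String field) {
        List<Object> values = new ArrayList<>();
        Map<String, Object> tuple;
        while ((tuple = stream.read()) != null) {
            values.add(tuple.get(field));
        }
        return values;
    }
}
```

On unsorted input the same decorator would let non-adjacent duplicates through, which is why sorting inside the search engine is a precondition rather than an optimization.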

By specifying the sort order of the results and the partition keys, documents will be sorted and partitioned inside of the search engine. So when the tuples hit the network, they are already sorted, partitioned and headed directly to the correct Worker node.

Extending The Framework

To extend the framework, you create new TupleStream Decorators that gather custom metrics or perform custom stream transformations.
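As a rough picture of what a metric-gathering decorator could look like, the sketch below wraps a stream, passes every tuple through unchanged, and accumulates a count and a sum for one numeric field along the way. All types and names here (MetricSketchStream, MetricListStream, SumMetricSketch) are hypothetical and use only the JDK; a real decorator would extend the TupleStream class from the patch, whose exact methods are not shown in this issue.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a tuple stream; NOT the SolrJ TupleStream class.
// Simplification: read() returns null once the stream is exhausted.
interface MetricSketchStream {
    Map<String, Object> read();
}

// An in-memory source standing in for a search result.
class MetricListStream implements MetricSketchStream {
    private final Iterator<Map<String, Object>> it;

    MetricListStream(List<Map<String, Object>> tuples) {
        this.it = tuples.iterator();
    }

    @Override
    public Map<String, Object> read() {
        return it.hasNext() ? it.next() : null;
    }
}

// A decorator that gathers a custom metric as tuples flow through it.
class SumMetricSketch implements MetricSketchStream {
    private final MetricSketchStream inner;
    private final String field;
    private long count = 0;
    private double sum = 0.0;

    SumMetricSketch(MetricSketchStream inner, String field) {
        this.inner = inner;
        this.field = field;
    }

    @Override
    public Map<String, Object> read() {
        Map<String, Object> tuple = inner.read();
        if (tuple != null) {
            Object value = tuple.get(field);
            // Tuples without a numeric value for the field are passed
            // through but do not contribute to the metric.
            if (value instanceof Number) {
                count++;
                sum += ((Number) value).doubleValue();
            }
        }
        return tuple;
    }

    long count() {
        return count;
    }

    double sum() {
        return sum;
    }
}
```

Because the decorator implements the same interface as the stream it wraps, it can itself be wrapped by further decorators, which is the composition style the Key classes list describes.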

Attachments

SOLR-7082.patch (05/Feb/15 19:48, 251 kB, Joel Bernstein)
SOLR-7082.patch (17/Feb/15 19:43, 251 kB, Joel Bernstein)
SOLR-7082.patch (03/Mar/15 02:30, 253 kB, Joel Bernstein)
SOLR-7082.patch (09/Mar/15 21:54, 256 kB, Joel Bernstein)
SOLR-7082.patch (09/Mar/15 23:04, 258 kB, Joel Bernstein)

Issue Links

is duplicated by: SOLR-6526 Solr Streaming API (Resolved)

People

Assignee: Unassigned
Reporter: Joel Bernstein
Votes: 4
Watchers: 15

Dates

Created: 05/Feb/15 19:46
Updated: 08/Oct/19 14:51
Resolved: 23/Apr/15 01:08