SOLR-7082

Streaming Aggregation for SolrCloud

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      This issue provides a general purpose streaming aggregation framework for SolrCloud. An overview of how it works can be found at this link:

      http://heliosearch.org/streaming-aggregation-for-solrcloud/

      This functionality allows SolrCloud users to perform operations that were typically done using map/reduce or a parallel computing platform.

      Here is a brief explanation of how the framework works:

      There is a new SolrJ io package: org.apache.solr.client.solrj.io

      Key classes:

      Tuple: Abstracts a document in a search result as a Map of key/value pairs.
      TupleStream: The base class for all of the streams. Abstracts search results as a stream of Tuples.
      SolrStream: Connects to a single Solr instance. You call the read() method to iterate over the Tuples.
      CloudSolrStream: Connects to a SolrCloud collection and merges the results based on the sort param. The merge takes place in CloudSolrStream itself.
      Decorator Streams: Wrap other streams to perform operations on the streams. Some examples are the UniqueStream, MergeStream and ReducerStream.
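
      The relationship between a base stream and a decorator can be sketched as follows. This is a minimal illustration only: SketchStream, ListStream and UniqueSketch are hypothetical stand-ins, not the classes in org.apache.solr.client.solrj.io, and returning null at end of stream is an assumption made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a tuple stream; NOT the real TupleStream class.
interface SketchStream {
    // Returns the next tuple, or null at end of stream (a simplification for this sketch).
    Map<String, Object> read();
}

// Plays the role of a stream over search results: replays tuples from a list in order.
class ListStream implements SketchStream {
    private final Iterator<Map<String, Object>> tuples;

    ListStream(List<Map<String, Object>> tuples) {
        this.tuples = tuples.iterator();
    }

    public Map<String, Object> read() {
        return tuples.hasNext() ? tuples.next() : null;
    }
}

// Decorator: wraps another stream and emits only the first tuple of each run of equal key values.
// The wrapped stream must already be sorted on that key.
class UniqueSketch implements SketchStream {
    private final SketchStream inner;
    private final String key;
    private Object lastValue;
    private boolean seenAny;

    UniqueSketch(SketchStream inner, String key) {
        this.inner = inner;
        this.key = key;
    }

    public Map<String, Object> read() {
        Map<String, Object> tuple;
        while ((tuple = inner.read()) != null) {
            Object value = tuple.get(key);
            if (!seenAny || !Objects.equals(value, lastValue)) {
                seenAny = true;
                lastValue = value;
                return tuple;
            }
        }
        return null;
    }
}

class StreamSketchDemo {
    static Map<String, Object> tuple(String id, int price) {
        Map<String, Object> t = new HashMap<>();
        t.put("id", id);
        t.put("price", price);
        return t;
    }

    // Reads a stream to the end and collects its tuples.
    static List<Map<String, Object>> drain(SketchStream stream) {
        List<Map<String, Object>> out = new ArrayList<>();
        Map<String, Object> t;
        while ((t = stream.read()) != null) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        SketchStream source = new ListStream(
                Arrays.asList(tuple("a", 1), tuple("a", 2), tuple("b", 3)));
        // Two tuples survive: the first "a" and the single "b".
        System.out.println(drain(new UniqueSketch(source, "id")).size()); // prints 2
    }
}
```

      Because a decorator exposes the same read() contract as the stream it wraps, decorators can be stacked in any order that respects their sort requirements.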

      Going parallel with the ParallelStream and "Worker Collections"

      The io package also contains the ParallelStream, which wraps a TupleStream and sends it to N worker nodes. The workers are chosen from a SolrCloud collection. These "Worker Collections" don't have to hold any data; they can be used solely to execute TupleStreams.

      The StreamHandler

      The Worker nodes have a new RequestHandler called the StreamHandler. The ParallelStream serializes a TupleStream, before it is opened, and sends it to the StreamHandler on the Worker Nodes.
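
      The idea of shipping a stream before it is opened can be sketched as a round trip through a text payload. The description above does not state the wire format, so this sketch assumes plain Java serialization plus Base64; StreamDefinition and StreamShippingSketch are hypothetical names, not part of the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical description of an unopened stream: configuration only, no live connection.
class StreamDefinition implements Serializable {
    private static final long serialVersionUID = 1L;

    final String collection;
    final String query;
    final String sort;
    // Connection state exists only after open(), so it is never shipped.
    transient boolean opened;

    StreamDefinition(String collection, String query, String sort) {
        this.collection = collection;
        this.query = query;
        this.sort = sort;
    }

    void open() {
        opened = true;
    }
}

class StreamShippingSketch {
    // Sender side: turn the unopened definition into a text payload (assumed format).
    static String encode(StreamDefinition def) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(def);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Worker side: rebuild the definition; the worker would then open and iterate it.
    static StreamDefinition decode(String payload) {
        byte[] raw = Base64.getDecoder().decode(payload);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (StreamDefinition) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StreamDefinition sent = new StreamDefinition("collection1", "*:*", "id asc");
        StreamDefinition received = decode(encode(sent));
        received.open(); // opened only after arriving on the worker
        System.out.println(received.collection + " " + received.sort);
    }
}
```

      Serializing before open() keeps the payload to configuration only, which is why the transient field in the sketch arrives in its default, unopened state.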

      The StreamHandler on each Worker node deserializes the TupleStream, opens the stream, iterates the tuples and streams them back to the ParallelStream. The ParallelStream performs the final merge of Metrics and can be wrapped by other Streams to handle the final merged TupleStream.
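
      Merging several already-sorted streams into one sorted stream, as happens when results come back from multiple shards or workers, is commonly done with a priority queue. The sketch below shows that generic technique; it is an assumption that the patch merges this way, and SortedMergeSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class SortedMergeSketch {
    // One queue entry per source: its current head value plus the iterator it came from.
    private static final class Head<T> {
        final T value;
        final Iterator<T> source;

        Head(T value, Iterator<T> source) {
            this.value = value;
            this.source = source;
        }
    }

    // Merges sources that are each sorted by 'order' into a single sorted list.
    static <T> List<T> merge(List<List<T>> sortedSources, Comparator<T> order) {
        PriorityQueue<Head<T>> queue =
                new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        for (List<T> source : sortedSources) {
            Iterator<T> it = source.iterator();
            if (it.hasNext()) {
                queue.add(new Head<>(it.next(), it));
            }
        }
        List<T> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            // The smallest head across all sources is the next output value.
            Head<T> head = queue.poll();
            merged.add(head.value);
            if (head.source.hasNext()) {
                queue.add(new Head<>(head.source.next(), head.source));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> workers = Arrays.asList(
                Arrays.asList(1, 4, 7),
                Arrays.asList(2, 5),
                Arrays.asList(3, 6, 8));
        System.out.println(merge(workers, Comparator.<Integer>naturalOrder()));
        // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

      The queue never holds more than one entry per source, so memory use depends on the number of workers rather than on the size of the result set.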

      Sorting and Partitioning search results (Shuffling)

      Each Worker node is shuffled 1/N of the document results. There is a "partitionKeys" parameter that can be included with each TupleStream to ensure that Tuples with the same partitionKeys are shuffled to the same Worker. The actual partitioning is done with a filter query using the HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in the filter cache, which provides extremely high performance hash partitioning.
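
      The partitioning rule can be sketched as a hash of the key value modulo the number of workers. This is an illustration only: String.hashCode() is a stand-in for whatever hash the HashQParserPlugin uses, and HashPartitionSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HashPartitionSketch {
    // Maps a partition key value to one of numWorkers workers.
    // String.hashCode() is a stand-in; the plugin's actual hash is not described here.
    static int workerFor(String keyValue, int numWorkers) {
        // floorMod keeps the result in [0, numWorkers) even for negative hash codes.
        return Math.floorMod(keyValue.hashCode(), numWorkers);
    }

    // The slice of key values a single worker would receive, in original order.
    static List<String> sliceFor(List<String> keyValues, int workerId, int numWorkers) {
        List<String> slice = new ArrayList<>();
        for (String keyValue : keyValues) {
            if (workerFor(keyValue, numWorkers) == workerId) {
                slice.add(keyValue);
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b", "d");
        int numWorkers = 3;
        for (int worker = 0; worker < numWorkers; worker++) {
            System.out.println("worker " + worker + " -> " + sliceFor(keys, worker, numWorkers));
        }
    }
}
```

      Every occurrence of a key value lands in the same slice, and the slices are disjoint, which is what lets each worker aggregate its keys without coordinating with the others.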

      Many of the stream transformations rely on the sort order of the TupleStreams (GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc.). To accommodate this, the search results can be sorted by specific keys. The "/export" handler can be used to sort entire result sets efficiently.

      By specifying the sort order of the results and the partition keys, documents will be sorted and partitioned inside of the search engine. So when the tuples hit the network they are already sorted, partitioned and headed directly to the correct worker node.
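
      Why sort order matters for aggregation can be seen in a small rollup sketch: when the input is sorted on the grouping key, a worker only needs to hold the current group in memory and can emit a result each time the key changes. SortedRollupSketch is a hypothetical name, and plain strings stand in for tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SortedRollupSketch {
    // Counts items per key over a key-sorted input, tracking only the current group.
    // Results are collected in a list here; a real stream would emit them downstream instead.
    static List<String> countPerKey(Iterator<String> sortedKeys) {
        List<String> results = new ArrayList<>();
        String current = null;
        int count = 0;
        while (sortedKeys.hasNext()) {
            String key = sortedKeys.next();
            if (count > 0 && !key.equals(current)) {
                // Key changed: the previous group is complete and can be emitted.
                results.add(current + "=" + count);
                count = 0;
            }
            current = key;
            count++;
        }
        if (count > 0) {
            results.add(current + "=" + count);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList("a", "a", "b", "c", "c", "c");
        System.out.println(countPerKey(sorted.iterator())); // prints [a=2, b=1, c=3]
    }
}
```

      On unsorted input the same code would emit a key more than once, which is why the sort and the partition keys need to agree with the grouping key.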

      Extending The Framework

      To extend the framework you create new TupleStream Decorators that gather custom metrics or perform custom stream transformations.
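
      A custom metric-gathering decorator might look roughly like the sketch below: it passes tuples through unchanged while accumulating a sum of one field. TupleSource, SumMetricDecorator and CustomDecoratorDemo are hypothetical names with a simplified contract (null marks end of stream); the real TupleStream base class has its own methods that are not reproduced here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical minimal stream contract for this sketch; not the real TupleStream API.
interface TupleSource {
    // Returns the next tuple, or null at end of stream.
    Map<String, Object> next();
}

// Custom decorator: passes tuples through unchanged while summing one numeric field.
class SumMetricDecorator implements TupleSource {
    private final TupleSource inner;
    private final String field;
    private double sum;

    SumMetricDecorator(TupleSource inner, String field) {
        this.inner = inner;
        this.field = field;
    }

    public Map<String, Object> next() {
        Map<String, Object> tuple = inner.next();
        if (tuple != null) {
            Object value = tuple.get(field);
            if (value instanceof Number) {
                sum += ((Number) value).doubleValue();
            }
        }
        return tuple;
    }

    // The metric gathered so far; complete once next() has returned null.
    double sum() {
        return sum;
    }
}

class CustomDecoratorDemo {
    static Map<String, Object> priced(double price) {
        Map<String, Object> tuple = new HashMap<>();
        tuple.put("price", price);
        return tuple;
    }

    // Adapts a list of tuples to the TupleSource contract.
    static TupleSource fromList(List<Map<String, Object>> tuples) {
        Iterator<Map<String, Object>> it = tuples.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        SumMetricDecorator summed =
                new SumMetricDecorator(fromList(Arrays.asList(priced(1.5), priced(2.5))), "price");
        while (summed.next() != null) {
            // Tuples could be processed or forwarded here.
        }
        System.out.println(summed.sum()); // prints 4.0
    }
}
```

      Keeping the metric inside the decorator means the wrapped stream and any outer decorators need no changes to support it.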

      1. SOLR-7082.patch
        258 kB
        Joel Bernstein
      2. SOLR-7082.patch
        256 kB
        Joel Bernstein
      3. SOLR-7082.patch
        253 kB
        Joel Bernstein
      4. SOLR-7082.patch
        251 kB
        Joel Bernstein
      5. SOLR-7082.patch
        251 kB
        Joel Bernstein

          Activity

          Joel Bernstein added a comment -

          The initial patch includes a fully operational parallel streaming framework with tests.

          It's a fairly large patch so I'll be updating this ticket with details about the design and code.

          Joel Bernstein added a comment -

          Works with latest updates to Trunk

          Otis Gospodnetic added a comment -

          This looks really nice, Joel. 2 questions:

          • this looks a lot like ES aggregations. Have you maybe made any comparisons in terms of speed or memory footprint? (ES aggregations love heap)
          • is this all going to land in Solr or will some of it remain in Heliosearch?
          Joel Bernstein added a comment -

          Hi Otis,

          Sorry about the slow response, just got back from vacation and still catching up. I'll be writing more about how Streaming aggregation works this week. Here are some thoughts on your questions:

          1) This ticket is focused on providing fast, streaming Map/Reduce-like functionality. Streams can be sorted and partitioned strategically to minimize the amount of memory needed to perform aggregations and transformations. It should be fairly responsive because it pushes most of the work (record selection, sorting, partitioning) into the search engine. So records go straight from the search engine to the correct worker node to be reduced. These techniques won't be as fast as faceting, but they will support a very wide range of use cases.

          2) I'm aiming to get this into Solr trunk soon, with an eye towards having it ready to go for Solr 5.1.

          Otis Gospodnetic added a comment -

          Thanks Joel. Re 1) – but conceptually and functionally speaking, would you say this is more or less the same as ES aggregations?

          Joel Bernstein added a comment -

          I believe this is more closely comparable to technologies that shuffle, like Map/Reduce.

          Yonik Seeley added a comment -

          but conceptually and functionally speaking, would you say this is more or less the same as ES aggregations?

          I don't think so. The Heliosearch JSON Facet API looks a lot more like ES aggregations. Streaming aggregation is a more general purpose distributed computation framework.

          Joel Bernstein added a comment (edited) -

          Patch with all tests passing.

          Joel Bernstein added a comment -

          Latest work includes more tests and convenience methods on the Tuple class. Also, the HashQParserPlugin doesn't get the hash code directly from the BytesRef; it converts to a CharsRef first using StrField.indexedToReadable.

          Joel Bernstein added a comment -

          New patch with precommit passing

          ASF subversion and git services added a comment -

          Commit 1665391 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1665391 ]

          SOLR-7082: Streaming Aggregation for SolrCloud

          Ramkumar Aiyengar added a comment -

          Haven't looked at the patch in great detail, but looks like the SolrJ side could use a few tests? There's a new package there but with no tests?

          Joel Bernstein added a comment -

          The initial set of tests is here:
          https://svn.apache.org/viewvc/lucene/dev/trunk/solr/solrj/src/test/org/apache/solr/client/solrj/io/StreamingTest.java

          We can break these out to smaller files also.

          Ramkumar Aiyengar added a comment -

          Missed that, thanks Joel..

          ASF subversion and git services added a comment -

          Commit 1669164 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1669164 ]

          SOLR-7082: Streaming Aggregation for SolrCloud

          Joel Bernstein added a comment -

          In the latest commit a few stream implementations are removed to focus on a core set of foundational streams for the initial release.

          ASF subversion and git services added a comment -

          Commit 1669212 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669212 ]

          SOLR-7082 SOLR-7224 SOLR-7225: Streaming Aggregation for SolrCloud

          ASF subversion and git services added a comment -

          Commit 1669343 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1669343 ]

          SOLR-7082: update CHANGES.txt

          ASF subversion and git services added a comment -

          Commit 1669344 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669344 ]

          SOLR-7082: update CHANGES.txt

          ASF subversion and git services added a comment -

          Commit 1669554 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1669554 ]

          SOLR-7082: Editing Javadoc

          ASF subversion and git services added a comment -

          Commit 1669557 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669557 ]

          SOLR-7082: Editing Javadoc

          ASF subversion and git services added a comment -

          Commit 1670176 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1670176 ]

          SOLR-7082: Syntactic sugar for metric gathering

          ASF subversion and git services added a comment -

          Commit 1670181 from Joel Bernstein in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1670181 ]

          SOLR-7082: Syntactic sugar for metric gathering


            People

            • Assignee:
              Unassigned
              Reporter:
              Joel Bernstein
            • Votes:
              4
              Watchers:
              14
