Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8487

Add CommitStream to Streaming API and Streaming Expressions

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 6.3
    • Fix Version/s: 6.3
    • Component/s: None
    • Labels:
      None

      Description

      (Paraphrased from Joel's idea/suggestions in the comments of SOLR-7535).

      With SOLR-7535, users can now index documents/tuples using an UpdateStream. However, there's no way currently using the Streaming API to force a commit on the collection that received these updates.

      The purpose of this ticket is to add a CommitStream, which can be used to trigger commit(s) on a given collection.

      The proposed usage/behavior would look a little bit like:
      commit(collection, parallel(update(search()))

      Note that...
      1.) CommitStream has a positional collection parameter, to indicate which collection to commit on. (Alternatively, it could recurse through children() nodes until it finds the UpdateStream, and then retrieve the collection from the UpdateStream).
      2.) CommitStream forwards all tuples received by an underlying, wrapped stream.
      3.) CommitStream commits when the underlying stream emits its EOF tuple. (Alternatively, it could commit every X tuples, based on a parameter).

      1. SOLR-8487.patch
        33 kB
        Dennis Gove
      2. SOLR-8487.patch
        13 kB
        Dennis Gove

        Issue Links

          Activity

          Hide
          joel.bernstein Joel Bernstein added a comment -

          Closed this issue by mistake and then re-opened.

          Show
          joel.bernstein Joel Bernstein added a comment - Closed this issue by mistake and then re-opened.
          Hide
          dpgove Dennis Gove added a comment -

          I'm working on this. Hoping to have a first draft in a day or two.

          Show
          dpgove Dennis Gove added a comment - I'm working on this. Hoping to have a first draft in a day or two.
          Hide
          dpgove Dennis Gove added a comment -

          No tests (will add soon).
          Doesn't add any summary tuples - not sure if these are necessary.

          Supports setting batchSize, waitFlush, waitSearcher, and softCommit settings.

          Show
          dpgove Dennis Gove added a comment - No tests (will add soon). Doesn't add any summary tuples - not sure if these are necessary. Supports setting batchSize, waitFlush, waitSearcher, and softCommit settings.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Looks good!

          Wondering if we should tie this more closely with the update stream. A couple of possibilities:

          1) The update stream returns a tuple with each batch, which includes the batch size. Should we use that to calculate when to commit?
          2) We could have the update stream add the collection to its outgoing tuples and then use that, instead of specifying the collection as a parameter to the commit function.

          Show
          joel.bernstein Joel Bernstein added a comment - Looks good! Wondering if we should tie this more closely with the update stream. A couple of possibilities: 1) The update stream returns a tuple with each batch, which includes the batch size. Should we use that to calculate when to commit? 2) We could have the update stream add the collection to its outgoing tuples and then use that, instead of specifying the collection as a parameter to the commit function.
          Hide
          dpgove Dennis Gove added a comment -

          I'm not a huge fan of tying two streams together like that (ie, one is dependent on the other). If we wanted to tie update and commit more closely I'd rather see the commit as an operation inside the UpdateStream like

          update(foo, stream(...), batchSize=#, commit(nBatches/batchSize/time))
          
          Show
          dpgove Dennis Gove added a comment - I'm not a huge fan of tying two streams together like that (ie, one is dependent on the other). If we wanted to tie update and commit more closely I'd rather see the commit as an operation inside the UpdateStream like update(foo, stream(...), batchSize=#, commit(nBatches/batchSize/time))
          Hide
          risdenk Kevin Risden added a comment -

          I like the commit() outside update(). This makes commit look more like a count or something similar. One thing that may be useful is amount of time passed (I know this makes it harder):

          Lets say the underlying stream is a daemon that happens every 30 seconds. If you set the batch size to 1 that would work but maybe you want to commit every 1000 tuples or every 5 minutes.

          I guess at that point you could instead have Solr doing the auto commit. Just a thought.

          Show
          risdenk Kevin Risden added a comment - I like the commit() outside update(). This makes commit look more like a count or something similar. One thing that may be useful is amount of time passed (I know this makes it harder): Lets say the underlying stream is a daemon that happens every 30 seconds. If you set the batch size to 1 that would work but maybe you want to commit every 1000 tuples or every 5 minutes. I guess at that point you could instead have Solr doing the auto commit. Just a thought.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          The original patch had the update function do it's own commits. But it was taken out because if an expression is doing parallel updates there would be multiple workers committing at the same time. So the commit function is needed to support this scenario:

          commit(parallel(update(search())))
          

          So I think we're left with the choice of using the data in the tuples returned by the update stream, or leaving it decoupled.

          Show
          joel.bernstein Joel Bernstein added a comment - The original patch had the update function do it's own commits. But it was taken out because if an expression is doing parallel updates there would be multiple workers committing at the same time. So the commit function is needed to support this scenario: commit(parallel(update(search()))) So I think we're left with the choice of using the data in the tuples returned by the update stream, or leaving it decoupled.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I'll be on vacation next week, so I don't want to hold things up, if you're feeling good about your current approach.

          Show
          joel.bernstein Joel Bernstein added a comment - I'll be on vacation next week, so I don't want to hold things up, if you're feeling good about your current approach.
          Hide
          dpgove Dennis Gove added a comment -

          I just realized I had a fundamental misunderstanding of the UpdateStream. I thought it was returning all source tuples on a call to read() but that is not the case. It is instead sending a batch of source tuples into the destination collection, dropping them, and then returning a summary tuple.

          This will change some of the implementation details of the CommitStream.

          Show
          dpgove Dennis Gove added a comment - I just realized I had a fundamental misunderstanding of the UpdateStream. I thought it was returning all source tuples on a call to read() but that is not the case. It is instead sending a batch of source tuples into the destination collection, dropping them, and then returning a summary tuple. This will change some of the implementation details of the CommitStream.
          Hide
          dpgove Dennis Gove added a comment -

          Properly handles the UpdateStream summary tuple and takes that into account when sizing batches. The CommitStream will not swallow any tuples (ie, all tuples it reads, whether summary or not, will be returned to any wrapping stream).

          Adds commit, parallelCommit, and daemonParallelCommit tests.

          I believe this is ready to go.

          Show
          dpgove Dennis Gove added a comment - Properly handles the UpdateStream summary tuple and takes that into account when sizing batches. The CommitStream will not swallow any tuples (ie, all tuples it reads, whether summary or not, will be returned to any wrapping stream). Adds commit, parallelCommit, and daemonParallelCommit tests. I believe this is ready to go.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          +1 looks good

          Show
          joel.bernstein Joel Bernstein added a comment - +1 looks good
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 6365920a0e9ed3bf0b13b90955cd73535d495f9a in lucene-solr's branch refs/heads/master from Dennis Gove
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6365920 ]

          SOLR-8487: Adds CommitStream to support sending commits to a collection being updated

          Show
          jira-bot ASF subversion and git services added a comment - Commit 6365920a0e9ed3bf0b13b90955cd73535d495f9a in lucene-solr's branch refs/heads/master from Dennis Gove [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6365920 ] SOLR-8487 : Adds CommitStream to support sending commits to a collection being updated
          Show
          dpgove Dennis Gove added a comment - Added a section in the reference guide - https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-commit
          Hide
          dpgove Dennis Gove added a comment -

          Should've been resolved, not closed.

          Show
          dpgove Dennis Gove added a comment - Should've been resolved, not closed.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit edde433594c104668137350d9db640180b04f648 in lucene-solr's branch refs/heads/branch_6x from Dennis Gove
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=edde433 ]

          SOLR-8487: Adds CommitStream to support sending commits to a collection being updated

          Show
          jira-bot ASF subversion and git services added a comment - Commit edde433594c104668137350d9db640180b04f648 in lucene-solr's branch refs/heads/branch_6x from Dennis Gove [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=edde433 ] SOLR-8487 : Adds CommitStream to support sending commits to a collection being updated
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

            People

            • Assignee:
              dpgove Dennis Gove
              Reporter:
              gerlowskija Jason Gerlowski
            • Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development