Solr
  1. Solr
  2. SOLR-126

Auto-commit documents after time interval

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: update
    • Labels:
      None

      Description

      If an index is getting updated from multiple sources and needs to add documents reasonably quickly, there should be a good solr side mechanism to help prevent the client from spawning multiple overlapping <commit/> commands.

      My specific use case is sending each document to solr every time hibernate saves an object (see SOLR-20). This happens from multiple machines simultaneously. I'd like solr to make sure the documents are committed within a second.

      1. AutoCommit.patch
        14 kB
        Ryan McKinley
      2. AutocommitingUpdateRequestHandler.patch
        14 kB
        Ryan McKinley
      3. SOLR-126-ClosePending.patch
        0.6 kB
        Ryan McKinley

        Activity

        Hide
        Ryan McKinley added a comment -

        I'm not sure the best design for this problem, as a quick way to get something working NOW without modifying UpdateHandler, I made a custom UpdateHandler (extended from XmlUpdateRequestHandler) that starts a timer and executes <commit/> after a configured time.

        • - - - - -

        The auto-commit logic should probably be in the UpdateHandler along with the exiting CommitTracker. The existing CommitTracker lets you specify a number of documents it should keep before indexing.

        For a time limit, I think we need to run a thread - I know that is bad for webapps, but I'm not sure there is any other option. In this example, I used:

        final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

        then:

        scheduler.schedule( this, interval, TimeUnit.MILLISECONDS );

        Is this a reasonable approach? Is there any other threading/timing mechanism to consider?
        Should this be applied directly to the UpdateHandler and configured along with

        <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>10000</maxTime> <!-- milliseconds -->
        </autoCommit>

        or should it be a custom request handler (as implemented in this attachment)?

        Show
        Ryan McKinley added a comment - I'm not sure the best design for this problem, as a quick way to get something working NOW without modifying UpdateHandler, I made a custom UpdateHandler (extended from XmlUpdateRequestHandler) that starts a timer and executes <commit/> after a configured time. - - - - - The auto-commit logic should probably be in the UpdateHandler along with the exiting CommitTracker. The existing CommitTracker lets you specify a number of documents it should keep before indexing. For a time limit, I think we need to run a thread - I know that is bad for webapps, but I'm not sure there is any other option. In this example, I used: final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1); then: scheduler.schedule( this, interval, TimeUnit.MILLISECONDS ); Is this a reasonable approach? Is there any other threading/timing mechanism to consider? Should this be applied directly to the UpdateHandler and configured along with <autoCommit> <maxDocs>10000</maxDocs> <maxTime>10000</maxTime> <!-- milliseconds --> </autoCommit> or should it be a custom request handler (as implemented in this attachment)?
        Hide
        Mike Klaas added a comment -

        A few high-level comments:

        • commitTer/commitTing. It seems pedantic to gripe about spelling, but making it right really helps in finding things at a later date.
        • "autoCommitAfter" doesn't particularly document the semantics, which are "autocommit interval upper bound". The reason that this is important is that we'll want to implement "autocommit interval lower bound" at some point (and/or "autocommit after idle time").
        • I think this would fit pretty easily into the existing committracker (just gut the time-based things that are already there). Unless there is any reason why this should be limited to a custom handler?
        Show
        Mike Klaas added a comment - A few high-level comments: commitTer/commitTing. It seems pedantic to gripe about spelling, but making it right really helps in finding things at a later date. "autoCommitAfter" doesn't particularly document the semantics, which are "autocommit interval upper bound". The reason that this is important is that we'll want to implement "autocommit interval lower bound" at some point (and/or "autocommit after idle time"). I think this would fit pretty easily into the existing committracker (just gut the time-based things that are already there). Unless there is any reason why this should be limited to a custom handler?
        Hide
        Ryan McKinley added a comment -

        >
        > A few high-level comments:
        > - commitTer/commitTing. It seems pedantic to gripe about spelling,...

        yes, thanks. I'm notoriously bad anywhere spellcheck can't help (and often where it does!) Please fix or point out anything, it does make it much easier to find in the future.

        > - "autoCommitAfter" doesn't particularly document the semantics,...

        perhaps:

        <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>10000</maxTime>
        <!--
        In the future we may add:
        <minTime>10000</minTime>
        <idleTime>10000</idleTime>
        -->
        </autoCommit>

        > > Unless there is any reason why this should be limited to a custom handler?
        >

        Yes, it should probably go in the UpdateHandler

        Show
        Ryan McKinley added a comment - > > A few high-level comments: > - commitTer/commitTing. It seems pedantic to gripe about spelling,... yes, thanks. I'm notoriously bad anywhere spellcheck can't help (and often where it does!) Please fix or point out anything, it does make it much easier to find in the future. > - "autoCommitAfter" doesn't particularly document the semantics,... perhaps: <autoCommit> <maxDocs>10000</maxDocs> <maxTime>10000</maxTime> <!-- In the future we may add: <minTime>10000</minTime> <idleTime>10000</idleTime> --> </autoCommit> > > Unless there is any reason why this should be limited to a custom handler? > Yes, it should probably go in the UpdateHandler
        Hide
        Yonik Seeley added a comment -

        Thanks for tackling this Ryan, this is much needed functionality (planned since inception but never coded!)
        This should definitely go into the existing committracker/updatehandler.

        Timer/TimerTask for the timers? Haven't used them myself... are they appropriate?

        Some future issues (i.e. a different JIRA issue)... one setting may make perfect sense when doing incremental updates, to enforce a minimum "freshness" and take the burden of commit off of the clients. When building an index from scratch, you want to do it as quickly as possible, and not do commits all the time (or even expose a partial index to searches).
        Perhaps something like &autocommit=false in the update params?
        It might even be nice to be able to change the mergeFactor for a complete index build vs incremental updates.

        Show
        Yonik Seeley added a comment - Thanks for tackling this Ryan, this is much needed functionality (planned since inception but never coded!) This should definitely go into the existing committracker/updatehandler. Timer/TimerTask for the timers? Haven't used them myself... are they appropriate? Some future issues (i.e. a different JIRA issue)... one setting may make perfect sense when doing incremental updates, to enforce a minimum "freshness" and take the burden of commit off of the clients. When building an index from scratch, you want to do it as quickly as possible, and not do commits all the time (or even expose a partial index to searches). Perhaps something like &autocommit=false in the update params? It might even be nice to be able to change the mergeFactor for a complete index build vs incremental updates.
        Hide
        Ryan McKinley added a comment -

        >
        > Timer/TimerTask for the timers? Haven't used them myself... are they appropriate?
        >

        I'm no expert on this, but it looks like Java5 added:

        http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ScheduledExecutorService.html

        it does everything Timer/TimerTask did - I'm not sure (and can't seem to find) if the difference is just syntax or something more profound.

        Show
        Ryan McKinley added a comment - > > Timer/TimerTask for the timers? Haven't used them myself... are they appropriate? > I'm no expert on this, but it looks like Java5 added: http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ScheduledExecutorService.html it does everything Timer/TimerTask did - I'm not sure (and can't seem to find) if the difference is just syntax or something more profound.
        Hide
        Yonik Seeley added a comment -

        Another future idea along the lines of &autocommit=true/false, is supporting clients with different timeliness needs.
        For example, a client could send an update request, telling solr that a commit should be done within 5 minutes.
        &maxwait=300 or something like that.

        Show
        Yonik Seeley added a comment - Another future idea along the lines of &autocommit=true/false, is supporting clients with different timeliness needs. For example, a client could send an update request, telling solr that a commit should be done within 5 minutes. &maxwait=300 or something like that.
        Hide
        Yonik Seeley added a comment -

        ScheduledExecutorService: Yeah, I'd start with that by default... the Java5 concurrency stuff is much better in general.

        Show
        Yonik Seeley added a comment - ScheduledExecutorService: Yeah, I'd start with that by default... the Java5 concurrency stuff is much better in general.
        Hide
        Ryan McKinley added a comment -

        I just posted AutoCommit.patch

        This patch modifies DirectUpdateHandler2.CommentTracker to automatically commit a after a certain period. As written, It should never start multiple commits at the same time.

        I think this is ready to go.

        Show
        Ryan McKinley added a comment - I just posted AutoCommit.patch This patch modifies DirectUpdateHandler2.CommentTracker to automatically commit a after a certain period. As written, It should never start multiple commits at the same time. I think this is ready to go.
        Hide
        Mike Klaas added a comment -

        Ryan: looking good! A few comments:

        • You notify the tracker that the document is added before actually adding the document. This is okay-commit() cannot run until addDoc() is complete-but it does mean that the autocommit maxTime is measured from the start of the document being added until after it has been processed. I'm not sure it matters in practice.
        • similarly, didCommit() is invoked before the searcher is warmed. Autocommits will never occur simulatneously (as you note; due to synchronization of run()), but they could be invoked continually if warming takes a long time.
        • If 250ms is a small enough time to not care about, does it make sense to force the user to specify the time in milliseconds?

        These are all relatively minor things--if no one else has any thoughts this can probably be committed soon.

        Show
        Mike Klaas added a comment - Ryan: looking good! A few comments: You notify the tracker that the document is added before actually adding the document. This is okay- commit() cannot run until addDoc() is complete -but it does mean that the autocommit maxTime is measured from the start of the document being added until after it has been processed. I'm not sure it matters in practice. similarly, didCommit() is invoked before the searcher is warmed. Autocommits will never occur simulatneously (as you note; due to synchronization of run()), but they could be invoked continually if warming takes a long time. If 250ms is a small enough time to not care about, does it make sense to force the user to specify the time in milliseconds? These are all relatively minor things--if no one else has any thoughts this can probably be committed soon.
        Hide
        Ryan McKinley added a comment -

        >
        > - You notify the tracker that the document is added before actually adding the document. This is okay-commit() cannot run until addDoc() is complete-but it does mean that the autocommit maxTime is measured from the start of the document being added until after it has been processed. I'm not sure it matters in practice.
        >

        I'm looking at it from the client perspective. The timer should start as soon as close to the request time as possible.

        > - similarly, didCommit() is invoked before the searcher is warmed. Autocommits will never occur simulatneously (as you note; due to synchronization of run()), but they could be invoked continually if warming takes a long time.
        >

        I just left at were it was in the existing code. I think it makes sense because the searcher has the proper data at that point - a second commit wont change the results.

        Also, it will not start a new autocommit until the first has warmed the searcher anyway:

        CommitUpdateCommand command = new CommitUpdateCommand( false );
        command.waitFlush = true;
        command.waitSearcher = true;

        > - If 250ms is a small enough time to not care about, does it make sense to force the user to specify the time in milliseconds?
        >

        This is trying to avoid is the case where 100 documents are added at the same time with maxDocs=10. We don't want to commit 10 times, so it waits 1/4 sec. (could be shorter or longer in my opinion)

        If anyone is worried about the timing, they should use maxTime, not maxDocs

        Show
        Ryan McKinley added a comment - > > - You notify the tracker that the document is added before actually adding the document. This is okay- commit() cannot run until addDoc() is complete -but it does mean that the autocommit maxTime is measured from the start of the document being added until after it has been processed. I'm not sure it matters in practice. > I'm looking at it from the client perspective. The timer should start as soon as close to the request time as possible. > - similarly, didCommit() is invoked before the searcher is warmed. Autocommits will never occur simulatneously (as you note; due to synchronization of run()), but they could be invoked continually if warming takes a long time. > I just left at were it was in the existing code. I think it makes sense because the searcher has the proper data at that point - a second commit wont change the results. Also, it will not start a new autocommit until the first has warmed the searcher anyway: CommitUpdateCommand command = new CommitUpdateCommand( false ); command.waitFlush = true; command.waitSearcher = true; > - If 250ms is a small enough time to not care about, does it make sense to force the user to specify the time in milliseconds? > This is trying to avoid is the case where 100 documents are added at the same time with maxDocs=10. We don't want to commit 10 times, so it waits 1/4 sec. (could be shorter or longer in my opinion) If anyone is worried about the timing, they should use maxTime, not maxDocs
        Hide
        Mike Klaas added a comment -

        Committed in r502328. Thanks!

        Ryan, the last comment of mine was about the units of time that <maxTime> was specified in--the old version had the time specified in seconds and the parameter name was <maxSec>. I committed it the way it stands in the patch; if anyone has a strong opinion, this can be changed before being closed.

        Show
        Mike Klaas added a comment - Committed in r502328. Thanks! Ryan, the last comment of mine was about the units of time that <maxTime> was specified in--the old version had the time specified in seconds and the parameter name was <maxSec>. I committed it the way it stands in the patch; if anyone has a strong opinion, this can be changed before being closed.
        Hide
        Ryan McKinley added a comment -

        It looks like the test solrconfi.xml got commited with some mangled xml comments:

        <!-- autocommit pending docs if certain criteria are met
        <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>3600000</maxTime> <!-- one hour in milliseconds -->
        </autoCommit>
        -->

        so it fails the tests

        it needs to be:

        <!-- autocommit pending docs if certain criteria are met -->
        <autoCommit>
        <maxDocs>10000</maxDocs>
        <maxTime>3600000</maxTime> <!-- one hour in milliseconds -->
        </autoCommit>

        thanks!
        ryan

        Show
        Ryan McKinley added a comment - It looks like the test solrconfi.xml got commited with some mangled xml comments: <!-- autocommit pending docs if certain criteria are met <autoCommit> <maxDocs>10000</maxDocs> <maxTime>3600000</maxTime> <!-- one hour in milliseconds --> </autoCommit> --> so it fails the tests it needs to be: <!-- autocommit pending docs if certain criteria are met --> <autoCommit> <maxDocs>10000</maxDocs> <maxTime>3600000</maxTime> <!-- one hour in milliseconds --> </autoCommit> thanks! ryan
        Hide
        Mike Klaas added a comment -

        extra patch committed in r505114

        Show
        Mike Klaas added a comment - extra patch committed in r505114

          People

          • Assignee:
            Unassigned
            Reporter:
            Ryan McKinley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development