SOLR-2701

Expose IndexWriter.commit(Map<String,String> commitUserData) to Solr

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: update
    • Labels:

      Description

      At the moment, there is no feature that enables associating user information with a commit point.

      Lucene supports this, and it should be exposed to Solr as well, probably via a beforeCommit listener (analogous to prepareCommit in Lucene).

      The most likely home for this Map is UpdateHandler.

      An example use case would be atomic tracking of sequence numbers or timestamps for incremental updates.

        Activity

        Eks Dev added a comment -

        One hook for users to update the content of this map would be to add beforeCommit callbacks. This looks simple enough in the UpdateHandler2.commit() call, but there is a catch:

        We need to invoke listeners before we close() for implicit commits. Because the IndexWriter is ref-counted (decref-ed), the question is whether we want to run beforeCommit listeners even if the IW does not really get closed (the user updates the map more often than needed).

        IMO, invoking callbacks a little more often than needed should not be a problem.

        Another place where we have an "implicit commit" is newIndexWriter();
        here we only need to add IndexWriterProvider.isIndexWriterNull() to check whether we need callbacks.

        A solution for close() would also be simple: add IndexWriterProvider.isIndexGoingToCloseOnNextDecref() and check it before invoking decref() to condition the callbacks.

        Any better solution? Are callbacks a good approach to providing user hooks for this?
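
        To make the callback idea concrete, here is a minimal plain-Java sketch. It is not Solr or Lucene code: CommitHookSketch, its listener list and the ref-count field are hypothetical stand-ins, and willCloseOnNextDecref() only models the isIndexGoingToCloseOnNextDecref() check proposed above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for the update handler; none of these names are Solr API.
class CommitHookSketch {
    private final Map<String, String> commitUserData = new HashMap<>();
    private final List<Consumer<Map<String, String>>> beforeCommit = new ArrayList<>();
    private int writerRefCount = 1; // models the ref-counted IndexWriter
    Map<String, String> lastCommitted = new HashMap<>();

    void addBeforeCommitListener(Consumer<Map<String, String>> listener) {
        beforeCommit.add(listener);
    }

    // Explicit commit: listeners update the map, then a snapshot is "committed".
    void commit() {
        for (Consumer<Map<String, String>> listener : beforeCommit) {
            listener.accept(commitUserData);
        }
        lastCommitted = new HashMap<>(commitUserData);
    }

    // Models the proposed isIndexGoingToCloseOnNextDecref() check.
    boolean willCloseOnNextDecref() {
        return writerRefCount == 1;
    }

    void incref() {
        writerRefCount++;
    }

    // Implicit commit on close: callbacks run only when the last reference goes away.
    void decref() {
        if (willCloseOnNextDecref()) {
            commit();
        }
        writerRefCount--;
    }
}
```

        With this conditioning, a decref() that does not actually close the writer skips the listeners, which avoids the "more often than needed" invocations at the cost of one extra check.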

        -------
        Another approach is to get beforeCommit callbacks at the Lucene level and piggy-back on them for Solr callbacks.
        We would only need to change IndexWriter.commit(Map..) and close(), but commit is final...

        Notice: I am very rusty on the Solr/Lucene codebase, so any help would be appreciated. The last patch I made here was ages ago.

        Eks Dev added a comment -

        A rather simplistic approach: adding userCommitData to CommitUpdateCommand.

        So we at least have a vehicle to pass it to IndexWriter.

        No advanced machinery to make it available to non-expert users. At least it is not wrong to have it there?
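
        As a rough illustration of "a vehicle to pass it to IndexWriter", the following self-contained sketch uses hypothetical stand-ins (CommitCommandSketch, WriterStub) rather than the real CommitUpdateCommand and IndexWriter. Keeping the previous data when no map is supplied is an assumption of this sketch, not behavior taken from the patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a commit command carrying optional user data.
class CommitCommandSketch {
    Map<String, String> userCommitData; // null means "no user data supplied"
}

// Hypothetical stand-in for the writer that records data with each commit.
class WriterStub {
    private Map<String, String> committedUserData = Collections.emptyMap();

    void commit(CommitCommandSketch cmd) {
        if (cmd.userCommitData != null) {
            // Snapshot the map so later caller-side changes do not leak into the commit.
            committedUserData =
                Collections.unmodifiableMap(new HashMap<>(cmd.userCommitData));
        }
        // Assumption: with no user data, the previously committed data is kept.
    }

    Map<String, String> getCommittedUserData() {
        return committedUserData;
    }
}
```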

        Eclipse removed some unused imports from DUH2 as well

        Erick Erickson added a comment -

        I'd like to move this forward, so I'm soliciting a bit of advice about how to proceed.

        I'm interested here in getting this into SolrJ. It's not clear to me that this belongs in, say, an XML input file (and CSV and JSON and...), since we have a nice clean document add format and trying to put index metadata in there seems like a bad thing to do.

        Anyway, if we do go down the SolrJ route, it seems like SolrServer needs either two more commit methods that take a Map<String, String> or something like a new addUserData method; the latter seems cleaner.

        Then we'd have to do something with UpdateRequest to get the user data passed over to the Solr server and from there pass it on through to the writer.commit.

        Mostly, I'm looking for guidance on whether this is a reasonable approach or if it's wrong-headed from the start, in which case I'll take any suggestions gladly. I haven't started to code anything yet, so changes in the approach are really cheap.

        Eks Dev: Do you want to push this forward and/or work on it together?

        Eks Dev added a comment -

        @Erick, just go ahead and take it.
        I am not going to be working on this any time soon. At the moment I am using a quick'n'dirty patched trunk version (a moving target anyway) with an extended CommitCommand to pass the Map around (a sub-optimal approach, but it does the job for now).

        Some thoughts on it; maybe you'll find something useful:

        Take care: optimize and autoCommit do an implicit commit (from the user's perspective, there is no explicit transaction to commit where we could pass Map parameters). As a consequence, the Map<String, String> has to be kept alive somewhere (DUH2 looks like the best place for it). Of course, one needs to expose some user interfaces that enable map mutation and inquiry. This Map then becomes a cached holder of key-value pairs that a user can change, and Solr guarantees to commit it on implicit/explicit commit and to read it on reload/rollback.

        Rollback and restart: what happens to this map after a restart (core reload)? I would suggest populating it with the committed values, and on rollback as well.

        As a summary:

        • One thing is the low-level mechanics, and this is easy: all changes are local to DUH2, with one Map<String, String> and passing this instance to all commit commands you see there. Of course, it is reloaded on index reload/rollback.
        • Much harder (at least for me): designing a good user interface to maintain it,
          ... explicit changes via a request handler (admin-like operation)
          ... as a parameter of the commit command (nice)
          ... somehow hooking into the update chain elegantly (my primary use case!). I keep track of the max timestamp in this map (actually in an AtomicLong, just populating the Map on commit) to control incremental updates, but my use case is dumb easy to support with a patched CommitCommand as I have only explicit commits (this would not work with e.g. autoCommit; you would need a Map instance for it).
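
        The mechanics summarized in the list above can be sketched in plain Java. This is only a model under stated assumptions: CommitDataHolder, the "maxTimestamp" key and all method names are hypothetical, and the map returned by commit() stands in for whatever would be handed to the IndexWriter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a long-lived commit-data holder; not DirectUpdateHandler2 code.
class CommitDataHolder {
    private final Map<String, String> pending = new HashMap<>();
    private Map<String, String> committed = new HashMap<>();
    private final AtomicLong maxTimestamp = new AtomicLong(Long.MIN_VALUE);

    // Called from the update path for every document; cheap and thread-safe.
    void observeTimestamp(long ts) {
        maxTimestamp.accumulateAndGet(ts, Math::max);
    }

    synchronized void put(String key, String value) {
        pending.put(key, value);
    }

    synchronized String get(String key) {
        return pending.get(key);
    }

    // Shared by explicit and implicit commits (autoCommit, optimize).
    synchronized Map<String, String> commit() {
        long max = maxTimestamp.get();
        if (max != Long.MIN_VALUE) {
            pending.put("maxTimestamp", Long.toString(max)); // populate the map only on commit
        }
        committed = new HashMap<>(pending);
        return new HashMap<>(committed); // copy that would accompany the index commit
    }

    // Rollback or core reload: restore the last committed values.
    synchronized void reloadFromCommitted() {
        pending.clear();
        pending.putAll(committed);
        String ts = committed.get("maxTimestamp");
        maxTimestamp.set(ts == null ? Long.MIN_VALUE : Long.parseLong(ts));
    }
}
```

        Because the holder outlives any single commit command, an implicit commit can call commit() without the caller having to supply a map.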

        E.g. look at DIH: it uses internal counters and the file system to persist them for this, which could be much better served by Lucene commit guarantees...

        On another note, keeping real-time track (not committed values) of min/max values for user-defined fields would make sense for incremental update scenarios. I do not know if there is something in Lucene/Solr for it already, but this is a separate, if somewhat related, issue...

        Cheers,
        Eks

        Gregg Donovan added a comment -

        We (Etsy) are interested in this issue for the same use-case that Eks Dev mentions – passing around timestamps and other meta-data for use by incremental indexers. We currently write out and replicate custom property files for this – using commitUserData would be preferable.

        It seems like another use-case the commitUserData could be useful for is doing an empty commit that actually triggers replication, as the updated commitUserData will cause the segments file to be updated.

        For our purposes, we'd just be using CommitUpdateCommand and DUH2 as our interfaces for writing the commitUserData, but exposing commitUserData to the SolrJ/HTTP interfaces does seem like a nice feature. I wonder where it would be useful to expose reading the commitUserData via SolrJ/HTTP as right now you still need low-level code to extract the commitUserData from an IndexReader. Perhaps stats.jsp could expose each key in the commitUserData as a stat?

        Erick Erickson added a comment -

        I won't be getting to this any time soon

        Greg Bowyer added a comment -

        I gave this another attempt today, and went full bore on trying to find all the locations where userCommitData would need to be exposed to clients of the Solr API.

        There are a few questions in my mind about this:

        • The backwards compat for javabin is not obvious; do we want to bump the javabin version?
        • What should the exact behavior be around soft commits and autocommits?
        • Should previous index commits carry forward in solr for ease of use ?
        Yonik Seeley added a comment -

        Should previous index commits carry forward in solr for ease of use ?

        I haven't had a chance to check out the rest of the patch/issue, but for this specifically, what about a convention? Anything under the "persistent" key in the commit data is carried over indefinitely. Or if persistent is the norm, then we could reverse it and have a "transient" map that is not carried over.
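
        One way the convention could look, as a minimal sketch: since the commit data is a flat Map<String, String>, this example assumes a "persistent." key prefix to mark entries that carry over. The prefix, the class name and the method name are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the carry-over convention; not part of Solr.
class CommitDataCarryOver {
    // Assumed flat-key encoding of "anything under the persistent key".
    static final String PERSISTENT_PREFIX = "persistent.";

    // Builds the data for the next commit: persistent entries from the previous
    // commit are carried over, then the newly supplied entries are applied on top.
    static Map<String, String> nextCommitData(Map<String, String> previous,
                                              Map<String, String> supplied) {
        Map<String, String> next = new HashMap<>();
        for (Map.Entry<String, String> e : previous.entrySet()) {
            if (e.getKey().startsWith(PERSISTENT_PREFIX)) {
                next.put(e.getKey(), e.getValue());
            }
        }
        next.putAll(supplied);
        return next;
    }
}
```

        The reversed variant (persistent by default, with a "transient" marker) would only flip the filter condition.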

        Greg Bowyer added a comment -

        I haven't had a chance to check out the rest of the patch/issue, but for this specifically, what about a convention? Anything under the "persistent" key in the commit data is carried over indefinitely. Or if persistent is the norm, then we could reverse it and have a "transient" map that is not carried over.

        The persistent/transient map sounds like a good idea; I will take a look at how that can be implemented


          People

          • Assignee: Greg Bowyer
          • Reporter: Eks Dev
          • Votes: 1
          • Watchers: 7

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated: 8h
              Remaining: 8h
              Logged: Not Specified
