Lucene - Core
  1. Lucene - Core
  2. LUCENE-1382

Allow storing user data when IndexWriter.commit() is called

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from here:

      http://www.mail-archive.com/java-user@lucene.apache.org/msg22303.html

      The idea is to allow optionally passing an opaque String commitUserData to the IndexWriter.commit method. This String would be stored in the segments_N file, and would be retrievable by an IndexReader. Applications could then use this to assign meaning to each commit.

      It would be nice to get this done for 2.4, but I don't think we should hold the release for it.

      1. LUCENE-1382.patch
        12 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Attached patch. I think we should not rush this into 2.4, so, I'll wait to commit to trunk until after we've branched.

        Show
        Michael McCandless added a comment - Attached patch. I think we should not rush this into 2.4, so, I'll wait to commit to trunk until after we've branched.
        Hide
        Michael McCandless added a comment -

        It seems like there is a place for both approaches, since they have different tradeoffs?

        I plan to add a static IndexReader method, much like eg getVersion, that lets you retrieve the commitUserData of the current segments_N file. So, you're right, this enables fast checking of the latest commitUserData in the index.

        I was thinking it's a single String, and if you set it (by calling commit(commitUserData)), and the commit completes successfully, it overwrites whatever was there, previously.

        Show
        Michael McCandless added a comment - It seems like there is a place for both approaches, since they have different tradeoffs? I plan to add a static IndexReader method, much like eg getVersion, that lets you retrieve the commitUserData of the current segments_N file. So, you're right, this enables fast checking of the latest commitUserData in the index. I was thinking it's a single String, and if you set it (by calling commit(commitUserData)), and the commit completes successfully, it overwrites whatever was there, previously.
        Hide
        Mark Harwood added a comment -

        I feel uncomfortable with the suggestion of inserting "fake documents for the reasons outlined here:
        http://www.mail-archive.com/java-user@lucene.apache.org/msg22294.html

        Show
        Mark Harwood added a comment - I feel uncomfortable with the suggestion of inserting "fake documents for the reasons outlined here: http://www.mail-archive.com/java-user@lucene.apache.org/msg22294.html
        Hide
        Nadav Har'El added a comment -

        Hi Mike,
        If you add this feature, I suggest you clearly document its purpose. Writing a short comment in the segments file can be useful when utmost performance is needed (e.g., like when you just want to check if you want to reopen() or not), but my guess is that for most other scenarios, there's an alternative in the existing Lucene.

        I.e., one possible way to achieve almost the same goal but without changing anything in the code is to put a special document in the index - e.g., imagine you put a document with some unique searchable field/term combination (just to find this document), and a stored field with your "userdata". Before doing the next commit(), just update (i.e., delete and add) this document to a new value, and commit().

        This method is not as super-quick as the one you proposed, but I think that for most uses it is quick enough, and more versatile. For example, what happens with the proposed feature if the code that needs to write this "user data" is in a library, which cannot control exactly the commit() times? And what happen if several different libraries or code modules want to write their own different "commit user data"? With the simple alternative method I mentioned you have: 1. several "commit user data"s can exist (by using different field/term to find them), 2. a library can put the commit user data to the index and have it take effect on the next commit (rather than needing to specify it in the commit() call). 3. an extra commit() call does not delete the previously set data (I'm not sure what you intend to do in this case in your suggestion).

        I guess that it wouldn't hurt to add the feature that you propose - I just hope that people don't start using it for things that established Lucene mechanisms (like documents) would have been better.

        Show
        Nadav Har'El added a comment - Hi Mike, If you add this feature, I suggest you clearly document its purpose. Writing a short comment in the segments file can be useful when utmost performance is needed (e.g., like when you just want to check if you want to reopen() or not), but my guess is that for most other scenarios, there's an alternative in the existing Lucene. I.e., one possible way to achieve almost the same goal but without changing anything in the code is to put a special document in the index - e.g., imagine you put a document with some unique searchable field/term combination (just to find this document), and a stored field with your "userdata". Before doing the next commit(), just update (i.e., delete and add) this document to a new value, and commit(). This method is not as super-quick as the one you proposed, but I think that for most uses it is quick enough, and more versatile. For example, what happens with the proposed feature if the code that needs to write this "user data" is in a library, which cannot control exactly the commit() times? And what happen if several different libraries or code modules want to write their own different "commit user data"? With the simple alternative method I mentioned you have: 1. several "commit user data"s can exist (by using different field/term to find them), 2. a library can put the commit user data to the index and have it take effect on the next commit (rather than needing to specify it in the commit() call). 3. an extra commit() call does not delete the previously set data (I'm not sure what you intend to do in this case in your suggestion). I guess that it wouldn't hurt to add the feature that you propose - I just hope that people don't start using it for things that established Lucene mechanisms (like documents) would have been better.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development