Lucene - Core / LUCENE-1612

expose lastDocId in the posting from the TermEnum API

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The TermEnum API currently has docFreq(), which gives the number of docs in the posting.
      It would be good to also have the max docid in the posting. That information is useful when constructing a custom DocIdSet, e.g. to determine the sparseness of the doc list and decide whether or not to use a BitSet.
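To illustrate the sparseness decision, here is a minimal sketch. The class `PostingDensity`, its method names, and the cost model (one bit per docid up to lastDocId versus 32 bits per stored docid) are assumptions made for this example; they are not part of Lucene or of the attached patch.

```java
/** Hypothetical helper: picks a DocIdSet representation from docFreq and lastDocId. */
final class PostingDensity {
    enum Representation { BIT_SET, SORTED_INT_ARRAY }

    /** Fraction of the docid range [0, lastDocId] that the posting actually occupies. */
    static double density(int docFreq, int lastDocId) {
        return docFreq / (lastDocId + 1.0);
    }

    /**
     * Assumed cost model: a bit set covering [0, lastDocId] needs lastDocId + 1 bits,
     * while a sorted int array needs 32 bits per document in the posting.
     */
    static Representation choose(int docFreq, int lastDocId) {
        if (docFreq <= 0 || lastDocId < 0) {
            throw new IllegalArgumentException("docFreq must be > 0 and lastDocId >= 0");
        }
        long bitSetBits = (long) lastDocId + 1;
        long intArrayBits = 32L * docFreq;
        return bitSetBits <= intArrayBits
                ? Representation.BIT_SET
                : Representation.SORTED_INT_ARRAY;
    }
}
```

With only docFreq() available, the caller would have to walk the whole posting to learn the last docid before making this choice.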

      I have written a patch to do this. The problem is that TermInfosWriter encodes values as VInt/VLong, so there is very little flexibility to add lastDocId while keeping the index backward compatible. (If a simple int were used for, say, docFreq, a bit could be used to flag the reading of a new piece of information.)

      output.writeVInt(ti.docFreq); // write doc freq
      output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
      output.writeVLong(ti.proxPointer - lastTi.proxPointer);
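The compatibility problem can be sketched as follows. `encodeVInt` follows the VInt scheme (7 payload bits per byte, low-order group first, high bit set when more bytes follow), so every bit in the stream already carries meaning. The second half shows the alternative mentioned above: with a fixed 32-bit docFreq, the sign bit is unused and could act as a flag. The class name, the constant `HAS_LAST_DOC_ID`, and the packing methods are hypothetical and do not come from the patch.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative only: contrasts VInt encoding with a hypothetical flagged fixed-width int. */
final class VIntSketch {
    /** VInt scheme: 7 payload bits per byte, high bit marks "more bytes follow". */
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    static int decodeVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return value;
    }

    // Hypothetical fixed-width layout: docFreq is never negative, so the sign bit is free.
    static final int HAS_LAST_DOC_ID = 0x80000000;

    static int packDocFreq(int docFreq, boolean hasLastDocId) {
        return hasLastDocId ? (docFreq | HAS_LAST_DOC_ID) : docFreq;
    }

    static boolean hasLastDocId(int packed) {
        return (packed & HAS_LAST_DOC_ID) != 0;
    }

    static int docFreq(int packed) {
        return packed & ~HAS_LAST_DOC_ID;
    }
}
```

In the fixed-width layout, a value written by an old writer never has the flag set, so a new reader would know not to expect lastDocId. The VInt stream offers no such spare bit, which is why an extra field changes how existing readers interpret the bytes that follow.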

      Anyway, the patch is attached, with TestSegmentTermEnum modified to test this. TestBackwardsCompatibility fails for the reasons described above.

        Activity

        John Wang added a comment -

        Patch attached with test. The index is not backwards compatible.

        Michael McCandless added a comment -

        This would be a good test (custom codec) for flexible indexing (LUCENE-1458), i.e., allowing you to write whatever you want per-term.

        Also, if lastDocID is always available, this could make merging of postings much faster than it is today, because you could bulk-copy the doc/freq posting bytes while just "fixing up" the boundary between them, because docIDs are delta coded.
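A minimal sketch of the boundary fix-up idea, under simplifying assumptions: postings are modeled as plain `int[]` arrays of docID deltas (real postings interleave freqs and are VInt-encoded), there are no deleted documents, and the second segment's docIDs simply shift by `docBaseB` in the merged segment. The class and method names are hypothetical.

```java
/** Illustrative only: merges two delta-coded docID lists by rewriting a single delta. */
final class DeltaMergeSketch {
    /** Delta-codes ascending docIDs; the first delta is relative to 0. */
    static int[] toDeltas(int[] docIds) {
        int[] deltas = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            deltas[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return deltas;
    }

    /** Decodes deltas back to absolute docIDs. */
    static int[] fromDeltas(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int doc = 0;
        for (int i = 0; i < deltas.length; i++) {
            doc += deltas[i];
            docIds[i] = doc;
        }
        return docIds;
    }

    /**
     * Bulk-copies both delta lists and fixes only the first delta of segment B.
     * lastDocIdA is the value this issue proposes to expose; without it, segment A
     * would have to be decoded to find out where it ends.
     */
    static int[] mergeDeltas(int[] deltasA, int lastDocIdA, int[] deltasB, int docBaseB) {
        int[] merged = new int[deltasA.length + deltasB.length];
        System.arraycopy(deltasA, 0, merged, 0, deltasA.length);
        System.arraycopy(deltasB, 0, merged, deltasA.length, deltasB.length);
        if (deltasB.length > 0) {
            int prev = deltasA.length > 0 ? lastDocIdA : 0;
            // B's first docID in the merged segment is docBaseB + deltasB[0].
            merged[deltasA.length] = docBaseB + deltasB[0] - prev;
        }
        return merged;
    }
}
```

For example, merging segment A with docIDs {2, 5, 9} and segment B with local docIDs {0, 3, 4} at `docBaseB = 10` copies every delta unchanged except the fourth, which becomes 10 + 0 - 9 = 1.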

        John Wang added a comment -

        Excellent point, Michael! How do you suggest we move forward with this?

        Michael McCandless added a comment -

        Well... a couple of problems with always doing this:

        • The in-memory terms index now consumes another 4 bytes per indexed term
        • The tii/tis files get larger

        One way to optimize it might be to only record it for terms whose freq is > X (and for the long tail of low-freq terms you iterate their postings to get the last docID).
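The threshold idea might look roughly like this. The class name, the sentinel `NOT_STORED`, and the value chosen for X (`MIN_DOC_FREQ_TO_STORE = 16`) are assumptions for illustration, and postings are modeled as a plain array of ascending docIDs rather than an encoded stream.

```java
/** Illustrative only: store lastDocId for frequent terms, scan postings for rare ones. */
final class LastDocIdLookup {
    static final int NOT_STORED = -1;
    static final int MIN_DOC_FREQ_TO_STORE = 16; // hypothetical threshold X

    /** Write side: what would be recorded in the term dictionary entry. */
    static int storedLastDocId(int[] docIds) {
        return docIds.length > MIN_DOC_FREQ_TO_STORE
                ? docIds[docIds.length - 1]
                : NOT_STORED;
    }

    /** Read side: use the stored value if present, otherwise iterate the short posting. */
    static int lastDocId(int stored, int[] postings) {
        if (stored != NOT_STORED) {
            return stored;
        }
        int last = -1; // stays -1 for an empty posting
        for (int doc : postings) {
            last = doc;
        }
        return last;
    }
}
```

The scan only ever runs over postings with at most X entries, so its cost is bounded, while the long tail of rare terms adds nothing to the term dictionary.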

        Also, most apps don't need this information, so I don't think we should always turn this on.

        So maybe we wait for LUCENE-1458 and then build this as an alternate codec?

        Yonik Seeley added a comment -

        "So maybe we wait for LUCENE-1458 and then build this as an alternate codec?"

        +1, seems pretty specialized.

        John Wang added a comment -

        I am fine with waiting for LUCENE-1458. But Michael, then how would it help the merge of postings you described? Merging would be outside of the codec, no?

        Michael McCandless added a comment -

        Actually I think the codec will handle merging (this was recently proposed for payloads), so it should be able to do that optimization within itself.


          People

          • Assignee:
            Unassigned
            Reporter:
            John Wang