Lucene - Core / LUCENE-1612

expose lastDocId in the posting from the TermEnum API

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The TermEnum API currently has docFreq(), which gives the number of docs in the posting.
      It would be good to also have the max docid in the posting. That information is useful when constructing a custom DocIdSet, e.g. to determine the sparseness of the doc list and decide whether or not to use a BitSet.
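To illustrate the sparseness decision, here is a minimal sketch. The class name, method name, and the cost model (a bit set sized by the largest docid versus a sorted int array of 4 bytes per doc) are assumptions made for this example and are not part of Lucene or the attached patch.

```java
public class PostingDensity {
    /**
     * Hypothetical heuristic: prefer a bit set when it would take no more
     * memory than a sorted int[] holding the same doc IDs.
     *
     * @param docFreq   number of docs in the posting (as from TermEnum.docFreq())
     * @param lastDocId largest doc ID in the posting (the value this issue asks for)
     */
    static boolean preferBitSet(int docFreq, int lastDocId) {
        // A bit set must cover bits 0..lastDocId, rounded up to whole bytes.
        long bitSetBytes = ((long) lastDocId + 8) / 8;
        // A sorted int array costs 4 bytes per doc ID.
        long intArrayBytes = 4L * docFreq;
        return bitSetBytes <= intArrayBytes;
    }

    public static void main(String[] args) {
        System.out.println(preferBitSet(1000, 999));     // dense posting
        System.out.println(preferBitSet(10, 1_000_000)); // sparse posting
    }
}
```

Without lastDocId the caller would have to size the bit set by the segment's maxDoc, or walk the postings first, which is the cost this request tries to avoid.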

      I have written a patch to do this. The problem is that TermInfosWriter encodes values as VInt/VLong, so there is very little flexibility to add lastDocId while keeping the index backward compatible. (If a plain int were used for, say, docFreq, a bit could be used to flag that a new piece of information follows.)

      output.writeVInt(ti.docFreq); // write doc freq
      output.writeVLong(ti.freqPointer - lastTi.freqPointer); // write pointers
      output.writeVLong(ti.proxPointer - lastTi.proxPointer);
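The compatibility problem can be seen from how a variable-length int is laid out. The sketch below is a simplified stand-in written for this example (the class and method names are hypothetical, and it writes to a byte array rather than a Lucene IndexOutput): each byte carries 7 bits of the value, and the high bit only says whether another byte follows, so there is no spare bit an old reader would ignore.

```java
import java.io.ByteArrayOutputStream;

public class VIntSketch {
    /** Encode a non-negative int, 7 value bits per byte, low-order bits first. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // High bit set: more bytes follow.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        // High bit clear: this is the last byte.
        out.write(value);
        return out.toByteArray();
    }

    /** Decode a value produced by writeVInt. */
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte reached
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(5).length);   // small values take one byte
        System.out.println(writeVInt(128).length); // larger values take more
        System.out.println(readVInt(writeVInt(300)));
    }
}
```

With a fixed-width int, the sign bit of docFreq could have been reserved as a "lastDocId follows" flag. With this layout, any extra value written after docFreq would be read by an older reader as the start of the freqPointer delta.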

      Anyway, the patch is attached, with TestSegmentTermEnum modified to test this. TestBackwardsCompatibility fails for the reasons described above.

        Activity

        Transition: Open -> Resolved
        Time in source status: 1399d 2h 47m
        Execution times: 1
        Last executer: John Wang
        Last execution date: 22/Feb/13 05:49
        John Wang made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Invalid [ 6 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563270 ] jira [ 12584374 ]
        Mark Thomas made changes -
        Workflow jira [ 12461720 ] Default workflow, editable Closed status [ 12563270 ]
        Michael McCandless added a comment -

        Actually I think the codec will handle merging (this was recently proposed for payloads), so it should be able to do that optimization within itself.

        John Wang added a comment -

        I am fine with waiting for LUCENE-1458. But Michael, then how would it help the merge of postings you described? Merging would be outside of the codec, no?

        Yonik Seeley added a comment -

        So maybe we wait for LUCENE-1458 and then build this as an alternate codec?

        +1, seems pretty specialized.

        Michael McCandless added a comment -

        Well... a couple problems with always doing this:

        • The in-memory terms index now consumes another 4 bytes per indexed term
        • The tii/tis files got larger

        One way to optimize it might be to only record it for terms whose freq is > X (and for the long tail of low-freq terms you iterate its postings to get the last docID).
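The threshold idea could look roughly like the following sketch. Everything here is hypothetical (the class, the NOT_STORED sentinel, the minFreq parameter standing in for "X", and the use of a plain int[] in place of a real postings iterator); it only shows the shape of the trade-off, not how a codec would implement it.

```java
public class LastDocIdLookup {
    /** Sentinel meaning "no lastDocId was written for this term" (assumed for this sketch). */
    static final int NOT_STORED = -1;

    /** Write side: only record lastDocId for terms whose docFreq exceeds the threshold. */
    static int valueToStore(int docFreq, int lastDocId, int minFreq) {
        return docFreq > minFreq ? lastDocId : NOT_STORED;
    }

    /**
     * Read side: use the stored value when present, otherwise walk the
     * (short) postings list. The int[] holds ascending doc IDs.
     */
    static int lastDocId(int stored, int[] postings) {
        if (stored != NOT_STORED) {
            return stored;
        }
        int last = NOT_STORED;
        for (int docId : postings) {
            last = docId; // ascending order, so the final element is the max
        }
        return last;
    }

    public static void main(String[] args) {
        int[] rareTerm = {3, 17, 42};
        int stored = valueToStore(rareTerm.length, 42, 16);
        System.out.println(lastDocId(stored, rareTerm));
    }
}
```

The walk is bounded by the threshold, so low-frequency terms pay a small read-time cost instead of the extra bytes per term.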

        Also, most apps don't need this information. So I don't think we should turn this on, always.

        So maybe we wait for LUCENE-1458 and then build this as an alternate codec?

        John Wang added a comment -

        Excellent point Michael! What do you suggest on how to move forward with this?

        Michael McCandless added a comment -

        This would be a good test (custom codec) for flexible indexing (LUCENE-1458), ie, allowing you to write whatever you want per-term.

        Also, if lastDocID is always available, this could make merging of postings much faster than it is today, because you could bulk-copy the doc/freq posting bytes while just "fixing up" the boundary between them, because docIDs are delta coded.
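A minimal sketch of that boundary fix-up, using int arrays of deltas in place of the encoded doc/freq bytes. The names are hypothetical, and it assumes no deleted docs (deletions would force doc IDs to be remapped, which rules out a plain copy).

```java
import java.util.Arrays;

public class DeltaMergeSketch {
    /**
     * Append segment B's delta-coded doc IDs to segment A's.
     * Only B's first delta changes: it must become the gap from A's last
     * doc ID to B's first doc ID in the merged doc ID space.
     *
     * @param lastDocIdA last doc ID in A's posting (0 if A is empty)
     * @param docBaseB   offset added to B's doc IDs in the merged segment
     */
    static int[] concat(int[] deltasA, int lastDocIdA, int[] deltasB, int docBaseB) {
        int[] merged = Arrays.copyOf(deltasA, deltasA.length + deltasB.length);
        // Bulk copy of B's deltas, unchanged.
        System.arraycopy(deltasB, 0, merged, deltasA.length, deltasB.length);
        if (deltasB.length > 0) {
            // Fix up the single boundary entry.
            merged[deltasA.length] = docBaseB + deltasB[0] - lastDocIdA;
        }
        return merged;
    }

    /** Turn deltas back into absolute doc IDs (first delta is relative to 0). */
    static int[] decode(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            docIds[i] = current;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] a = {2, 3, 4}; // doc IDs 2, 5, 9
        int[] b = {1, 3};    // doc IDs 1, 4 within segment B
        System.out.println(Arrays.toString(decode(concat(a, 9, b, 10))));
    }
}
```

Without a stored lastDocId, the merger has to decode all of A's deltas just to learn where A ends, which is why having it available would allow the rest to be copied as raw bytes.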

        John Wang made changes -
        Field Original Value New Value
        Attachment lucene-1612-patch.txt [ 12406414 ]
        John Wang added a comment -

        Patch attached with test. Index is not backwards compatible.

        John Wang created issue -

          People

          • Assignee: Unassigned
          • Reporter: John Wang
          • Votes: 0
          • Watchers: 0
