• Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
    • Lucene Fields:


      I attached a very rough checkpoint of my current patch, to get early
      feedback. All tests pass except the back-compat tests, which fail due
      to changes to package-private APIs plus certain bugs in tests that
      happened to work (e.g. calling TermPositions.nextPosition() too many
      times, which the new API asserts against).

      [Aside: I think, when we commit changes to package-private APIs such
      that back-compat tests don't pass, we could go back, make a branch on
      the back-compat tag, commit changes to the tests to use the new
      package-private APIs on that branch, then fix nightly build to use the
      tip of that branch?]

      There's still plenty to do before this is committable! This is a
      rather large change:

      • Switches to a new, more efficient terms dict format. This still
        uses tii/tis files, but the tii only stores term & long offset
        (not a TermInfo). At seek points, tis encodes term & freq/prox
        offsets absolutely instead of with deltas. Also, tis/tii
        are structured by field, so we don't have to record the field
        number in every term.
        On the first 1M docs of Wikipedia, the tii file is 36% smaller
        (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB ->
        68.5 MB). RAM usage when loading the terms dict index is
        significantly lower, since we only load an array of offsets and an
        array of String (no more TermInfo array). It should be faster to
        init too.
        This part is basically done.
      • Introduces a modular reader codec that strongly decouples the terms
        dict from the docs/positions readers. E.g. there is no more TermInfo
        used when reading the new format.
        There's nice symmetry now between reading & writing in the codec
        chain – the current docs/prox format is captured in:
        FormatPostingsDocsWriter/Reader (.frq file) and
        FormatPostingsPositionsWriter/Reader (.prx file).

        This part is basically done.

      • Introduces a new "flex" API for iterating through the fields,
        terms, docs and positions:
        FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum

        This replaces TermEnum/TermDocs/TermPositions. SegmentReader
        emulates the old API on top of the new API to keep back-compat.
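
      To make the first bullet concrete, here is a minimal sketch of an
      in-memory terms index for a single field, held as two parallel arrays
      (index terms and absolute tis offsets) and probed by binary search.
      The class name TermsIndexSketch, the method seekPointFor, and the use
      of natural String ordering are assumptions made for illustration; this
      is not the code in the patch.

```java
import java.util.Arrays;

/**
 * Hypothetical per-field terms index: two parallel arrays and no
 * TermInfo objects. All names here are illustrative only.
 */
final class TermsIndexSketch {
    private final String[] indexTerms; // sorted index terms (the seek points)
    private final long[] tisOffsets;   // absolute tis file offset of each seek point

    TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.indexTerms = indexTerms;
        this.tisOffsets = tisOffsets;
    }

    /**
     * Returns the tis offset of the last seek point whose term is
     * less than or equal to target, or -1 if target sorts before
     * every index term. Assumes natural String ordering.
     */
    long seekPointFor(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // binarySearch returns -(insertionPoint) - 1 on a miss;
            // we want the entry just before the insertion point.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : tisOffsets[pos];
    }
}
```

      A reader would then seek the tis file to the returned offset and scan
      forward to the target term. Because entries at seek points are encoded
      absolutely rather than as deltas, that scan needs no state from
      earlier entries.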

      Next steps:

      • Plug in new codecs (pulsing, pfor) to exercise the modularity /
        fix any hidden assumptions.
      • Expose new API out of IndexReader, deprecate old API but emulate
        old API on top of new one, switch all core/contrib users to the
        new API.
      • Maybe switch to AttributeSource as the base class for TermsEnum,
        DocsEnum, PostingsEnum – this would give readers API flexibility
        (not just index-file-format flexibility). E.g. if someone wanted
        to store a payload at the term-doc level instead of the
        term-doc-position level, they could just add a new attribute.
      • Test performance & iterate.
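
      The "emulate old API on top of new one" step could look roughly like
      the adapter below. It is only a sketch: DocsEnumSketch,
      LegacyTermDocsSketch, LegacyTermDocsAdapter and ArrayDocsEnum are
      hypothetical names, and the signatures are simplified stand-ins
      rather than the interfaces in the patch.

```java
/** Hypothetical flex-style docs enumerator (illustrative signatures only). */
interface DocsEnumSketch {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Advances to the next doc and returns its ID, or NO_MORE_DOCS. */
    int nextDoc();

    /** Term frequency within the current doc. */
    int freq();
}

/** Hypothetical cursor-style legacy API: next(), then doc() and freq(). */
interface LegacyTermDocsSketch {
    boolean next();
    int doc();
    int freq();
}

/** Emulates the legacy cursor on top of the new-style enumerator. */
final class LegacyTermDocsAdapter implements LegacyTermDocsSketch {
    private final DocsEnumSketch docs;
    private int current = -1;

    LegacyTermDocsAdapter(DocsEnumSketch docs) {
        this.docs = docs;
    }

    @Override
    public boolean next() {
        current = docs.nextDoc();
        return current != DocsEnumSketch.NO_MORE_DOCS;
    }

    @Override
    public int doc() {
        return current;
    }

    @Override
    public int freq() {
        return docs.freq();
    }
}

/** Toy in-memory enumerator so the adapter can be exercised. */
final class ArrayDocsEnum implements DocsEnumSketch {
    private final int[] docIds;
    private final int[] freqs;
    private int upto = -1;

    ArrayDocsEnum(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    @Override
    public int nextDoc() {
        if (upto < docIds.length) {
            upto++;
        }
        return upto < docIds.length ? docIds[upto] : NO_MORE_DOCS;
    }

    @Override
    public int freq() {
        return freqs[upto];
    }
}
```

      Callers written against the old cursor style keep working unchanged,
      while anything that produces a DocsEnumSketch, whatever the on-disk
      format, can sit underneath.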

      1. LUCENE-1458.patch (116 kB, Michael McCandless)
      2. LUCENE-1458.patch (167 kB, Michael McCandless)
      3. LUCENE-1458.patch (188 kB, Michael McCandless)
      4. LUCENE-1458.patch (263 kB, Michael McCandless)
      5. LUCENE-1458.patch (370 kB, Michael McCandless)
      6. LUCENE-1458.patch (360 kB, Michael Busch)
      7. LUCENE-1458.tar.bz2 (1.80 MB, Michael McCandless)
      8. LUCENE-1458-back-compat.patch (15 kB, Michael McCandless)
      9. LUCENE-1458.tar.bz2 (1.83 MB, Michael McCandless)
      10. LUCENE-1458.tar.bz2 (1.82 MB, Michael McCandless)
      11. LUCENE-1458-back-compat.patch (15 kB, Michael McCandless)
      12. LUCENE-1458.tar.bz2 (1.83 MB, Michael McCandless)
      13. LUCENE-1458-back-compat.patch (16 kB, Michael McCandless)
      14. LUCENE-1458.tar.bz2 (1.84 MB, Michael McCandless)
      15. LUCENE-1458-back-compat.patch (16 kB, Michael McCandless)
      16. LUCENE-1458-back-compat.patch (22 kB, Michael McCandless)
      17. LUCENE-1458.tar.bz2 (1.94 MB, Michael McCandless)
      18. LUCENE-1458-back-compat.patch (22 kB, Michael McCandless)
      19. LUCENE-1458.tar.bz2 (1.93 MB, Michael McCandless)
      20. LUCENE-1458.patch (1015 kB, Mark Miller)
      21. LUCENE-1458.patch (1024 kB, Mark Miller)
      22. LUCENE-1458.patch (886 kB, Michael McCandless)
      23. LUCENE-1458.patch (895 kB, Michael McCandless)
      24. LUCENE-1458.patch (909 kB, Michael McCandless)
      25. LUCENE-1458.patch (878 kB, Mark Miller)
      26. LUCENE-1458.patch (883 kB, Mark Miller)
      27. UnicodeTestCase.patch (2 kB, Robert Muir)
      28. UnicodeTestCase.patch (2 kB, Robert Muir)
      29. LUCENE-1458_termenum_bwcompat.patch (1 kB, Robert Muir)
      30. LUCENE-1458_sortorder_bwcompat.patch (3 kB, Robert Muir)
      31. LUCENE-1458_rotate.patch (4 kB, Robert Muir)
      32. LUCENE-1458-NRQ.patch (12 kB, Uwe Schindler)
      33. LUCENE-1458-MTQ-BW.patch (2 kB, Uwe Schindler)
      34. LUCENE-1458-DocIdSetIterator.patch (21 kB, Uwe Schindler)
      35. LUCENE-1458-DocIdSetIterator.patch (22 kB, Uwe Schindler)


            • Assignee:
              Michael McCandless


              • Created: