Lucene - Core / LUCENE-1458

Further steps towards flexible indexing

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

    Description

      I attached a very rough checkpoint of my current patch, to get early
      feedback. All tests pass except the back-compat tests, which fail due
      to changes to package-private APIs plus certain bugs in tests that
      happened to work (e.g. calling TermPositions.nextPosition() too many
      times, which the new API asserts against).

      [Aside: I think, when we commit changes to package-private APIs such
      that back-compat tests don't pass, we could go back, make a branch on
      the back-compat tag, commit changes to the tests to use the new
      package-private APIs on that branch, then fix the nightly build to use
      the tip of that branch?]

      There's still plenty to do before this is committable! This is a
      rather large change:

      • Switches to a new, more efficient terms dict format. This still
        uses tii/tis files, but the tii only stores term & long offset
        (not a TermInfo). At seek points, tis encodes term & freq/prox
        offsets absolutely instead of with deltas. Also, tis/tii
        are structured by field, so we don't have to record the field
        number in every term.

        On the first 1M docs of Wikipedia, the tii file is 36% smaller
        (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB ->
        68.5 MB).

        RAM usage when loading the terms dict index is significantly less
        since we only load an array of offsets and an array of String (no
        more TermInfo array). It should be faster to init too.

        This part is basically done.
      • Introduces a modular reader codec that strongly decouples the terms
        dict from the docs/positions readers. E.g. there is no more
        TermInfo used when reading the new format.

        There's nice symmetry now between reading & writing in the codec
        chain; the current docs/prox format is captured in:
        FormatPostingsTermsDictWriter/Reader,
        FormatPostingsDocsWriter/Reader (.frq file) and
        FormatPostingsPositionsWriter/Reader (.prx file).

        This part is basically done.

      • Introduces a new "flex" API for iterating through the fields,
        terms, docs and positions:
        FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum

        This replaces TermEnum/TermDocs/TermPositions. SegmentReader
        emulates the old API on top of the new API to keep back-compat.
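
      A minimal sketch of the in-memory terms index described in the first
      bullet above, assuming the index is held as two parallel arrays (the
      sorted index terms and their absolute offsets into the tis file).
      The class and method names are hypothetical and not taken from the
      patch, and plain String ordering stands in for the real term sort
      order:

```java
import java.util.Arrays;

/** Toy terms index: two parallel arrays instead of a TermInfo[] (hypothetical names). */
final class TermsIndexSketch {
    private final String[] indexTerms; // sampled terms, in sorted order
    private final long[] tisOffsets;   // absolute tis file offset for each sampled term

    TermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
        if (indexTerms.length != tisOffsets.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.indexTerms = indexTerms;
        this.tisOffsets = tisOffsets;
    }

    /**
     * Returns the tis offset of the greatest index term that is <= target,
     * i.e. where a scan for target would start, or -1 if target sorts
     * before every index term.
     */
    long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            // Not found: binarySearch returned -(insertionPoint) - 1,
            // so insertionPoint - 1 is the last index term below target.
            pos = -pos - 2;
        }
        return pos < 0 ? -1L : tisOffsets[pos];
    }
}
```

      With index terms {"apple", "mango", "zebra"}, a lookup for "banana"
      lands on the entry for "apple", and a reader would scan forward in
      the tis file from that offset.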

      Next steps:

      • Plug in new codecs (pulsing, pfor) to exercise the modularity /
        fix any hidden assumptions.
      • Expose new API out of IndexReader, deprecate old API but emulate
        old API on top of new one, switch all core/contrib users to the
        new API.
      • Maybe switch to AttributeSources as the base class for TermsEnum,
        DocsEnum, PostingsEnum; this would give readers API flexibility
        (not just index-file-format flexibility). E.g. someone who wanted
        to store payloads at the term-doc level instead of the
        term-doc-position level could just add a new attribute.
      • Test performance & iterate.
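
      To illustrate the attribute idea from the list above, here is a
      self-contained sketch of a docs enum that carries a per-document
      payload attribute. Everything in it is hypothetical:
      AttributeHolderSketch is a simplified stand-in and not Lucene's
      AttributeSource, and DocPayloadAttributeSketch and DocsEnumSketch
      are invented for this example only:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an attribute-carrying base class (not Lucene's AttributeSource). */
class AttributeHolderSketch {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Registers an attribute under its type, or returns the one already registered. */
    <T> T addAttribute(Class<T> type, T instance) {
        Object existing = attributes.putIfAbsent(type, instance);
        return type.cast(existing != null ? existing : instance);
    }

    /** Returns the registered attribute of this type, or null if none was added. */
    <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }
}

/** Hypothetical payload stored per term-doc rather than per term-doc-position. */
final class DocPayloadAttributeSketch {
    byte[] payload = new byte[0];
}

/** Toy docs enum over in-memory arrays; refreshes the payload attribute only if one was added. */
final class DocsEnumSketch extends AttributeHolderSketch {
    private final int[] docIds;
    private final byte[][] payloads;
    private int upto = -1;

    DocsEnumSketch(int[] docIds, byte[][] payloads) {
        this.docIds = docIds;
        this.payloads = payloads;
    }

    /** Advances to the next document; returns false when exhausted. */
    boolean next() {
        if (++upto >= docIds.length) {
            return false;
        }
        DocPayloadAttributeSketch att = getAttribute(DocPayloadAttributeSketch.class);
        if (att != null) {
            att.payload = payloads[upto];
        }
        return true;
    }

    /** Current document ID; only valid after next() returned true. */
    int doc() {
        return docIds[upto];
    }
}
```

      A consumer that does not care about payloads never adds the
      attribute and pays nothing for it; one that does adds it once and
      reads it after each next() call, without the enum's method
      signatures having to change.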

      Attachments

        1. LUCENE-1458.patch
          116 kB
          Michael McCandless
        2. LUCENE-1458.patch
          167 kB
          Michael McCandless
        3. LUCENE-1458.patch
          188 kB
          Michael McCandless
        4. LUCENE-1458.patch
          263 kB
          Michael McCandless
        5. LUCENE-1458.patch
          370 kB
          Michael McCandless
        6. LUCENE-1458.patch
          360 kB
          Michael Busch
        7. LUCENE-1458.tar.bz2
          1.80 MB
          Michael McCandless
        8. LUCENE-1458-back-compat.patch
          15 kB
          Michael McCandless
        9. LUCENE-1458.tar.bz2
          1.83 MB
          Michael McCandless
        10. LUCENE-1458.tar.bz2
          1.82 MB
          Michael McCandless
        11. LUCENE-1458-back-compat.patch
          15 kB
          Michael McCandless
        12. LUCENE-1458.tar.bz2
          1.83 MB
          Michael McCandless
        13. LUCENE-1458-back-compat.patch
          16 kB
          Michael McCandless
        14. LUCENE-1458.tar.bz2
          1.84 MB
          Michael McCandless
        15. LUCENE-1458-back-compat.patch
          16 kB
          Michael McCandless
        16. LUCENE-1458-back-compat.patch
          22 kB
          Michael McCandless
        17. LUCENE-1458.tar.bz2
          1.94 MB
          Michael McCandless
        18. LUCENE-1458-back-compat.patch
          22 kB
          Michael McCandless
        19. LUCENE-1458.tar.bz2
          1.93 MB
          Michael McCandless
        20. LUCENE-1458.patch
          1015 kB
          Mark Miller
        21. LUCENE-1458.patch
          1024 kB
          Mark Miller
        22. LUCENE-1458.patch
          886 kB
          Michael McCandless
        23. LUCENE-1458.patch
          895 kB
          Michael McCandless
        24. LUCENE-1458.patch
          909 kB
          Michael McCandless
        25. LUCENE-1458.patch
          878 kB
          Mark Miller
        26. LUCENE-1458.patch
          883 kB
          Mark Miller
        27. UnicodeTestCase.patch
          2 kB
          Robert Muir
        28. UnicodeTestCase.patch
          2 kB
          Robert Muir
        29. LUCENE-1458_termenum_bwcompat.patch
          1 kB
          Robert Muir
        30. LUCENE-1458_sortorder_bwcompat.patch
          3 kB
          Robert Muir
        31. LUCENE-1458_rotate.patch
          4 kB
          Robert Muir
        32. LUCENE-1458-NRQ.patch
          12 kB
          Uwe Schindler
        33. LUCENE-1458-MTQ-BW.patch
          2 kB
          Uwe Schindler
        34. LUCENE-1458-DocIdSetIterator.patch
          21 kB
          Uwe Schindler
        35. LUCENE-1458-DocIdSetIterator.patch
          22 kB
          Uwe Schindler

            People

              Assignee: mikemccand Michael McCandless
              Reporter: mikemccand Michael McCandless
              Votes: 1
              Watchers: 7
