Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We can improve the current format in a few ways:

      • speed up Sorted/SortedSet byte[] lookup by structuring the term blocks differently (allow random access, more efficient bulk i/o)
      • speed up reverse lookup by adding a reverse index (small: just every 1024th term, with useless suffixes removed).
      • use the slice API for access to binary content, too.
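
To make the reverse index idea concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class name `ReverseIndexSketch`, the use of `String` instead of raw bytes, and the in-memory term list standing in for on-disk term blocks are all assumptions made for illustration. Every interval-th term is sampled, and each sample is cut down to the shortest prefix that still sorts after the preceding term, which is one way to read "useless suffixes removed".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a sparse reverse index over a sorted, de-duplicated term list. */
final class ReverseIndexSketch {
  private final List<String> terms;   // sorted unique terms (stand-in for on-disk term blocks)
  private final List<String> samples; // truncated copy of every interval-th term
  private final int interval;         // the issue uses 1024; kept configurable here

  ReverseIndexSketch(List<String> sortedTerms, int interval) {
    this.terms = sortedTerms;
    this.interval = interval;
    this.samples = new ArrayList<>();
    for (int i = 0; i < sortedTerms.size(); i += interval) {
      String prior = i == 0 ? "" : sortedTerms.get(i - 1);
      samples.add(shortestSeparator(prior, sortedTerms.get(i)));
    }
  }

  /** Shortest prefix of current that still sorts after prior (assumes prior < current). */
  static String shortestSeparator(String prior, String current) {
    int common = 0;
    int limit = Math.min(prior.length(), current.length());
    while (common < limit && prior.charAt(common) == current.charAt(common)) {
      common++;
    }
    return current.substring(0, Math.min(current.length(), common + 1));
  }

  /** Returns the ordinal of term, or -1 if it is absent. */
  int lookupTerm(String term) {
    // Binary search for the last sample that is <= term; that sample names the block.
    int lo = 0, hi = samples.size() - 1, block = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (samples.get(mid).compareTo(term) <= 0) {
        block = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    if (block < 0) {
      return -1; // term sorts before the first sample, so it cannot be present
    }
    // Scan only the one block the term can live in.
    int start = block * interval;
    int end = Math.min(terms.size(), start + interval);
    for (int ord = start; ord < end; ord++) {
      int cmp = terms.get(ord).compareTo(term);
      if (cmp == 0) return ord;
      if (cmp > 0) break;
    }
    return -1;
  }
}
```

The sketch works because each sample sorts strictly after the last term of the previous block and no later than the first term of its own block, so the last sample that is less than or equal to the target identifies the only block worth scanning.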
      1. LUCENE-5882.patch
        179 kB
        Robert Muir
      2. LUCENE-5882.patch
        172 kB
        Robert Muir
      3. LUCENE-5882.patch
        170 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Patch.

        Also, when cardinality is low (there would be no reverse index), compression saves very little RAM, so just encode as variable binary for a little extra speed, since it's going to be under 8 KB of RAM for addressing anyway.
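
A rough sketch of the low-cardinality path, again in plain Java and not taken from the patch. The threshold of 1024 is an assumption inferred from the reverse index interval mentioned in the description, and `VariableBinarySketch` and its methods are hypothetical names. Terms are concatenated uncompressed and addressed through an array of offsets, so the addressing cost is one `long` per term plus one.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: uncompressed variable-length terms addressed by an offsets array. */
final class VariableBinarySketch {
  // Assumed threshold: below this cardinality there would be no reverse index entry to gain from.
  static final int REVERSE_INTERVAL = 1024;

  private final byte[] data;    // all terms, concatenated in sorted order
  private final long[] offsets; // offsets[ord]..offsets[ord + 1] delimits term ord

  VariableBinarySketch(List<byte[]> sortedTerms) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    offsets = new long[sortedTerms.size() + 1];
    for (int i = 0; i < sortedTerms.size(); i++) {
      byte[] term = sortedTerms.get(i);
      out.write(term, 0, term.length);
      offsets[i + 1] = out.size();
    }
    data = out.toByteArray();
  }

  /** Random access by ordinal: two offset reads and one copy, no decompression. */
  byte[] lookupOrd(int ord) {
    return Arrays.copyOfRange(data, (int) offsets[ord], (int) offsets[ord + 1]);
  }

  /** RAM spent on addressing if every offset is held as an uncompressed long. */
  long addressingBytes() {
    return (long) offsets.length * Long.BYTES;
  }

  /** Assumed decision rule: prefer the plain encoding while cardinality stays below the interval. */
  static boolean useVariableBinary(long cardinality) {
    return cardinality < REVERSE_INTERVAL;
  }
}
```

With fewer than 1024 terms the offsets array holds at most 1024 longs, which is where a bound of roughly 8 KB for addressing would come from under these assumptions.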

        Adrien Grand added a comment -

        +1 The patch looks good!

        And nice use of BytesRefBuilder.

        Michael McCandless added a comment -

        +1, this is a very clean terms dict implementation! Maybe you can rewrite block tree!!

        Robert Muir added a comment -

        Just adds missing comments/docs.

        I still want to add more tests and simplifications.

        Ryan Ernst added a comment -

        +1, this looks good.

        Ryan Ernst added a comment -

        Oh, I did have one minor comment. In the else case of addTermsDict, as well as addReverseTermIndex, I think you can add an assert maxLength > 0, and then remove the Math.max(0, maxLength)?

        Robert Muir added a comment -

        Thank you Ryan. It's more than that actually: we had stupidity at read-time too, to handle the empty terms case (this can happen when all values are merged away, and yes, we test it explicitly).

        I removed the max'ing and replaced with asserts.
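
For readers who have not seen the patch, the following is a small illustration of why an assert can replace the clamp. It is a sketch under assumptions, not the actual addTermsDict or addReverseTermIndex code: `TermLengthStats`, `isVariableLength`, and `scratchSize` are hypothetical names. The point is that once the empty and fixed-length cases are routed elsewhere, maxLength is strictly greater than minLength, and minLength is never negative, so maxLength > 0 must hold.

```java
import java.util.List;

/** Hypothetical sketch: min/max term length tracking with an assert instead of Math.max(0, ...). */
final class TermLengthStats {
  final int minLength;
  final int maxLength;

  private TermLengthStats(int minLength, int maxLength) {
    this.minLength = minLength;
    this.maxLength = maxLength;
  }

  static TermLengthStats of(List<byte[]> terms) {
    int min = Integer.MAX_VALUE; // sentinels survive only when there are no terms
    int max = Integer.MIN_VALUE;
    for (byte[] term : terms) {
      min = Math.min(min, term.length);
      max = Math.max(max, term.length);
    }
    return new TermLengthStats(min, max);
  }

  /** False for empty input (sentinels) and for input where all terms share one length. */
  boolean isVariableLength() {
    return maxLength > minLength;
  }

  /** Buffer size for the variable-length path; no clamping needed once the invariant is asserted. */
  int scratchSize() {
    if (!isVariableLength()) {
      throw new IllegalStateException("empty or fixed-length input takes a different path");
    }
    assert maxLength > 0 : "variable-length input implies maxLength > minLength >= 0";
    return maxLength;
  }
}
```

Java asserts only fire when the JVM runs with -ea, so the assert documents and checks the invariant in tests without adding cost in production runs.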

        I also added new random termsenum tests to TestLucene410DocValuesFormat. These test the termsenum behavior with large numbers of terms (in nightly runs, very large numbers). It would be nice to factor them into the base class to improve testing of all DVFs, but that's a little more complicated and noisy, so I left a TODO. I intend to address it after this issue though.
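
The general shape of such a randomized check can be sketched as follows. This is not the test added to TestLucene410DocValuesFormat and it does not use Lucene classes: `RandomSeekCheck` is a hypothetical name, a sorted array with binary search stands in for the terms dictionary, and a `TreeSet` serves as the trusted reference. Random targets, including the empty term and terms that are absent, are looked up in both, and any disagreement fails the run.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical sketch of a randomized seek test against a reference implementation. */
final class RandomSeekCheck {

  /** Stand-in for a ceiling seek: index of the smallest term >= target, or -1 if none. */
  static int seekCeil(String[] sorted, String target) {
    int pos = Arrays.binarySearch(sorted, target);
    if (pos < 0) {
      pos = -pos - 1; // convert "not found" result into the insertion point
    }
    return pos < sorted.length ? pos : -1;
  }

  /** Short random term over a tiny alphabet, so seeks hit both present and absent terms. */
  static String randomTerm(Random random) {
    int length = random.nextInt(8); // 0..7, so the empty term is exercised too
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + random.nextInt(4)));
    }
    return sb.toString();
  }

  /** Runs numSeeks random seeks and returns how many were verified. numTerms must stay well below 21845. */
  static int run(long seed, int numTerms, int numSeeks) {
    Random random = new Random(seed);
    TreeSet<String> expected = new TreeSet<>();
    while (expected.size() < numTerms) {
      expected.add(randomTerm(random));
    }
    String[] sorted = expected.toArray(new String[0]);
    int checked = 0;
    for (int i = 0; i < numSeeks; i++) {
      String target = randomTerm(random);
      String want = expected.ceiling(target);
      int ord = seekCeil(sorted, target);
      String got = ord < 0 ? null : sorted[ord];
      if (!Objects.equals(want, got)) {
        throw new AssertionError("seed=" + seed + " target=" + target + " want=" + want + " got=" + got);
      }
      checked++;
    }
    return checked;
  }
}
```

Including the seed in the failure message keeps a failing run reproducible, and scaling numTerms up for nightly runs mirrors the idea of testing with very large numbers of terms.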

        ASF subversion and git services added a comment -

        Commit 1617975 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1617975 ]

        LUCENE-5882: Add 4.10 docvaluesformat

        ASF subversion and git services added a comment -

        Commit 1617988 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1617988 ]

        LUCENE-5882: Add 4.10 docvaluesformat


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            7

            Dates

            • Created:
              Updated:
              Resolved:
