Lucene - Core
LUCENE-6668

Optimize SortedSet/SortedNumeric storage for the few unique sets use-case

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Robert suggested this idea: if there are few unique sets of values, we could build a lookup table and then map each doc to an ord in this table, just like we already do for table compression for numerics.

      I think this is especially compelling given that SortedSet/SortedNumeric are the only two doc values types that use O(maxDoc) memory, because of the offsets map. When this new strategy is used, memory usage could be bounded by a constant.
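
      The idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patch: the class `SetTableSketch` and its members are hypothetical names, and the arrays live on the heap here purely to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical names, not Lucene's doc values codec. */
final class SetTableSketch {
  /** Lookup table of distinct value sets, in order of first appearance. */
  final List<long[]> table = new ArrayList<>();
  /** One table ordinal per document. */
  final int[] docToOrd;

  SetTableSketch(long[][] valuesPerDoc) {
    Map<List<Long>, Integer> ords = new HashMap<>();
    docToOrd = new int[valuesPerDoc.length];
    for (int doc = 0; doc < valuesPerDoc.length; doc++) {
      // Normalize each doc's values by sorting, so equal sets compare equal.
      long[] sorted = valuesPerDoc[doc].clone();
      Arrays.sort(sorted);
      List<Long> key = new ArrayList<>(sorted.length);
      for (long v : sorted) {
        key.add(v);
      }
      Integer ord = ords.get(key);
      if (ord == null) {
        // First time this set is seen: append it to the table.
        ord = table.size();
        ords.put(key, ord);
        table.add(sorted);
      }
      docToOrd[doc] = ord;
    }
  }

  /** Values for a document, resolved through the lookup table. */
  long[] valuesFor(int doc) {
    return table.get(docToOrd[doc]);
  }
}
```

      With few unique sets the table stays tiny no matter how many documents there are, and each document needs only a small ordinal instead of an entry in an offsets map.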

      1. LUCENE-6668.patch
        32 kB
        Adrien Grand
      2. LUCENE-6668.patch
        24 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Here is a patch: it uses table encoding for SortedSet/SortedNumeric if the sum of the sizes of all unique sets is 256 or less. If my math is correct, this means it will always be used if there are 6 or fewer unique values (given that the sum of the sizes of all possible subsets would be 6 × 2^5 = 192), and might be used if the number of unique values is between 7 and 256.
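
        The arithmetic behind the "6 unique values" bound can be checked with a short sketch. It assumes each document holds a set without duplicates, so the worst case is every possible subset of n values appearing once; the sum of the subset sizes is then n × 2^(n-1). The class and method names are hypothetical, and only the 256 threshold comes from the comment above.

```java
/** Illustrative only: hypothetical names; checks the worst-case arithmetic. */
final class TableThresholdSketch {
  /** Threshold quoted in the comment: sum of sizes of all unique sets. */
  static final int MAX_TOTAL_SET_SIZE = 256;

  /** Closed form: sum of sizes of all subsets of n distinct values. */
  static long worstCaseTotalSize(int n) {
    return n == 0 ? 0 : ((long) n) << (n - 1);
  }

  /** Same quantity by brute force: enumerate subsets as bitmasks. */
  static long worstCaseTotalSizeBruteForce(int n) {
    long total = 0;
    for (int mask = 0; mask < (1 << n); mask++) {
      total += Integer.bitCount(mask); // size of the subset encoded by mask
    }
    return total;
  }

  /** Largest n whose worst case still fits under the threshold. */
  static int largestAlwaysEligible() {
    int n = 0;
    while (worstCaseTotalSize(n + 1) <= MAX_TOTAL_SET_SIZE) {
      n++;
    }
    return n;
  }
}
```

        For 6 values the worst case is 192, which fits under 256; for 7 values it is 448, which does not, so above 6 values eligibility depends on which sets actually occur.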

        Adrien Grand added a comment -

        Updated patch so that BaseDocValuesFormatTestCase explicitly tests both cases: few unique sets of values and many unique sets of values.

        Robert Muir added a comment -

        +1, nice to have TABLE applied to the other types here too!

        ASF subversion and git services added a comment -

        Commit 1692058 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1692058 ]

        LUCENE-6668: Added table encoding to sorted set/numeric doc values.

        ASF subversion and git services added a comment -

        Commit 1692061 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1692061 ]

        LUCENE-6668: Added table encoding to sorted set/numeric doc values.

        ASF subversion and git services added a comment -

        Commit 1692069 from Adrien Grand in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1692069 ]

        LUCENE-6668: Add missing Iterator.remove() implementation.

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            3