Lucene - Core: LUCENE-5159

compressed diskdv sorted/sortedset termdictionaries

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Sorted/SortedSet give you ordinal(s) per document, but then separately have a "term dictionary" of all the values.

      You can do a few operations on these:

      • ord -> term lookup (e.g. retrieving facet labels)
      • term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
      • get a term enumerator (e.g. merging, ordinalmap construction)
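
      The three operations above can be sketched with a toy in-memory dictionary. This is only an illustration: the class name ToyTermDictionary is hypothetical, it uses Java strings (UTF-16 order) rather than byte-ordered terms, and it is not the DiskDV code.

      ```java
      import java.util.Arrays;
      import java.util.Iterator;

      /** Hypothetical in-memory term dictionary; an illustration, not Lucene's classes. */
      final class ToyTermDictionary {
          // Unique values in sorted order; the array index is the ordinal.
          private final String[] sortedTerms;

          ToyTermDictionary(String... sortedUniqueTerms) {
              this.sortedTerms = sortedUniqueTerms.clone();
          }

          /** ord -> term lookup (e.g. retrieving a facet label). */
          String lookupOrd(int ord) {
              return sortedTerms[ord];
          }

          /** term -> ord lookup; returns -(insertionPoint) - 1 when the term is absent. */
          int lookupTerm(String term) {
              return Arrays.binarySearch(sortedTerms, term);
          }

          /** Enumerates all terms in ordinal order (e.g. for merging). */
          Iterator<String> termsEnum() {
              return Arrays.asList(sortedTerms).iterator();
          }
      }
      ```

      A negative result from lookupTerm still tells a range filter where the missing term would sort, which is what a reverse lookup needs.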

      The current implementation for diskdv was the simplest thing that can possibly work: under the hood it just makes a binary DV for these (treating ordinals as document ids). When the terms are fixed length, you can address a term directly with multiplication. When they are variable length though, we have to store a packed ints structure in RAM.
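
      A minimal sketch of the two addressing schemes, assuming the terms are concatenated into one byte array. The class and method names are hypothetical, and a plain long[] stands in for the packed ints structure that holds the offsets.

      ```java
      import java.util.Arrays;

      /** Hypothetical helper showing how an ordinal maps to a term's bytes. */
      final class TermAddressing {
          /** Fixed-length terms: the start address is just ord * termLength. */
          static byte[] fixedLookup(byte[] data, int termLength, int ord) {
              int start = ord * termLength;
              return Arrays.copyOfRange(data, start, start + termLength);
          }

          /**
           * Variable-length terms: an offsets table with numTerms + 1 entries is needed,
           * where offsets[ord] is the start and offsets[ord + 1] is the end of the term.
           * This table is the part that costs RAM when there are many unique values.
           */
          static byte[] variableLookup(byte[] data, long[] offsets, int ord) {
              return Arrays.copyOfRange(data, (int) offsets[ord], (int) offsets[ord + 1]);
          }
      }
      ```

      In the fixed case no per-term state is needed at all; in the variable case the table grows by one entry per unique value.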

      This variable length case is overkill and chews up a lot of RAM if you have many unique values. It also chews up a lot of disk since all the values are just concatenated (no sharing).

      1. LUCENE-5159.patch
        31 kB
        Robert Muir
      2. LUCENE-5159.patch
        24 kB
        Robert Muir
      3. LUCENE-5159.patch
        24 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's an in-progress patch... all the core/codec tests pass, but I'm sure there are a few bugs to knock out (improving the tests is the way to go here).

        I'm also unhappy with the complexity.

        The idea is that for the variable-length case we just prefix-share (I set interval=16), like the Lucene 3.x dictionary. The current patch specializes the termsenum and reverse lookup for this case (but again, I'm sure there are bugs, it's hairy).
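
        Roughly, prefix sharing with an interval of 16 could look like the sketch below. It is not the code from the patch: the class name PrefixCodedDictionary is hypothetical, it keeps everything in Java arrays and strings instead of bytes on disk, and it compares in UTF-16 order. Every 16th term is stored in full; the others store only the length of the prefix shared with the previous term plus the remaining suffix.

        ```java
        import java.util.List;

        /** Hypothetical prefix-coded dictionary; a sketch, not the patch's implementation. */
        final class PrefixCodedDictionary {
            static final int INTERVAL = 16; // every 16th term is stored in full

            private final int[] prefixLengths; // length shared with the previous term (0 at block starts)
            private final String[] suffixes;   // the part of each term after the shared prefix

            PrefixCodedDictionary(List<String> sortedUniqueTerms) {
                int n = sortedUniqueTerms.size();
                prefixLengths = new int[n];
                suffixes = new String[n];
                for (int ord = 0; ord < n; ord++) {
                    String term = sortedUniqueTerms.get(ord);
                    int shared = 0;
                    if (ord % INTERVAL != 0) {
                        shared = sharedPrefix(sortedUniqueTerms.get(ord - 1), term);
                    }
                    prefixLengths[ord] = shared;
                    suffixes[ord] = term.substring(shared);
                }
            }

            private static int sharedPrefix(String a, String b) {
                int limit = Math.min(a.length(), b.length());
                int i = 0;
                while (i < limit && a.charAt(i) == b.charAt(i)) {
                    i++;
                }
                return i;
            }

            int size() {
                return suffixes.length;
            }

            /** ord -> term: jump to the block start, then decode forward at most 15 terms. */
            String lookupOrd(int ord) {
                int blockStart = ord - (ord % INTERVAL);
                StringBuilder term = new StringBuilder(suffixes[blockStart]);
                for (int i = blockStart + 1; i <= ord; i++) {
                    term.setLength(prefixLengths[i]); // keep the shared prefix
                    term.append(suffixes[i]);
                }
                return term.toString();
            }

            /** term -> ord: binary search the full terms, then scan one block. */
            int lookupTerm(String target) {
                int n = size();
                if (n == 0) {
                    return -1;
                }
                // Find the last block whose first (fully stored) term is <= target.
                int lo = 0, hi = (n - 1) / INTERVAL, block = -1;
                while (lo <= hi) {
                    int mid = (lo + hi) >>> 1;
                    if (suffixes[mid * INTERVAL].compareTo(target) <= 0) {
                        block = mid;
                        lo = mid + 1;
                    } else {
                        hi = mid - 1;
                    }
                }
                if (block < 0) {
                    return -1; // target sorts before every term
                }
                int start = block * INTERVAL;
                int end = Math.min(start + INTERVAL, n);
                StringBuilder term = new StringBuilder();
                for (int ord = start; ord < end; ord++) {
                    term.setLength(prefixLengths[ord]);
                    term.append(suffixes[ord]);
                    int cmp = term.toString().compareTo(target);
                    if (cmp == 0) {
                        return ord;
                    }
                    if (cmp > 0) {
                        return -ord - 1; // -(insertionPoint) - 1
                    }
                }
                return -end - 1;
            }
        }
        ```

        The interval trades space for lookup cost: a larger interval shares more prefixes but means more terms to decode per lookup.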

        Robert Muir added a comment -

        Fixes an off-by-one bug. I'll beef up the DV base test case to really exercise this termsenum...

        Michael McCandless added a comment -

        +1, patch looks great.

        Robert Muir added a comment -

        Patch: I made some code cleanups and beefed up BaseDocValuesFormatTestCase.

        I think it's ready.

        Michael McCandless added a comment -

        +1, patch looks great. I love the new test case!

        ASF subversion and git services added a comment -

        Commit 1512543 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1512543 ]

        LUCENE-5159: prefix-code the sorted/sortedset value dictionaries in DiskDV

        ASF subversion and git services added a comment -

        Commit 1512548 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1512548 ]

        LUCENE-5159: prefix-code the sorted/sortedset value dictionaries in DiskDV

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            1
            Watchers:
            3
