Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Norms can eat up a lot of RAM, since by default they take 8 bits per field per document. We rely on users to omit them to avoid blowing up RAM, but it's a constant trap.
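      For example, an index with 100 million documents and ten fields that index norms carries roughly 100M x 10 x 1 byte, about 1 GB of norms in RAM.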

      Previously in 4.2, I tried to compress these by default, but it was too slow. My mistakes were:

      • allowing slow bits-per-value choices like bpv=5 that are implemented with expensive operations.
      • trying to wedge norms into the generalized docvalues numeric case.
      • not handling "simple" degenerate cases like a "constant norm", where every document has the same norm value.

      Instead, we can just have a separate norms format that is very careful about what it does, since we understand the general patterns in the data (see the sketch after this list):

      • uses CONSTANT compression (just writes the single value to the metadata) when all values are the same.
      • only compresses to bitsPerValue = 1, 2, or 4 (this also happens often, for very short text fields like person names and other structured data).
      • otherwise, if you would need 5, 6, 7, or 8 bits per value, we just continue to do what we do today and encode as a byte[]. Maybe we can improve this later, but this ensures we don't have a performance impact.
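
      A minimal sketch of that selection logic; the class and method names here are hypothetical (the actual implementation is the new Lucene49NormsFormat):

      {code:java}
// Hypothetical sketch of the strategy choice above; names are
// illustrative only, not the real Lucene49NormsFormat code.
import java.util.HashSet;
import java.util.Set;

class NormsStrategySketch {
  enum Strategy { CONSTANT, PACKED, UNCOMPRESSED_BYTES }

  static Strategy choose(byte[] norms) {
    Set<Byte> uniqueValues = new HashSet<>();
    for (byte b : norms) {
      uniqueValues.add(b);
    }
    if (uniqueValues.size() == 1) {
      // CONSTANT: every document has the same norm, so the single
      // value goes in the metadata and nothing is stored per-document.
      return Strategy.CONSTANT;
    }
    // Bits needed to address a table of the unique values.
    int bitsRequired = 32 - Integer.numberOfLeadingZeros(uniqueValues.size() - 1);
    if (bitsRequired <= 4) {
      // Pack with a "fast" width of 1, 2, or 4 bits per value
      // (3 rounds up to 4), avoiding expensive decode operations.
      return Strategy.PACKED;
    }
    // 5-8 bits per value: packing would need slow shift/mask decoding,
    // so keep doing what we do today and store one byte per document.
    return Strategy.UNCOMPRESSED_BYTES;
  }
}
      {code}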

        Activity

        Robert Muir added a comment -

        Patch.

        As a simple test, I indexed geonames (it's 8M documents):

        Trunk: 158,279,213 bytes RAM
        Patch: 36,446,880 bytes RAM
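        That's roughly a 4.3x reduction in norms RAM.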

        Adrien Grand added a comment -

        +1

        Adrien Grand added a comment -

        I'm wondering if we could have another format that would handle the case where there is a long tail of rare norm values. E.g. if there are 100 unique values but 95% of documents use only 3 of them: we could store norm values for those 95% of documents using TABLE_COMPRESSED (2 bits per value, including 1 special value saying that the norm is not there) and keep the other ones on disk?
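
        A rough sketch of what that hybrid could look like; everything here is hypothetical, and an in-memory Map stands in for the on-disk rare values:

        {code:java}
// Hypothetical sketch of the long-tail idea: the 3 common norm values
// live in a small table addressed by 2-bit ordinals, with one ordinal
// reserved as an "exception" marker for rare values stored elsewhere.
import java.util.HashMap;
import java.util.Map;

class LongTailNormsSketch {
  static final int EXCEPTION_ORD = 3;  // 2 bits give ordinals 0..3; reserve 3 as "not here"
  final byte[] commonValues;           // the 3 most frequent norm values
  final byte[] ords;                   // one 2-bit ordinal per doc (a full byte here, for clarity)
  final Map<Integer, Byte> rare;       // stands in for the on-disk rare values

  LongTailNormsSketch(byte[] norms, byte[] commonValues) {
    this.commonValues = commonValues;
    this.ords = new byte[norms.length];
    this.rare = new HashMap<>();
    for (int doc = 0; doc < norms.length; doc++) {
      int ord = EXCEPTION_ORD;
      for (int i = 0; i < commonValues.length; i++) {
        if (commonValues[i] == norms[doc]) { ord = i; break; }
      }
      ords[doc] = (byte) ord;
      if (ord == EXCEPTION_ORD) {
        rare.put(doc, norms[doc]);     // the real thing would write these to disk
      }
    }
  }

  byte norm(int doc) {
    int ord = ords[doc];
    return ord == EXCEPTION_ORD ? rare.get(doc) : commonValues[ord];
  }
}
        {code}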

        Robert Muir added a comment -

        Adrien, it's a good idea, basically a generalization of the sparse case. I wanted to tackle this, but decided against it here: the idea is just to improve Lucene's defaults. This patch handles sparsity to some extent via low bPV and constant compression. Nothing sophisticated, but I think it's effective enough as a step.

        Michael McCandless added a comment -

        +1

        Ryan Ernst added a comment -

        This looks great!

        One concern: the uniqueValues.toArray() call doesn't guarantee any order, right? It doesn't look like it matters for correctness, but I would expect idempotence from the format, at least for reproducibility of tests.

        Robert Muir added a comment -

        It's not a property we guarantee (e.g. the SegmentInfo.files() set, FieldInfos.attributes(), and various other places in the index write unordered sets where it does not matter), but we can add an Arrays.sort; this array is always <= 256 elements.
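
        For illustration, a minimal hypothetical sketch of building the decode table deterministically (the actual change is just the Arrays.sort):

        {code:java}
// Minimal sketch: copy the unique-values set into the decode table and
// sort it so the written format is deterministic. Names are hypothetical.
import java.util.Arrays;
import java.util.Set;

class TableOrderSketch {
  static byte[] deterministicTable(Set<Byte> uniqueValues) {
    byte[] decode = new byte[uniqueValues.size()]; // always <= 256 entries
    int upto = 0;
    for (byte b : uniqueValues) {
      decode[upto++] = b;                          // HashSet iteration order is unspecified
    }
    Arrays.sort(decode);                           // cheap, and makes tests reproducible
    return decode;
  }
}
        {code}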

        ASF subversion and git services added a comment -

        Commit 1601606 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1601606 ]

        LUCENE-5743: Add Lucene49NormsFormat

        ASF subversion and git services added a comment -

        Commit 1601625 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1601625 ]

        LUCENE-5743: Add Lucene49NormsFormat

        Robert Muir added a comment -

        I added the Arrays.sort(), and also took a step towards a BaseNormsFormatTestCase. I've always been concerned that we didn't have enough tests exercising the norms directly...


People

    • Assignee: Unassigned
    • Reporter: Robert Muir
    • Votes: 0
    • Watchers: 6
