Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Norms can eat up a lot of RAM, since by default it's 8 bits per field per document. We rely upon users to omit them to avoid blowing up RAM, but it's a constant trap.

      Previously, in 4.2, I tried to compress these by default, but it was too slow. My mistakes were:

      • allowing slow bit widths such as bpv=5 that are implemented with expensive operations.
      • trying to wedge norms into the generalized docvalues numeric case.
      • not handling "simple" degraded cases like "constant norm" (the same norm value for every document).

      Instead, we can just have a separate norms format that is very careful about what it does, since we understand in general the patterns in the data:

      • uses CONSTANT compression (just writes the single value to metadata) when all values are the same.
      • only compresses to bitsPerValue = 1, 2, or 4 (this also happens often, e.g. for very short text fields like person names and other structured data).
      • otherwise, if 5, 6, 7, or 8 bits per value would be needed, we just continue to do what we do today and encode as byte[]. Maybe we can improve this later, but this ensures we don't have a performance impact.
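The encoding decision above can be sketched as follows. This is a hypothetical illustration, not the actual Lucene49NormsFormat code; the class and enum names are made up for the example.

```java
// Hypothetical sketch of the encoding decision described above; the real
// Lucene49NormsFormat logic differs in detail.
public class NormsEncodingChooser {
    public enum Encoding { CONSTANT, PACKED_1, PACKED_2, PACKED_4, UNCOMPRESSED_BYTES }

    // Decide how to encode one field's norms from its set of unique byte values.
    public static Encoding choose(byte[] uniqueValues) {
        int count = uniqueValues.length;
        if (count == 1) {
            return Encoding.CONSTANT;   // single value: store it in metadata only
        }
        // A table of n unique values needs ceil(log2(n)) bits per document.
        int bitsRequired = 32 - Integer.numberOfLeadingZeros(count - 1);
        if (bitsRequired == 1) return Encoding.PACKED_1;
        if (bitsRequired == 2) return Encoding.PACKED_2;
        if (bitsRequired <= 4) return Encoding.PACKED_4;
        // 5-8 bits per value: packing is expensive to decode, so fall back to plain byte[]
        return Encoding.UNCOMPRESSED_BYTES;
    }
}
```

With 3 unique values this picks 2-bit packing; with 17 or more it falls back to the uncompressed byte[] path, matching the "don't regress performance" constraint.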

        Activity

        rcmuir Robert Muir added a comment -

        Patch.

        As a simple test, I indexed geonames (it's 8M documents):

        Trunk: 158,279,213 bytes RAM
        Patch: 36,446,880 bytes RAM

        jpountz Adrien Grand added a comment -

        +1

        jpountz Adrien Grand added a comment -

        I'm wondering if we could have another format to handle the case where there is a long tail of rare norm values. E.g. if there are 100 unique values but 95% of documents use only 3 unique values: we could store norm values for those 95% of documents using TABLE_COMPRESSED (2 bits per value, including 1 special value saying that the norm is not there) and the other ones on disk?
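The "table with an exception slot" idea can be illustrated with a small sketch. This is not Lucene code; the class, field, and code values are invented for the example, and a real implementation would keep the exceptions on disk rather than in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the proposal above: the 3 most frequent norm values
// get 2-bit codes 0..2, and code 3 means "this document's norm lives in a
// side exception store (e.g. on disk)".
public class TableWithExceptions {
    static final int EXCEPTION_CODE = 3;

    final byte[] table;                                     // common values, index = code
    final Map<Byte, Integer> codeOf = new HashMap<>();
    final Map<Integer, Byte> exceptions = new HashMap<>();  // docID -> rare value

    TableWithExceptions(byte[] commonValues) {
        table = commonValues.clone();
        for (int i = 0; i < table.length; i++) codeOf.put(table[i], i);
    }

    // Encode one document's norm into a 2-bit code.
    int encode(int docID, byte norm) {
        Integer code = codeOf.get(norm);
        if (code != null) return code;
        exceptions.put(docID, norm);    // rare value: spill to side storage
        return EXCEPTION_CODE;
    }

    // Decode a 2-bit code back to the norm value.
    byte decode(int docID, int code) {
        return code == EXCEPTION_CODE ? exceptions.get(docID) : table[code];
    }
}
```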

        rcmuir Robert Muir added a comment -

        Adrien, it's a good idea, basically a generalization of the sparse case. I wanted to tackle this, but decided against it here; the idea is to just improve Lucene's defaults. This patch handles sparsity to some extent via low bitsPerValue and CONSTANT compression. Nothing sophisticated, but I think effective enough as a step.

        mikemccand Michael McCandless added a comment -

        +1

        rjernst Ryan Ernst added a comment -

        This looks great!

        One concern: the uniqueValues.toArray() call doesn't guarantee any order, right? It doesn't look like it matters for correctness, but I would expect determinism from the format, at least for reproducibility of tests.

        rcmuir Robert Muir added a comment -

        It's not a property we guarantee (e.g. the SegmentInfo.files() set, FieldInfos.attributes(), and various other places in the index write unordered sets where it does not matter), but we can add an Arrays.sort(); this array is always <= 256 elements.
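The determinism issue and the Arrays.sort() fix discussed above amount to the following. This is a minimal standalone illustration, not the patch itself; the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Set;

// A Set's iteration (and toArray) order is unspecified, so the same logical
// set of norm values could produce differently ordered tables across runs.
// Sorting the (<= 256 element) array makes the output deterministic.
public class DeterministicTable {
    public static byte[] toSortedTable(Set<Byte> uniqueValues) {
        byte[] table = new byte[uniqueValues.size()];
        int i = 0;
        for (byte b : uniqueValues) table[i++] = b;  // iteration order unspecified
        Arrays.sort(table);                          // fix: impose a stable order
        return table;
    }
}
```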

        jira-bot ASF subversion and git services added a comment -

        Commit 1601606 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1601606 ]

        LUCENE-5743: Add Lucene49NormsFormat

        jira-bot ASF subversion and git services added a comment -

        Commit 1601625 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1601625 ]

        LUCENE-5743: Add Lucene49NormsFormat

        rcmuir Robert Muir added a comment -

        I added the Arrays.sort(), and also took a step towards a BaseNormsFormatTestCase. I've always been concerned that we didn't have enough tests exercising the norms directly...


          People

          • Assignee: Unassigned
          • Reporter: rcmuir Robert Muir
          • Votes: 0
          • Watchers: 6
