Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Indexes values to disk, but at search time loads and accesses them via simple Java arrays (i.e. no compression).
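
As a rough illustration of the idea in the description, here is a minimal sketch that does not use the Lucene codec API. The class `DirectNumericSketch` and its methods are hypothetical names invented for this example. It writes one fixed-width value per document, then loads everything into a plain `long[]` so that each lookup is a single array access with no decoding.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; this is not the DirectDocValuesFormat code.
final class DirectNumericSketch {

    // "Index time": store a count followed by one uncompressed long per document.
    static byte[] write(long[] valuesPerDoc) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * valuesPerDoc.length);
        buf.putInt(valuesPerDoc.length);
        for (long v : valuesPerDoc) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // "Search time": load all values once into a simple array.
    static long[] load(byte[] stored) {
        ByteBuffer buf = ByteBuffer.wrap(stored);
        long[] values = new long[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] loaded = load(write(new long[] {42L, -7L, 1_000_000L}));
        // Looking up the value for docID 1 is a plain array read.
        System.out.println(loaded[1]);
    }
}
```

The trade-off in this sketch is the one the description names: more RAM per value in exchange for no per-lookup decompression.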

      1. LUCENE-5296.patch
        35 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        Adrien Grand added a comment -

        This looks good to me. I'm just curious why you decided to implement a dedicated consumer instead of internally reusing e.g. DiskDocValuesFormat, in a similar fashion to what DirectPostingsFormat does with Lucene41PostingsFormat? Is it to avoid doing too much work upon reopen to compute things like byte widths for numerics?

        Michael McCandless added a comment -

        Thanks Adrien.

        I'm just curious why you decided to implement a dedicated consumer instead of internally reusing e.g. DiskDocValuesFormat, in a similar fashion to what DirectPostingsFormat does with Lucene41PostingsFormat? Is it to avoid doing too much work upon reopen to compute things like byte widths for numerics?

        Actually, it hadn't occurred to me to "wrap" like we did for DirectPF, I guess because a DVF is so much easier to write than a PF ... we could consider doing that.

        But, I do think it's good to minimize work on loading values at search time... I actually started from MemoryDVF/C/P and then iterated to the "simple arrays".

        Adrien Grand added a comment -

        OK, thanks for the explanation! I have a few other remarks/questions on the patch:

        • why do you subtract 200 from Integer.MAX_VALUE to compute the maximum number of bytes/ords?
        • sum looks unused in addNumericFieldValues

        Otherwise, +1 to commit.

        Michael McCandless added a comment -

        Thanks for the review Adrien!

        why do you subtract 200 from Integer.MAX_VALUE to compute the maximum number of bytes/ords?

        That's because the exact maximum size for an array seems to vary across JVMs by some "small" amount less than Integer.MAX_VALUE. I'll put a comment explaining this ... actually, we also do this in BinaryDocValuesWriter; I'll factor it out & share it.
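
The headroom described above can be sketched as a shared constant plus a length check. The class `ArrayLimitSketch`, the constant names, and the `checkedLength` helper are assumptions made for this illustration, not the code that was factored out in the commit. The only value taken from the discussion is the 200-element margin.

```java
// Hypothetical illustration only; names are not taken from the patch.
final class ArrayLimitSketch {

    // Margin taken from the discussion: JVMs may reject arrays slightly
    // smaller than Integer.MAX_VALUE, so stay a little below it.
    static final int HEADROOM = 200;
    static final int MAX_SAFE_ARRAY_LENGTH = Integer.MAX_VALUE - HEADROOM;

    // Validates a requested element count before it is used as an array length.
    static int checkedLength(long requested) {
        if (requested < 0 || requested > MAX_SAFE_ARRAY_LENGTH) {
            throw new IllegalArgumentException(
                "cannot hold " + requested + " values; limit is " + MAX_SAFE_ARRAY_LENGTH);
        }
        return (int) requested;
    }

    public static void main(String[] args) {
        System.out.println(MAX_SAFE_ARRAY_LENGTH);
        System.out.println(checkedLength(1_000L));
    }
}
```

Keeping the limit in one place means every writer that sizes an array from a value count fails with the same clear error instead of an allocation failure deep inside the JVM.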

        sum looks unused in addNumericFieldValues

        Whoops, I'll remove it.

        ASF subversion and git services added a comment -

        Commit 1537105 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1537105 ]

        LUCENE-5296: add DirectDocValuesFormat

        ASF subversion and git services added a comment -

        Commit 1537108 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1537108 ]

        LUCENE-5296: add DirectDocValuesFormat

        Michael McCandless added a comment -

        Thanks Adrien!

        ASF subversion and git services added a comment -

        Commit 1537140 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1537140 ]

        LUCENE-5296: clarify the 2.1B value count limit for sorted set field

        ASF subversion and git services added a comment -

        Commit 1537141 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1537141 ]

        LUCENE-5296: clarify the 2.1B value count limit for sorted set field

        Shai Erera added a comment -

        Now that we have this codec, does it make sense to keep FacetDVF? As far as I can tell, the only difference is that FacetDVF keeps the addresses as PackedInts while DirectDVF keeps them as int[]?
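
To make the trade-off in this question concrete, here is a minimal sketch of bit-packed address storage next to a plain `int[]`. The class `PackedAddressesSketch` is a hypothetical stand-in written for this example. It is not Lucene's PackedInts implementation and makes no claim about how FacetDVF lays out its data.

```java
// Hypothetical illustration only; this is not Lucene's PackedInts.
final class PackedAddressesSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedAddressesSketch(int[] addresses) {
        int max = 0;
        for (int a : addresses) {
            if (a < 0) {
                throw new IllegalArgumentException("addresses must be non-negative");
            }
            max = Math.max(max, a);
        }
        // Use just enough bits to represent the largest address.
        bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) addresses.length * bitsPerValue;
        blocks = new long[(int) ((totalBits + 63) / 64)];
        for (int i = 0; i < addresses.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= ((long) addresses[i]) << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two longs; store its high bits in the next one.
                blocks[block + 1] |= ((long) addresses[i]) >>> (64 - shift);
            }
        }
    }

    // Each read needs shifts and a mask, unlike a plain int[] lookup.
    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return (int) (value & mask);
    }

    long ramBytes() {
        return (long) blocks.length * Long.BYTES;
    }

    public static void main(String[] args) {
        int[] addresses = {0, 5, 17, 100, 1000, 70000};
        PackedAddressesSketch packed = new PackedAddressesSketch(addresses);
        System.out.println("packed bytes: " + packed.ramBytes());
        System.out.println("int[] bytes:  " + (long) addresses.length * Integer.BYTES);
    }
}
```

In this sketch the packed form saves space when the largest address needs far fewer than 32 bits, while the `int[]` form avoids the shift-and-mask work on every read.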

        Michael McCandless added a comment -

        Now that we have this codec, does it make sense to keep FacetDVF? As far as I can tell, the only difference is that FacetDVF keeps the addresses as PackedInts while DirectDVF keeps them as int[]?

        Hmm that's a good question. I'll test the two...

        Michael McCandless added a comment -

        The difference in RAM usage was tiny in my test: 139.9 MB for
        Facet42DVF and 140.7 MB for DirectDVF. For smaller indices, or
        indices with fewer facet fields, the difference could be bigger ...

        Perf change:

        Report after iter 19:
                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell       53.84      (3.1%)       51.27      (3.3%)   -4.8% ( -10% -    1%)
                          Fuzzy2       25.96      (1.9%)       24.90      (2.0%)   -4.1% (  -7% -    0%)
                          Fuzzy1       31.56      (2.2%)       31.29      (2.4%)   -0.9% (  -5% -    3%)
                      AndHighLow       60.74      (2.9%)       60.25      (2.7%)   -0.8% (  -6% -    5%)
                    OrNotHighLow       33.05      (4.0%)       33.25      (4.2%)    0.6% (  -7% -    9%)
                     LowSpanNear        7.54      (3.3%)        7.62      (3.7%)    1.1% (  -5% -    8%)
                      AndHighMed       17.44      (1.3%)       17.65      (1.1%)    1.2% (  -1% -    3%)
                         LowTerm       42.22      (2.4%)       42.83      (2.1%)    1.5% (  -3% -    6%)
                     MedSpanNear       19.07      (2.4%)       19.37      (2.4%)    1.6% (  -3% -    6%)
                    OrNotHighMed       20.66      (3.3%)       20.99      (3.6%)    1.6% (  -5% -    8%)
                 LowSloppyPhrase       25.21      (1.8%)       25.66      (1.9%)    1.8% (  -1% -    5%)
                     AndHighHigh       15.31      (1.4%)       15.60      (1.3%)    1.9% (   0% -    4%)
                       LowPhrase        9.69      (5.4%)        9.89      (5.2%)    2.1% (  -8% -   13%)
                         Prefix3       15.75      (1.5%)       16.16      (1.6%)    2.6% (   0% -    5%)
                    HighSpanNear        3.87      (3.0%)        3.99      (3.1%)    3.0% (  -2% -    9%)
                       MedPhrase       45.71      (3.4%)       47.19      (3.7%)    3.2% (  -3% -   10%)
                 MedSloppyPhrase        3.30      (6.2%)        3.42      (7.8%)    3.6% (  -9% -   18%)
                        HighTerm       10.69      (1.3%)       11.09      (1.4%)    3.7% (   0% -    6%)
                    OrHighNotMed        9.07      (1.2%)        9.46      (1.7%)    4.3% (   1% -    7%)
                        Wildcard        5.84      (1.1%)        6.15      (1.5%)    5.3% (   2% -    8%)
                HighSloppyPhrase        3.43      (8.0%)        3.62     (11.4%)    5.5% ( -12% -   27%)
                         MedTerm       15.65      (1.6%)       16.51      (2.0%)    5.5% (   1% -    9%)
                      HighPhrase        3.14      (6.8%)        3.32      (6.9%)    5.5% (  -7% -   20%)
                       OrHighMed        7.25      (1.4%)        7.66      (1.8%)    5.7% (   2% -    8%)
                   OrHighNotHigh        6.13      (1.8%)        6.50      (2.2%)    6.0% (   1% -   10%)
                    OrHighNotLow        5.44      (1.3%)        5.82      (1.4%)    7.0% (   4% -    9%)
                   OrNotHighHigh       12.42      (1.9%)       13.31      (2.3%)    7.2% (   2% -   11%)
                       OrHighLow        3.83      (1.7%)        4.27      (2.1%)   11.4% (   7% -   15%)
                      OrHighHigh        2.85      (1.7%)        3.23      (2.1%)   13.2% (   9% -   17%)
                          IntNRQ        2.13      (1.5%)        2.50      (1.2%)   17.1% (  14% -   20%)
        

        I think we should remove Facet42DVF/Codec?

        Shai Erera added a comment -

        +1. I opened LUCENE-5321 since I want to address FacetCodec changes in general.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 4
