Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      DocValues fields can be very wasteful if you are storing dates (like solr's TrieDateField does if you enable docvalues) and don't actually need all the precision: e.g. "date-only" fields like date of birth with no time component, time fields without milliseconds precision, and so on.

      Ideally we'd compute GCD of all the values to save space (numberOfTrailingZeros is not really enough here), but i think we should at least look for values like 86400000, 3600000, and 1000 to be practical.
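      As a rough illustration of the idea (the values and divisor below are made up for the example, not taken from any patch): if every value in a "date of birth" field is a millisecond timestamp at midnight UTC, all values share the divisor 86400000 (one day), so a codec can store value/86400000 plus the divisor once, and multiply them back at read time:

          long gcd = 86400000L;              // common divisor shared by all values (one day in millis)
          long value = 1367366400000L;       // 2013-05-01T00:00:00Z as epoch millis (~41 bits)
          long stored = value / gcd;         // 15826, fits in 14 bits
          long restored = stored * gcd;      // 1367366400000L again on the read side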

      1. LUCENE-4936.patch
        53 kB
        Robert Muir
      2. LUCENE-4936.patch
        53 kB
        Adrien Grand
      3. LUCENE-4936.patch
        51 kB
        Robert Muir
      4. LUCENE-4936.patch
        49 kB
        Robert Muir
      5. LUCENE-4936.patch
        48 kB
        Adrien Grand
      6. LUCENE-4936.patch
        48 kB
        Adrien Grand
      7. LUCENE-4936.patch
        47 kB
        Adrien Grand
      8. LUCENE-4936.patch
        30 kB
        Adrien Grand
      9. LUCENE-4936.patch
        4 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's my hack patch... I think we should do something for DiskDV too though... and of course add tests.

        Adrien Grand added a comment -

        Patch:

        • Adds MathUtil.gcd(long, long)
        • Adds "GCD compression" to Lucene42, Disk and CheapBastard.
        • Improves BaseDocValuesFormatTest which almost only tested "TABLE_COMPRESSED" with Lucene42DVF
        • No more attempts to compress storage when the values are known to be dense, such as SORTED ords.

        I measured how much slower doc values indexing is with these new checks, and it is completely unnoticeable with random or dense values since the GCD quickly reaches 1. When the GCD is larger, it only made indexing 2% slower (every doc has a single field which is a NumericDocValuesField). So I think it's fine.
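        As a minimal sketch of the running-GCD idea (illustration only, not the actual MathUtil.gcd or Lucene42 consumer code; it assumes values are not Long.MIN_VALUE, where Math.abs would overflow):

            static long gcd(long a, long b) {      // plain Euclidean gcd on non-negative longs
              while (b != 0) {
                long t = b;
                b = a % b;
                a = t;
              }
              return a;
            }

            static long commonDivisor(long[] values) {
              long g = 0;                          // gcd(0, x) == x, so 0 is the neutral start
              for (long v : values) {
                g = gcd(g, Math.abs(v));           // fold each value into the running gcd
                if (g == 1) {
                  break;                           // no shared divisor, nothing to gain
                }
              }
              return g;
            }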

        Robert Muir added a comment -

        Looks great! I'm glad you were able to make this fast.

        A few ideas:

        • I like the switch with corruption-check on DiskDV. Can we easily integrate this into Lucene42?
        • Can we update the file format docs? (We attempt to describe the numerics strategies succinctly there.)

        I can do a more thorough review and some additional testing later, but this looks awesome.

        Later we should think about a place (maybe in codec file format docs, maybe even NumericDocValuesField?) to add some practical general guidelines for users that might not otherwise be intuitive. Stuff like: if you are putting dates in NumericDV, zero out portions you don't care about (e.g. milliseconds, time, etc.) to save space; indexing as UTC will be a little more efficient than with local offset; etc.
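        A hedged example of that guideline (the field name "dob" and the surrounding Document are just illustrative):

            Document doc = new Document();
            long millis = System.currentTimeMillis();
            long dayOnly = millis - (millis % 86400000L);          // zero out the time-of-day portion
            doc.add(new NumericDocValuesField("dob", dayOnly));    // stored values now share a divisor of 86400000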

        Improves BaseDocValuesFormatTest which almost only tested "TABLE_COMPRESSED" with Lucene42DVF

        Yeah, this is a good catch! We should also maybe open an issue to review DiskDV and try to make it more efficient. Optimizations like TABLE_COMPRESSED don't exist there I think: it could be handy if someone wants e.g. a smallfloat scoring factor. It's nice that this patch provides back compat for DiskDV, but that's not totally necessary in the future if we want to review and rewrite it. In general that codec was just done very quickly and hasn't seen much benchmarking or anything: it could use some work.

        Robert Muir added a comment -

        indexing as UTC will be a little more efficient than with local offset, etc.

        We could probably solve issues like that too (maybe in something like cheap-bastard codec), if we did a first pass to compute min/max and then did GCD only on the deltas from min... right?
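        A small sketch of that two-pass idea (illustration only; it reuses the gcd helper sketched earlier and assumes the values are close enough together that v - min does not overflow):

            static long gcdOfDeltas(long[] values) {
              long min = Long.MAX_VALUE;
              for (long v : values) {
                min = Math.min(min, v);            // first pass: find the minimum
              }
              long g = 0;
              for (long v : values) {
                g = gcd(g, v - min);               // second pass: gcd of deltas from the minimum
                if (g == 1) {
                  break;                           // the constant offset is gone, but no divisor remains
                }
              }
              return g;
            }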

        Uwe Schindler added a comment -

        indexing as UTC will be a little more efficient than with local offset, etc.

        If you use a NumericField and store the long epoch in it, it is UTC, as there is no localization involved.

        Robert Muir added a comment -

        But NumericField is totally unrelated to docvalues!

        Besides, delta+GCD has other applications than just GMT offsets, e.g. Solr's CurrencyField (at least in the US, people love prices like 199, 299, 399...): in that case it would save 9 bits per value, where it would do nothing with the current patch.

        I'm not arguing that the extra pass should be in the default codec, I just said it might be interesting for cheap-bastard or something.

        Robert Muir added a comment -

        Sorry, I was thinking about car prices when I said 9bpv, but you get the drift.

        Uwe Schindler added a comment -

        But NumericField is totally unrelated to docvalues!

        That's clear. I just said that if you use a LONG docvalues field and store the long epoch, it's always timezone-less. That was what I wanted to say. This applies to Solr, too.

        Adrien Grand added a comment -

        New patch:

        • Computes the GCD based on deltas in order to be able to compress non-UTC dates.
        • Adds support for TABLE_COMPRESSED to DiskDVF.
        • Adds tests that ensure that these new compression methods are actually used whenever applicable.
        • Adds a quick description of the compression method to Lucene42DVF javadocs.
        Uwe Schindler added a comment -

        This is really cool! I have not completely reviewed it yet, but I like that it is included in the default codec, so the user does not need to take care of it.

        Adrien Grand added a comment -

        Thank you Uwe! Unfortunately, I just figured out that the patch is broken when v - minValue overflows (in Consumer.addNumericField). I need to think about a way to fix it...
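        A worked example of the overflow (not from the patch):

            long minValue = Long.MIN_VALUE;
            long v = Long.MAX_VALUE;
            long delta = v - minValue;             // mathematically 2^64 - 1, wraps around to -1 as a long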

        Adrien Grand added a comment - edited

        Here is a work-around for the issue: the consumer stops trying to perform GCD compression as soon as it encounters a value outside the [-MAX_VALUE/2, MAX_VALUE/2] range. This prevents overflows from happening, and I can't think of a reasonable use-case that would benefit from GCD compression and have values outside of this range?

        Robert Muir added a comment -

        Yeah, I think those are ridiculously large numbers.

        I'm gonna try to help review and play with the patch. I think this is great though.

        Adrien Grand added a comment -

        Thank you Robert, I'd love to have a review to make sure the patch is correct, especially for MathUtil.gcd and the DVConsumer.addNumericField logic.

        Robert Muir added a comment -

        The first thing that sticks out is maybe to remove the extra pass? Even though it just pulls the first value...

        For the DiskDV codec I feel it's simple: just remove the loop and look for 'count == 0'. For Lucene42, it's probably best to just add 'count' for the same reason?

        But if it makes things more confusing, maybe just leave it the way it is. It's a little tricky either way.

        Adrien Grand added a comment -

        Simple ideas are often the best ones: the new patch has a single loop! Thanks, Robert!

        Uwe Schindler added a comment -

        Yeah, the duplicate loop was strange; much better now.

        One small thing:

        if (v < - Long.MAX_VALUE / 2 || v > Long.MAX_VALUE / 2) {
        

        It should maybe be changed to the following, which looks more logical to me:

        if (v < Long.MIN_VALUE / 2 || v > Long.MAX_VALUE / 2) {
        
        Robert Muir added a comment -

        Same patch: I just uploaded a test.

        For DiskDVProducer, I think the compression tables and so on should be structured to be in the metadata section rather than the data section, to minimize the per-thread instance overhead.

        Today it's:

            final IndexInput data = this.data.clone();
            data.seek(entry.offset);
        
            final BlockPackedReader reader = new BlockPackedReader(data, entry.packedIntsVersion, entry.blockSize, entry.count, true);
            return new LongNumericDocValues() {
              @Override
              public long get(long id) {
                return reader.get(id);
              }
            };
        

        I'll try to cut this stuff over...

        Robert Muir added a comment -

        Additionally, in all cases we can move the corruption check to where we read the metadata. This means we detect problems earlier, rather than when someone asks for the field for the first time. I'll do this too.

        Robert Muir added a comment -

        OK, patch with these changes: we write all this stuff in the metadata for DiskDV (to reduce overhead), plus stronger/earlier checks on startup (Lucene42, too). In getNumeric() I changed it to an AssertionError since we already checked it at open time.

        Additionally, for all corruption checks, I ensured we did:

        if (something) {
          throw new CorruptIndexException("some message, input=" + meta);
        }
        

        Passing the IndexInput (meta/data) is very useful: if someone hits the problem, they get a file name.

        I'll do some more review later...

        Robert Muir added a comment -

        I will generate backwards indexes from 4.2.0 now and commit them to branch_4x/trunk.

        It hurts nothing even if we don't commit this issue.

        Robert Muir added a comment -

        Oh... I checked, I already did this. So when we commit this one, we can just generate 4.4 indexes.

        I think I generated these when we changed the format initially, so we detect problems sooner rather than later...

        Adrien Grand added a comment -

        +1 to the proposed changes!

        Here is an updated patch that fixes the DVProducer constructors to open the data file and check the header in a try/finally block (so that data files are closed even if the header check fails).
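        As a sketch of the pattern being described (names like dataName, dataCodec and the version constants are placeholders, not the exact patch):

            boolean success = false;
            IndexInput data = null;
            try {
              data = state.directory.openInput(dataName, state.context);
              CodecUtil.checkHeader(data, dataCodec, VERSION_START, VERSION_CURRENT);
              success = true;
            } finally {
              if (!success) {
                IOUtils.closeWhileHandlingException(data);   // don't leak the file handle if the check throws
              }
            }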

        Robert Muir added a comment -

        +1: nice catch, Adrien.

        Why do we have this logic for TABLE_COMPRESSED in the DiskDV consumer?

            if (uniqueValues != null
                && ((maxValue - minValue) < 0L || (maxValue - minValue) > 256)
                && count <= Integer.MAX_VALUE) {
        

        Shouldn't this just be:

            if (uniqueValues != null && count <= Integer.MAX_VALUE) {
        

        We only care about the number of unique values; in this case it does not matter what they actually are.

        Adrien Grand added a comment -

        I guess the point was to avoid one level of indirection in case all values can be stored using a single byte. Maybe "(maxValue - minValue) > 256" should be replaced with "(maxValue - minValue) >= uniqueValues.size()"? This would ensure that table compression isn't used if values are already dense?

        Robert Muir added a comment -

        I see. In this case should we just take bitsRequired on both sides?
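        A rough sketch of that comparison, reusing the uniqueValues/minValue/maxValue/count names from the snippet above (not the committed code):

            long range = maxValue - minValue;
            boolean preferTable = uniqueValues != null
                && count <= Integer.MAX_VALUE
                && range >= 0                                            // guard against overflow of max - min
                && PackedInts.bitsRequired(uniqueValues.size() - 1)      // bits per ordinal into the table
                     < PackedInts.bitsRequired(range);                   // bits per raw delta from min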

        Adrien Grand added a comment -

        One advantage of DELTA_COMPRESSED is that it uses different numbers of bits per value per block. Even if max-min=200, it could still happen that most blocks only require 6 or 7 bits per value. If there are many blocks, this could save substantial disk/memory.

        Robert Muir added a comment -

        Only in the case of outliers... but TABLE could do this too, if it sorted its table by increasing frequency of occurrence.

        Adrien Grand added a comment -

        In this case should we just take bitsRequired on both sides?

        Yes, this makes sense!

        Robert Muir added a comment -

        (decreasing rather)

        Robert Muir added a comment -

        OK, I added the bitsRequired check in this patch... I think we are good to go here.

        +1 to commit

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1470948

        LUCENE-4936: Improve numeric doc values compression (for dates especially).

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1470953

        LUCENE-4936: Improve numeric doc values compression (for dates especially).

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            1
