Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      For both NUMERIC fields and ordinals of SORTED fields, we store data in a dense way. As a consequence, if you have only 1000 documents out of 1B that have a value, and 8 bits are required to store those 1000 numbers, we will not require 1KB of storage, but 1GB.

      I suspect this mostly happens in abuse cases, but still it's a pity that we explode storage requirements. We could try to detect sparsity and compress accordingly.

      Attachments

        1. LUCENE-6863.patch
          38 kB
          Adrien Grand
        2. LUCENE-6863.patch
          35 kB
          Adrien Grand
        3. LUCENE-6863.patch
          33 kB
          Adrien Grand

          Activity

            jpountz Adrien Grand added a comment -

            Here is a patch. It detects sparsity by checking whether less than 10% of documents have a value. When this happens, it stores the doc IDs that have a value using a DirectMonotonicWriter and the non-missing values as a regular numeric field (so values will still be compressed with table/gcd compression if applicable).

            I first wanted to update the format javadocs but noticed that the low-level javadocs were very outdated (still documenting that sorted doc values were stored in an FST), so I decided to remove the low-level description in favour of the high-level one, which is much more interesting in my opinion.

            This is just a first iteration: tests pass, but the heuristic may need more thought. I will also do some benchmarking.
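
            To make the idea concrete, here is a minimal, self-contained sketch of the sparse layout (hypothetical names; the actual patch encodes the doc IDs with DirectMonotonicWriter rather than a plain array): only documents that have a value get a slot, and a lookup binary-searches the sorted doc IDs.

              import java.util.Arrays;

              /** Sketch of a sparse numeric doc-values reader: one slot per document
               *  that has a value, instead of one slot per document in the segment. */
              final class SparseNumericDocValues {
                private final int[] docsWithField; // sorted doc IDs that have a value
                private final long[] values;       // values[i] belongs to docsWithField[i]

                SparseNumericDocValues(int[] docsWithField, long[] values) {
                  this.docsWithField = docsWithField;
                  this.values = values;
                }

                /** Returns the value for docID, or 0 if the document has no value. */
                long get(int docID) {
                  int idx = Arrays.binarySearch(docsWithField, docID);
                  return idx >= 0 ? values[idx] : 0L;
                }
              }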

            yseeley@gmail.com Yonik Seeley added a comment -

            +1 for sparse support! Sparse fields are a fact of life for a lot of users. Sometimes they can encode the info into a common field, but it's often not straightforward to do so.

            rcmuir Robert Muir added a comment -

            Here is a patch. It detects sparsity by checking whether less than 10% of documents have a value.

            Please put a minimum count like 1024 where this starts to make sense, so types aren't flipping back and forth during ordinary NRT search. And where did 10% come from? It seems far too aggressive. In previous work we determined much lower thresholds to work better (e.g. 3%).

            rcmuir Robert Muir added a comment -

            I think this needs to be a very low number if it's going to be "automatic": like 3%, pure abuse-case detection, just like how sparse norms worked.

            Otherwise, it needs to be a separate format in codecs/ that is explicit opt-in by the user.

            jpountz Adrien Grand added a comment -

            And where did 10% come from? It seems far too aggressive.

            It was just a start so that I could easily trigger usage of this compression in tests. I agree it is very likely too aggressive; this is why I said the heuristic needs more thought. I will also do some benchmarking; if this ends up being much slower than regular (delta) compression, we might want an even lower threshold.

            yseeley@gmail.com Yonik Seeley added a comment -

            It was just a start so that I could easily trigger usage of this compression in tests.

            Adding a small constant always helps with test coverage... (n*0.05 + 5) for example.

            varun Varun Thacker added a comment -

            Hi Adrien,

            Great idea to use heuristics and apply this automatically. Looking forward to the benchmarks, which I never got to in LUCENE-5688.

            Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicates of this Jira, or could there still be plans for a specialized doc values format?

            jpountz Adrien Grand added a comment -

            Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicates of this Jira, or could there still be plans for a specialized doc values format?

            I think it would be a bit premature to say the other Jiras are superseded by this one; it's not clear yet whether the approach proposed here is actually a better idea and/or could make it into the default codec. I suggest that we wait to see how happy benchmarks are with this patch first.

            jpountz Adrien Grand added a comment -

            I ran some benchmarks with the geonames dataset, which has a few sparse fields:

            • cc2: 3.2% of documents have this field, which has 573 unique values
            • admin4: 4.3% of documents have this field, which has 102950 unique values
            • admin3: 10.2% of documents have this field, which has 73120 unique values
            • admin2: 45.3% of documents have this field, which has 30603 unique values

            First I enabled sparse compression on all fields, regardless of density, to see how it compares to the delta compression that we use by default, and then ran two kinds of queries:

            • queries on a random partition of the index, which I guess would be the case when you have true sparse fields
            • a query only on documents that have a value, which I guess would be more realistic if you store several types of data in the same index that don't have the same fields
            Field  | disk usage for ordinals | memory usage with sparse compression enabled | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | -88% | 1680 bytes | -27%  | +25%  | +58%  | +208%
            admin4 | -86% | 568 bytes  | -20%  | +7%   | -20%  | +214%
            admin3 | -67% | 1312 bytes | +11%  | +57%  | +42%  | +236%
            admin2 | +17% | 2904 bytes | +132% | +275% | +331% | +221%

            The reduction in disk usage is significant, but so is the slowdown, especially when running a query that only matches docs that have a value. However, memory usage looks acceptable to me for 10M docs.

            I couldn't test with 3% since even the rarest field is present in 3.2% of documents, but I updated the heuristic to require at least 1024 docs in the segment (like Robert suggested) and that less than 5% of docs have a value:

            Field  | memory usage due to sparse compression | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | 1680 bytes | -10% | +34% | +62% | +214%
            admin4 | 568 bytes  | -7%  | +20% | -14% | +241%
            admin3 | 576 bytes  | +9%  | +7%  | +11% | +10%
            admin2 | 1008 bytes | +1%  | +8%  | +9%  | +11%

            To my surprise, admin2 and admin3 were still using sparse compression on some segments. The reason is that documents with sparse values are not spread uniformly through the dataset but rather clustered. I suspect this partially explains the slowdown for admin2/admin3; maybe HotSpot also dislikes having more implementations to deal with.

            jpountz Adrien Grand added a comment -

            Here is an updated patch that makes the sparse impl a bit more efficient when consumed in sequential order, by keeping track of the upper bound of the current window. This is what was used in the above benchmark.

            I also updated the heuristic to require 1024 docs in the segment and that less than 1% of docs have a value, in order to be on the safe side and only slow down abuse/exceptional cases. Even if/when this gets used for some fields, I think the slowdown is acceptable insofar as it would only slow down fast queries: if you look at the above benchmarks, when the query matches many docs (such as a MatchAllDocsQuery) this encoding is actually faster than regular delta encoding. Only queries that match a small partition of the index (so that most dv lookups require a binary search) would become slower.
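
            For reference, the heuristic described above boils down to a check along these lines (a sketch with hypothetical names, not the patch's actual code):

              final class SparseHeuristic {
                /** Use the sparse layout only for segments large enough that the decision
                 *  is stable across NRT flushes, and only for genuinely sparse fields. */
                static boolean useSparseCompression(int maxDoc, int docsWithValue) {
                  return maxDoc >= 1024 && (long) docsWithValue * 100 < maxDoc; // < 1% of docs
                }
              }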

            Opinions?

            dsmiley David Smiley added a comment -

            Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

            Regarding the performance trade-off in your table, I find it hard to evaluate whether it's worth it. Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long? Or is this measurement isolated to... somehow just the sort part? How fast was this query anyway? If we're making a 3ms query take 9ms then it wouldn't bother me so much as a 300ms query taking 900ms. Of course it depends on the amount of data.

            jpountz Adrien Grand added a comment -

            Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

            I took this approach because I have seen a couple of setups with lots of fields, many of them sparse, and the fact that sparse fields require as many resources as dense ones is a bit frustrating. The issue description focuses on disk usage, but I think memory usage is even more important. Obviously I had to increase memory usage (since numerics don't require any memory at all today), but I wanted to keep it very low. For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache. In contrast, the patch keeps memory usage very low and keeps the access pattern sequential by keeping track of the previous/next documents that have a value and using exponential search (a variant of binary search, sketched at the end of this comment) to seek forward.

            Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long?

            I computed

            (${new response time} - ${old response time}) / ${old response time}

            so it means more than 3x slower, actually. However, this does not include loading stored fields, just computing the top document by calling IndexSearcher.search(query, 1, sort).

            How fast was this query anyway?

            Here are the times it takes to run these queries on trunk (in milliseconds).

            Field  | sort time on a MatchAllDocsQuery | sort time on a term query that matches 10% of docs | sort time on a term query that matches 1% of docs | sort time on a term query that matches docs that have the field
            cc2    | 128 | 20 | 2 | 6
            admin4 | 122 | 19 | 3 | 7
            admin3 | 116 | 18 | 3 | 17
            admin2 | 121 | 19 | 3 | 78

            I've put the response times that we are making significantly slower in red and those that we are making faster in green. So as you can see, the queries that we are speeding up are among the slower ones, and those that we are slowing down are among the faster ones.
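
            The forward-seeking exponential search mentioned above could look roughly like this (a self-contained sketch with hypothetical names, not the patch's actual code): starting from the position of the previous lookup, the window grows exponentially until it passes the target, then a binary search runs inside that window, so sequential consumers do very little work per document.

              import java.util.Arrays;

              /** Sketch of a forward-only cursor over the sorted doc IDs that have a value. */
              final class SparseCursor {
                private final int[] docsWithField; // sorted doc IDs that have a value
                private int index = 0;             // position of the previous lookup

                SparseCursor(int[] docsWithField) {
                  this.docsWithField = docsWithField;
                }

                /** Index of the first entry >= target, assuming targets never decrease. */
                int advance(int target) {
                  int bound = 1;
                  // Grow the window exponentially until it reaches past the target.
                  while (index + bound < docsWithField.length
                      && docsWithField[index + bound] < target) {
                    bound <<= 1;
                  }
                  // Binary-search within the window that must contain the target.
                  int lo = index + (bound >> 1);
                  int hi = Math.min(docsWithField.length, index + bound + 1);
                  int found = Arrays.binarySearch(docsWithField, lo, hi, target);
                  index = found >= 0 ? found : -found - 1; // recover insertion point
                  return index;
                }
              }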

            dsmiley David Smiley added a comment -

            For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache.

            If it's sparse (thus small) and the user has chosen to use docValues because it's going to be sorted/faceted/etc (thus it's likely 'hot' i.e. used) then I think it's reasonable to expect it'll be in the filesystem cache?

            Nonetheless what you've done here is good. I think that an in-memory hash would be a nice alternative for those that elect for the trade-off. On a semi-related note, I wonder how well a 4-byte int FST -> long would perform.

            Here are the times it takes to run these queries on trunk (in milliseconds).

            Ah; looks like a fair trade-off to me. Thanks for these details.

            erickerickson Erick Erickson added a comment -

            Bypassing the question of whether one should supersede the other, but linking them so they can be found more easily.

            jpountz Adrien Grand added a comment -

            Updated patch that:

            • makes the code a bit more readable and adds comments
            • avoids loading a slice for values when only docs with field are requested
            • saves some monotonic lookups

            Here is an updated result of the benchmark (still with a threshold of 5% for benchmarking purposes, even though the patch still has a threshold of 1%), computed exactly the same way as above. It makes the slowdown a bit more contained. Times are in ms.

            Field  | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | 128→99 (-23%)  | 21.8→23.8 (+9%) | 2.92→4.33 (+48%) | 6.84→13.0 (+90%)
            admin4 | 121→98 (-19%)  | 21.4→21.1 (-1%) | 3.65→2.81 (-23%) | 8.36→16.6 (+98%)
            admin3 | 116→125 (+1%)  | 20.6→20.0 (-3%) | 3.20→3.24 (+1%)  | 18.9→19.4 (+8%)
            admin2 | 124→132 (+6%)  | 21.5→20.6 (-4%) | 3.30→3.49 (+6%)  | 8.58→8.64 (+1%)

            I think the change is good to go, but I know this can be controversial. Please let me know if you have concerns, otherwise I plan to commit it by the end of the week.


            jira-bot ASF subversion and git services added a comment -

            Commit 1712957 from jpountz in branch 'dev/trunk'
            [ https://svn.apache.org/r1712957 ]

            LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.


            jira-bot ASF subversion and git services added a comment -

            Commit 1712973 from jpountz in branch 'dev/branches/branch_5x'
            [ https://svn.apache.org/r1712973 ]

            LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.

            Kayne Kayne Wu added a comment -

            Why is this improvement not committed in Lucene 7.x and 8.x?
            Is there some negative impact for Lucene?

            dsmiley David Smiley added a comment -

            It is in every version of Lucene since 5.4. I suspect what gave rise to this confusion is that the "Fix Version" has two versions. If it only had 5.4, then I think you wouldn't have asked? FWIW, I always use one version when resolving my issues since I find doing otherwise confusing and redundant.

            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #7921.


            People

              Assignee: jpountz Adrien Grand
              Reporter: jpountz Adrien Grand
              Votes: 1
              Watchers: 10
