Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      For both NUMERIC fields and ordinals of SORTED fields, we store data densely: every document gets a slot whether or not it has a value. As a consequence, if only 1000 documents out of 1B have a value and 8 bits are enough to store those 1000 numbers, we will require about 1GB of storage (one byte for each of the 1B documents) rather than the roughly 1KB that the actual values would need.

      I suspect this mostly happens in abuse cases, but it's still a pity that we explode storage requirements. We could try to detect sparsity and compress accordingly.

      Attachments

      1. LUCENE-6863.patch (38 kB), Adrien Grand
      2. LUCENE-6863.patch (35 kB), Adrien Grand
      3. LUCENE-6863.patch (33 kB), Adrien Grand

        Issue Links

          Activity

          Adrien Grand added a comment -

           Here is a patch. It detects sparsity by checking whether less than 10% of documents have a value. When this happens, it stores the doc IDs of documents that have a value using a DirectMonotonicWriter, and the non-missing values as a regular numeric field (so values will still be compressed with table/gcd compression if applicable).
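
           To make the encoding concrete, here is a rough write-side sketch (an illustration only, not code from the patch; the helper name, arguments, and block shift are assumptions). The doc IDs that carry a value form a strictly increasing sequence, which DirectMonotonicWriter compresses to a small number of bits per entry; the non-missing values are then written as a dense numeric field of that smaller length.

             import java.io.IOException;
             import org.apache.lucene.store.IndexOutput;
             import org.apache.lucene.util.packed.DirectMonotonicWriter;

             // Hypothetical helper: record which documents have a value.
             static void writeDocsWithField(IndexOutput meta, IndexOutput data,
                                            long[] sortedDocsWithField) throws IOException {
               DirectMonotonicWriter writer = DirectMonotonicWriter.getInstance(
                   meta, data, sortedDocsWithField.length, 16 /* assumed block shift */);
               for (long docId : sortedDocsWithField) {
                 writer.add(docId); // doc IDs are added in strictly increasing order
               }
               writer.finish();
               // The values themselves are then encoded as a regular dense numeric
               // field of length sortedDocsWithField.length, so table/gcd compression
               // still applies to them.
             }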

           I first wanted to update the format javadocs but noticed that the low-level javadocs were very outdated (still documenting that sorted doc values were stored in an FST), so I decided to remove the low-level description in favour of the high-level one, which is much more interesting in my opinion.

           This is just a first iteration: tests pass, but maybe the heuristic needs to be thought through better. I will also do some benchmarking.

          Yonik Seeley added a comment -

           +1 for sparse support! Sparse fields are a fact of life for a lot of users. Sometimes they can encode the info into a common field, but it's often not straightforward to do so.

          Robert Muir added a comment -

          Here is a patch. It detects sparsity by checking if less than 10% documents have a value.

           Please put a minimum count, like 1024, where this starts to make sense, so types aren't flipping back and forth during ordinary NRT search. And where did 10% come from? It seems far too aggressive. In previous work we determined that much lower thresholds work better (e.g. 3%).

          Robert Muir added a comment -

           I think this needs to be a very low number if it's going to be "automatic", like 3%: pure abuse-case detection, just like how sparse norms worked.

           Otherwise, it needs to be a separate format in codecs/ that the user explicitly opts into.

          Adrien Grand added a comment -

          And where did 10% come from? It seems far too aggressive.

           It was just a start so that I could easily trigger usage of this compression in tests. I agree this is very likely too aggressive; this is why I said the heuristic needs to be thought through better. I will also do some benchmarking; if this ends up being much slower than regular (delta) compression, we might want an even lower threshold.

          Yonik Seeley added a comment -

          It was just a start so that I could easily trigger usage of this compression in tests.

          Adding a small constant always helps with test coverage... (n*0.05 + 5) for example.

          Varun Thacker added a comment -

          Hi Adrien,

           Great idea to use heuristics and apply this automatically. Looking forward to the benchmarks, which I never got to in LUCENE-5688.

           Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicates of this JIRA, or could there still be plans for a specialized doc value format?

          Adrien Grand added a comment -

          Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicate of this Jira or could there still be plans on having a specialized doc value format?

           I think it would be a bit premature to say the other JIRAs are superseded by this one; it's not clear yet whether the approach proposed here is actually a better idea and/or could make it into the default codec. I suggest we wait and see how happy the benchmarks are with this patch first.

          Adrien Grand added a comment -

           I ran some benchmarks with the GeoNames dataset, which has a few sparse fields:

          • cc2: 3.2% of documents have this field, which has 573 unique values
          • admin4: 4.3% of documents have this field, which has 102950 unique values
          • admin3: 10.2% of documents have this field, which has 73120 unique values
          • admin2: 45.3% of documents have this field, which has 30603 unique values

           First I enabled sparse compression on all fields, regardless of density, to see how this compares to the delta compression that we use by default, and then ran two kinds of queries:

          • queries on a random partition of the index, which I guess would be the case when you have true sparse fields
          • a query only on documents that have a value, which I guess would be more realistic if you store several types of data in the same index that don't have the same fields
           Field  | disk usage for ordinals | memory usage with sparse compression enabled | sort perf: MatchAllDocsQuery | sort perf: term query, 10% of docs | sort perf: term query, 1% of docs | sort perf: term query, docs with the field
           cc2    | -88% | 1680 bytes |  -27% |  +25% |  +58% | +208%
           admin4 | -86% |  568 bytes |  -20% |   +7% |  -20% | +214%
           admin3 | -67% | 1312 bytes |  +11% |  +57% |  +42% | +236%
           admin2 | +17% | 2904 bytes | +132% | +275% | +331% | +221%

          The reduction in disk usage is significant, but so is the slowdown, especially when running a query that only matches docs that have a value. However memory usage looks acceptable to me for 10M docs.

           I couldn't test with 3% since even the rarest field occurs in 3.2% of documents, but I updated the heuristic to require at least 1024 docs in the segment (as Robert suggested) and that less than 5% of docs have a value:

           Field  | memory usage due to sparse compression | sort perf: MatchAllDocsQuery | sort perf: term query, 10% of docs | sort perf: term query, 1% of docs | sort perf: term query, docs with the field
           cc2    | 1680 bytes | -10% | +34% | +62% | +214%
           admin4 |  568 bytes |  -7% | +20% | -14% | +241%
           admin3 |  576 bytes |  +9% |  +7% | +11% |  +10%
           admin2 | 1008 bytes |  +1% |  +8% |  +9% |  +11%

           To my surprise, admin2 and admin3 were still using sparse compression on some segments. The reason is that documents with sparse values are not spread uniformly across the dataset but rather clustered. I suspect this partially explains the slowdown for admin2/admin3; maybe hotspot also dislikes having more impls to deal with.

          Adrien Grand added a comment -

           Here is an updated patch that makes the sparse impl a bit more efficient when consumed in sequential order, by keeping track of the upper bound of the current window. This is the version that was used in the benchmark above.

           I also updated the heuristic to require 1024 docs in the segment and that less than 1% of docs have a value, in order to be on the safe side and only slow down abuse/exceptional cases. Even if/when this gets used for some fields, I think the slowdown is acceptable insofar as it would only slow down fast queries: if you look at the above benchmarks, when the query matches many docs (such as a MatchAllDocsQuery) this encoding is actually faster than regular delta encoding. Only queries that match a small partition of the index (so that most dv lookups require a binary search) would become slower.
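
           For reference, the check described above boils down to something like the following sketch (the method name is mine, and the exact condition in the patch may differ in detail):

             // Assumed sketch: use the sparse encoding only for segments with at
             // least 1024 documents where fewer than 1% of documents have a value.
             static boolean useSparseEncoding(int maxDoc, long numDocsWithField) {
               return maxDoc >= 1024 && numDocsWithField * 100L < maxDoc;
             }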

          Opinions?

          David Smiley added a comment -

          Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

           Regarding the performance trade-off in your table, I find it hard to evaluate whether it's worth it. Does +214% mean the whole query, search & top-10 doc retrieval, took over twice as long? Or is this measurement isolated to... somehow just the sort part? How fast was this query anyway? If we're making a 3ms query take 9ms then it wouldn't bother me as much as a 300ms query taking 900ms. Of course it depends on the amount of data.

          Adrien Grand added a comment -

          Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

           I took this approach because I have seen a couple of setups with lots of fields, many of them sparse, and the fact that sparse fields require as many resources as dense ones is a bit frustrating. The issue description focuses on disk usage, but I think memory usage is even more important. Obviously I had to increase memory usage (since numerics don't require any memory at all today), but I wanted to keep it very low. For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache. In contrast, the patch keeps memory usage very low and keeps the access pattern sequential by keeping track of the previous/next documents that have a value and using exponential search (a variant of binary search) to seek forward.
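
           To illustrate the seek-forward idea, here is a sketch of exponential search over an in-memory sorted array of doc IDs (a simplification of my own; the patch reads the doc IDs through the on-disk monotonic structure rather than an array): gallop forward from the current position until the window contains the target, then binary-search inside that window.

             import java.util.Arrays;

             // Illustrative only: return the index of the first entry in the sorted
             // array docIdsWithField that is >= targetDoc, starting at fromIndex so
             // that sequential consumption stays cheap.
             static int advance(long[] docIdsWithField, int fromIndex, long targetDoc) {
               int bound = 1;
               // Gallop: double the window until it contains the target or we hit the end.
               while (fromIndex + bound < docIdsWithField.length
                   && docIdsWithField[fromIndex + bound] < targetDoc) {
                 bound <<= 1;
               }
               int lo = fromIndex + (bound >>> 1);
               int hi = Math.min(fromIndex + bound, docIdsWithField.length - 1) + 1;
               // Binary search inside the window found by galloping.
               int idx = Arrays.binarySearch(docIdsWithField, lo, hi, targetDoc);
               return idx >= 0 ? idx : -idx - 1; // insertion point if not present
             }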

          Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long?

          I computed

          (${new response time} - ${old response time}) / ${old response time}

           so it actually means more than 3x slower. However, this does not include loading stored fields; it is just the time to compute the top document by calling IndexSearcher.search(query, 1, sort).
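
           As a concrete example of the formula: at +214%, a sort that took 6 ms before would take about 6 × (1 + 2.14) ≈ 19 ms with the patch.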

          How fast was this query any way?

           Here are the times it takes to run these queries on trunk (in milliseconds).

           Field  | sort time: MatchAllDocsQuery | sort time: term query, 10% of docs | sort time: term query, 1% of docs | sort time: term query, docs with the field
           cc2    | 128 | 20 | 2 |  6
           admin4 | 122 | 19 | 3 |  7
           admin3 | 116 | 18 | 3 | 17
           admin2 | 121 | 19 | 3 | 78

           Comparing this with the earlier table of relative changes: the queries that we are speeding up are among the slower ones, and those that we are slowing down are among the faster ones.

          David Smiley added a comment -

           For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache.

          If it's sparse (thus small) and the user has chosen to use docValues because it's going to be sorted/faceted/etc (thus it's likely 'hot' i.e. used) then I think it's reasonable to expect it'll be in the filesystem cache?

           Nonetheless what you've done here is good. I think that an in-memory hash would be a nice alternative for those who opt into the trade-off. On a semi-related note, I wonder how well a 4-byte int FST -> long would perform.

           Here are the times it takes to run these queries on trunk (in milliseconds).

          Ah; looks like a fair trade-off to me. Thanks for these details.

          Erick Erickson added a comment -

           Bypassing the question of whether one should supersede the other, but linking them so they can be found more easily.

          Adrien Grand added a comment -

          Updated patch that:

          • makes the code a bit more readable and adds comments
          • avoids loading a slice for values when only docs with field are requested
          • saves some monotonic lookups

           Here are updated benchmark results (still with a threshold of 5% for benchmarking purposes, even though the patch itself keeps the 1% threshold), computed exactly the same way as above. The slowdown is a bit more contained now. Times are in ms.

           Field  | sort perf: MatchAllDocsQuery | sort perf: term query, 10% of docs | sort perf: term query, 1% of docs | sort perf: term query, docs with the field
           cc2    | 128→99 (-23%) | 21.8→23.8 (+9%) | 2.92→4.33 (+48%) | 6.84→13.0 (+90%)
           admin4 | 121→98 (-19%) | 21.4→21.1 (-1%) | 3.65→2.81 (-23%) | 8.36→16.6 (+98%)
           admin3 | 116→125 (+1%) | 20.6→20.0 (-3%) | 3.20→3.24 (+1%)  | 18.9→19.4 (+8%)
           admin2 | 124→132 (+6%) | 21.5→20.6 (-4%) | 3.30→3.49 (+6%)  | 8.58→8.64 (+1%)

          I think the change is good to go, but I know this can be controversial. Please let me know if you have concerns, otherwise I plan to commit it by the end of the week.

          ASF subversion and git services added a comment -

          Commit 1712957 from Adrien Grand in branch 'dev/trunk'
          [ https://svn.apache.org/r1712957 ]

          LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.

          ASF subversion and git services added a comment -

          Commit 1712973 from Adrien Grand in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1712973 ]

          LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.


            People

             • Assignee: Adrien Grand
             • Reporter: Adrien Grand
             • Votes: 1
             • Watchers: 9
