Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      For both NUMERIC fields and ordinals of SORTED fields, we store data in a dense way. As a consequence, if you have only 1000 documents out of 1B that have a value, and 8 bits are required to store those 1000 numbers, we will not require 1KB of storage, but 1GB.

      I suspect this mostly happens in abuse cases, but still it's a pity that we explode storage requirements. We could try to detect sparsity and compress accordingly.

      Attachments

        1. LUCENE-6863.patch
          38 kB
          Adrien Grand
        2. LUCENE-6863.patch
          35 kB
          Adrien Grand
        3. LUCENE-6863.patch
          33 kB
          Adrien Grand

          Activity

            jpountz Adrien Grand added a comment -

            Here is a patch. It detects sparsity by checking whether less than 10% of documents have a value. When this happens, it stores the doc IDs that have a value using a DirectMonotonicWriter and the non-missing values as a regular numeric field (so values will still be compressed with table/gcd compression if applicable).

            I first wanted to update the format javadocs but noticed that the low-level javadocs were very outdated (still documenting that sorted doc values were stored in an FST), so I decided to remove the low-level description in favour of the high-level one, which is much more interesting in my opinion.

            This is just a first iteration: tests pass, but the heuristic may need more thought. I will also do some benchmarking.
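
            To make the idea concrete, here is a minimal, self-contained sketch of the sparse layout (hypothetical names; the actual patch encodes the doc IDs with DirectMonotonicWriter rather than a plain array): only documents that have a value get a slot, and a lookup binary-searches the sorted doc IDs.

              import java.util.Arrays;

              /** Sketch of a sparse numeric doc-values reader: one slot per document
               *  that has a value, instead of one slot per document in the segment. */
              final class SparseNumericDocValues {
                private final int[] docsWithField; // sorted doc IDs that have a value
                private final long[] values;       // values[i] belongs to docsWithField[i]

                SparseNumericDocValues(int[] docsWithField, long[] values) {
                  this.docsWithField = docsWithField;
                  this.values = values;
                }

                /** Returns the value for docID, or 0 if the document has no value. */
                long get(int docID) {
                  int idx = Arrays.binarySearch(docsWithField, docID);
                  return idx >= 0 ? values[idx] : 0L;
                }
              }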

            yseeley@gmail.com Yonik Seeley added a comment -

            +1 for sparse support! Sparse fields are a fact of life for a lot of users. Sometimes they can encode the info into a common field, but it's often not straightforward to do so.

            rcmuir Robert Muir added a comment -

            Here is a patch. It detects sparsity by checking whether less than 10% of documents have a value.

            Please put a minimum count like 1024 where this starts to make sense, so types aren't flipping back and forth during ordinary NRT search. And where did 10% come from? It seems far too aggressive. In previous work we determined much lower thresholds to work better (e.g. 3%).

            rcmuir Robert Muir added a comment -

            I think this needs to be a very low number if it's going to be "automatic": like 3%, pure abuse-case detection, just like how sparse norms worked.

            Otherwise, it needs to be a separate format in codecs/ that is explicit opt-in by the user.

            jpountz Adrien Grand added a comment -

            And where did 10% come from? It seems far too aggressive.

            It was just a start so that I could easily trigger usage of this compression in tests. I agree it is very likely too aggressive; this is why I said the heuristic needs more thought. I will also do some benchmarking; if this ends up being much slower than regular (delta) compression, we might want an even lower threshold.

            yseeley@gmail.com Yonik Seeley added a comment -

            It was just a start so that I could easily trigger usage of this compression in tests.

            Adding a small constant always helps with test coverage... (n*0.05 + 5) for example.

            varun Varun Thacker added a comment -

            Hi Adrien,

            Great idea to use heuristics and apply this automatically. Looking forward to the benchmarks, which I never got to in LUCENE-5688.

            Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicates of this Jira, or could there still be plans for a specialized doc values format?

            jpountz Adrien Grand added a comment -

            Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicates of this Jira, or could there still be plans for a specialized doc values format?

            I think it would be a bit premature to say the other Jiras are superseded by this one; it's not clear yet whether the approach proposed here is actually a better idea and/or could make it into the default codec. I suggest that we wait to see how happy benchmarks are with this patch first.

            jpountz Adrien Grand added a comment -

            I ran some benchmarks with the geonames dataset, which has a few sparse fields:

            • cc2: 3.2% of documents have this field, which has 573 unique values
            • admin4: 4.3% of documents have this field, which has 102950 unique values
            • admin3: 10.2% of documents have this field, which has 73120 unique values
            • admin2: 45.3% of documents have this field, which has 30603 unique values

            First I enabled sparse compression on all fields, regardless of density, to see how it compares to the delta compression that we use by default, and then ran two kinds of queries:

            • queries on a random partition of the index, which I guess would be the case when you have true sparse fields
            • a query only on documents that have a value, which I guess would be more realistic if you store several types of data in the same index that don't have the same fields
            Field  | disk usage for ordinals | memory usage with sparse compression enabled | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | -88% | 1680 bytes | -27%  | +25%  | +58%  | +208%
            admin4 | -86% | 568 bytes  | -20%  | +7%   | -20%  | +214%
            admin3 | -67% | 1312 bytes | +11%  | +57%  | +42%  | +236%
            admin2 | +17% | 2904 bytes | +132% | +275% | +331% | +221%

            The reduction in disk usage is significant, but so is the slowdown, especially when running a query that only matches docs that have a value. However, memory usage looks acceptable to me for 10M docs.

            I couldn't test with 3% since even the rarest field is present in 3.2% of documents, but I updated the heuristic to require at least 1024 docs in the segment (like Robert suggested) and that less than 5% of docs have a value:

            Field  | memory usage due to sparse compression | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | 1680 bytes | -10% | +34% | +62% | +214%
            admin4 | 568 bytes  | -7%  | +20% | -14% | +241%
            admin3 | 576 bytes  | +9%  | +7%  | +11% | +10%
            admin2 | 1008 bytes | +1%  | +8%  | +9%  | +11%

            To my surprise, admin2 and admin3 were still using sparse compression on some segments. The reason is that documents with sparse values are not spread uniformly through the dataset but rather clustered. I suspect this partially explains the slowdown for admin2/admin3; maybe HotSpot also dislikes having more implementations to deal with.

            jpountz Adrien Grand added a comment -

            Here is an updated patch that makes the sparse impl a bit more efficient when consumed in sequential order, by keeping track of the upper bound of the current window. This is what was used in the above benchmark.

            I also updated the heuristic to require 1024 docs in the segment and that less than 1% of docs have a value, in order to be on the safe side and only slow down abuse/exceptional cases. Even if/when this gets used for some fields, I think the slowdown is acceptable insofar as it would only slow down fast queries: if you look at the above benchmarks, when the query matches many docs (such as a MatchAllDocsQuery) this encoding is actually faster than regular delta encoding. Only queries that match a small partition of the index (so that most dv lookups require a binary search) would become slower.
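
            For reference, the heuristic described above boils down to a check along these lines (a sketch with hypothetical names, not the patch's actual code):

              final class SparseHeuristic {
                /** Use the sparse layout only for segments large enough that the decision
                 *  is stable across NRT flushes, and only for genuinely sparse fields. */
                static boolean useSparseCompression(int maxDoc, int docsWithValue) {
                  return maxDoc >= 1024 && (long) docsWithValue * 100 < maxDoc; // < 1% of docs
                }
              }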

            Opinions?

            dsmiley David Smiley added a comment -

            Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

            Regarding the performance trade-off in your table, I find it hard to evaluate whether it's worth it. Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long? Or is this measurement isolated to... somehow just the sort part? How fast was this query anyway? If we're making a 3ms query take 9ms then it wouldn't bother me so much as a 300ms query taking 900ms. Of course it depends on the amount of data.

            jpountz Adrien Grand added a comment -

            Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.

            I took this approach because I have seen a couple of setups with lots of fields, many of them sparse, and the fact that sparse fields require as many resources as dense ones is a bit frustrating. The issue description focuses on disk usage, but I think memory usage is even more important. Obviously I had to increase memory usage (since numerics don't require any memory at all today), but I wanted to keep it very low. For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache. In contrast, the patch keeps memory usage very low and keeps the access pattern sequential by keeping track of the previous/next documents that have a value and using exponential search (a variant of binary search, sketched at the end of this comment) to seek forward.

            Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long?

            I computed

            (${new response time} - ${old response time}) / ${old response time}

            so it means more than 3x slower, actually. However, this does not include loading stored fields, just computing the top document by calling IndexSearcher.search(query, 1, sort).

            How fast was this query anyway?

            Here are the times it takes to run these queries on trunk (in milliseconds).

            Field  | sort time on a MatchAllDocsQuery | sort time on a term query that matches 10% of docs | sort time on a term query that matches 1% of docs | sort time on a term query that matches docs that have the field
            cc2    | 128 | 20 | 2 | 6
            admin4 | 122 | 19 | 3 | 7
            admin3 | 116 | 18 | 3 | 17
            admin2 | 121 | 19 | 3 | 78

            I've put the response times that we are making significantly slower in red and those that we are making faster in green. So as you can see, the queries that we are speeding up are among the slower ones, and those that we are slowing down are among the faster ones.
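
            The forward-seeking exponential search mentioned above could look roughly like this (a self-contained sketch with hypothetical names, not the patch's actual code): starting from the position of the previous lookup, the window grows exponentially until it passes the target, then a binary search runs inside that window, so sequential consumers do very little work per document.

              import java.util.Arrays;

              /** Sketch of a forward-only cursor over the sorted doc IDs that have a value. */
              final class SparseCursor {
                private final int[] docsWithField; // sorted doc IDs that have a value
                private int index = 0;             // position of the previous lookup

                SparseCursor(int[] docsWithField) {
                  this.docsWithField = docsWithField;
                }

                /** Index of the first entry >= target, assuming targets never decrease. */
                int advance(int target) {
                  int bound = 1;
                  // Grow the window exponentially until it reaches past the target.
                  while (index + bound < docsWithField.length
                      && docsWithField[index + bound] < target) {
                    bound <<= 1;
                  }
                  // Binary-search within the window that must contain the target.
                  int lo = index + (bound >> 1);
                  int hi = Math.min(docsWithField.length, index + bound + 1);
                  int found = Arrays.binarySearch(docsWithField, lo, hi, target);
                  index = found >= 0 ? found : -found - 1; // recover insertion point
                  return index;
                }
              }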

            dsmiley David Smiley added a comment -

            For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache.

            If it's sparse (thus small) and the user has chosen to use docValues because it's going to be sorted/faceted/etc (thus it's likely 'hot' i.e. used) then I think it's reasonable to expect it'll be in the filesystem cache?

            Nonetheless what you've done here is good. I think that an in-memory hash would be a nice alternative for those that elect for the trade-off. On a semi-related note, I wonder how well a 4-byte int FST -> long would perform.

            Here are the times it takes to run these queries on trunk (in milliseconds).

            Ah; looks like a fair trade-off to me. Thanks for these details.

            erickerickson Erick Erickson added a comment -

            Bypassing the question of whether one should supersede the other, but linking them so they can be found more easily.

            jpountz Adrien Grand added a comment -

            Updated patch that:

            • makes the code a bit more readable and adds comments
            • avoids loading a slice for values when only docs with field are requested
            • saves some monotonic lookups

            Here is an updated result of the benchmark (still with a threshold of 5% for benchmarking purposes, even though the patch still has a threshold of 1%), computed exactly the same way as above. It makes the slowdown a bit more contained. Times are in ms.

            Field  | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field
            cc2    | 128→99 (-23%)  | 21.8→23.8 (+9%) | 2.92→4.33 (+48%) | 6.84→13.0 (+90%)
            admin4 | 121→98 (-19%)  | 21.4→21.1 (-1%) | 3.65→2.81 (-23%) | 8.36→16.6 (+98%)
            admin3 | 116→125 (+1%)  | 20.6→20.0 (-3%) | 3.20→3.24 (+1%)  | 18.9→19.4 (+8%)
            admin2 | 124→132 (+6%)  | 21.5→20.6 (-4%) | 3.30→3.49 (+6%)  | 8.58→8.64 (+1%)

            I think the change is good to go, but I know this can be controversial. Please let me know if you have concerns, otherwise I plan to commit it by the end of the week.


            jira-bot ASF subversion and git services added a comment -

            Commit 1712957 from jpountz in branch 'dev/trunk'
            [ https://svn.apache.org/r1712957 ]

            LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.


            jira-bot ASF subversion and git services added a comment -

            Commit 1712973 from jpountz in branch 'dev/branches/branch_5x'
            [ https://svn.apache.org/r1712973 ]

            LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.

            Kayne Kayne Wu added a comment -

            Why is this improvement not committed in Lucene 7.x and 8.x?
            Is there some negative impact for Lucene?

            dsmiley David Smiley added a comment -

            It is in every version of Lucene since 5.4. I suspect what gave rise to this confusion is that the "Fix Version" has two versions. If it only had 5.4, then I think you wouldn't have asked? FWIW, I always use one version when resolving my issues since I find doing otherwise confusing and redundant.

            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #7921.


            People

              Assignee: jpountz Adrien Grand
              Reporter: jpountz Adrien Grand
              Votes: 1
              Watchers: 10
