Description
For both NUMERIC fields and ordinals of SORTED fields, we store data in a dense way. As a consequence, if you have only 1000 documents out of 1B that have a value, and 8 bits are required to store those 1000 numbers, we will not require 1KB of storage, but 1GB.
I suspect this mostly happens in abuse cases, but still it's a pity that we explode storage requirements. We could try to detect sparsity and compress accordingly.
Attachments
Attachments
- LUCENE-6863.patch
- 38 kB
- Adrien Grand
- LUCENE-6863.patch
- 35 kB
- Adrien Grand
- LUCENE-6863.patch
- 33 kB
- Adrien Grand
Issue Links
- supercedes
-
LUCENE-5688 NumericDocValues fields with sparse data can be compressed better
- Resolved
-
LUCENE-4921 Create a DocValuesFormat for sparse doc values
- Closed
Activity
+1 for sparse support! Sparse fields are a fact of life for a lot of users. Sometimes they can encode the info into a common field, but it's often not straight-forward to do so.
Here is a patch. It detects sparsity by checking if less than 10% documents have a value.
Please, put a minimum count like 1024 where this starts to make sense, so types arent flipping back and forth during ordinary NRT search. And where did 10% come from? It seems far too aggressive. In previous work we determined much lower thresholds to work better (e.g. 3%).
I think this needs to be a very low number, if its going to be "automatic", like 3%, pure abuse case detection, just like how sparse norms worked.
Otherwise, it needs to be a separate format in codecs/ that is explicit opt-in by the user.
And where did 10% come from? It seems far too aggressive.
It was just a start so that I could easily trigger usage of this compression in tests. I agree this is very likely too aggressive, this is why I said the heuristic needs to be better thought. I will also do some benchmarking, if this ends up being much slower than regular (delta) compression, we might want to have an even lower threshold.
It was just a start so that I could easily trigger usage of this compression in tests.
Adding a small constant always helps with test coverage... (n*0.05 + 5) for example.
Hi Adrien,
Great idea to use heuristics and use it automatically. Looking forward to the benchmarks which I never got to in LUCENE-5688
Is it okay if we mark LUCENE-5688 and LUCENE-4921 as duplicate of this Jira or could there still be plans on having a specialized doc value format?
Is it okay if we mark
LUCENE-5688andLUCENE-4921as duplicate of this Jira or could there still be plans on having a specialized doc value format?
I think it would be a bit premature to say the other jiras are superseded by this one, it's not clear yet whether the proposed approach here is actually a better idea and/or could make it to the default codec. I suggest that we wait to see how happy benchmarks are with this patch first.
I ran some benchmarks with the geoname dataset which has a few sparse fields:
- cc2: 3.2% of documents have this field, which has 573 unique values
- admin4: 4.3% of documents have this field, which has 102950 unique values
- admin3: 10.2% of documents have this field, which has 73120 unique values
- admin2: 45.3% of documents have this field, which has 30603 unique values
First I enabled sparse compression on all fields, regardless of density to see how this compares to the delta compression that we use by default, and then ran two kinds of queries:
- queries on a random partition of the index, which I guess would be the case when you have true sparse fields
- a query only on documents that have a value, which I guess would be more realistic if you store several types of data in the same index that don't have the same fields
Field | disk usage for ordinals | memory usage with sparse compression enabled | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field |
---|---|---|---|---|---|---|
cc2 | -88% | 1680 bytes | -27% | +25% | +58% | +208% |
admin4 | -86% | 568 bytes | -20% | +7% | -20% | +214% |
admin3 | -67% | 1312 bytes | +11% | +57% | +42% | +236% |
admin2 | +17% | 2904 bytes | +132% | +275% | +331% | +221% |
The reduction in disk usage is significant, but so is the slowdown, especially when running a query that only matches docs that have a value. However memory usage looks acceptable to me for 10M docs.
I couldn't test with 3% as even the rarest field is contained by 3.2% of documents, but I updated the heuristic to require at least 1024 docs in the segment (like Robert suggested) and that less than 5% of docs have a value:
Field | memory usage due to sparse compression | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field |
---|---|---|---|---|---|
cc2 | 1680 bytes | -10% | +34% | +62% | +214% |
admin4 | 568 bytes | -7% | +20% | -14% | +241% |
admin3 | 576 bytes | +9% | +7% | +11% | +10% |
admin2 | 1008 bytes | +1% | +8% | +9% | +11% |
To my surprise, admin2 and admin3 were still using sparse compression on some segments. The reason is that documents with sparse values are not uniform in the dataset but rather clustered: I suspect this partially explains of the slowdown for admin2/admin3, maybe there is also hotspot not liking having more impls to deal with.
Here is an updated patch that makes the sparse impl a bit more efficient when consumed in sequential order by keeping track of the upper bound of the current window. This is what has been used in the above benchmark.
I also updated the heuristic to require 1024 docs in the segment and that less than 1% of docs have a value in order to be on the safe side and to only slow down abuse/exceptionnal cases. Even if/when this gets used for some fields, I think the slowdown is acceptable insofar as it would only slow down fast queries: if you look at the above benchmarks, when the query matches many docs (such as a MatchAllDocsQuery) this encoding is actually faster than regular delta encoding. Only queries that match a small partition of the index (so that most dv lookups will require a binary search) would become slower.
Opinions?
Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688? I just read the comments there and it seems promising for very sparse data.
Regarding the performance trade-off in your table – I find it hard to evaluate to consider if it's worth it. Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long? Or is this measurement isolated to... somehow just the sort part? How fast was this query any way? If we're making a 3ms query take 9ms then it wouldn't bother me so much as a 300ms query taking 900ms. Of course it depends on the amount of data.
Did you consider a hash lookup instead of binary-search, as was done in
LUCENE-5688? I just read the comments there and it seems promising for very sparse data.
I took this approach because I saw a couple of setups that had lots of fields, many of those being sparse and the fact that sparse fields require as much resources as the dense ones is a bit frustrating. The issue description focuses on disk usage, but I think memory usage is even more important. Obviously I had to increase memory usage (since numerics don't require any memory at all today) but I wanted to keep it very low. For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 byte per document. With a hastable, you would either put data into memory and you could hardly avoid using one or more bytes per documents, or on disk but then the random-access pattern could be a problem if not everything fits into your filesystem cache. In contrast, the patch keeps memory usage very low and keeps the access pattern sequential by keeping track of the previous/next documents that have a value and using exponential search (a variant of binary search) to seek forward.
Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long?
I computed
(${new response time} - ${old response time}) / ${old response time}
so it means more than 3x slower actually. However this does not include loading stored fields, just computing the top document by calling IndexSearcher.search(query, 1, sort).
How fast was this query any way?
Here are the times it take to run these queries on trunk (in milliseconds).
Field | sort time on a MatchAllDocsQuery | sort time on a term query that matches 10% of docs | sort time on a term query that matches 1% of docs | sort time on a term query that matches docs that have the field |
---|---|---|---|---|
cc2 | 128 | 20 | 2 | 6 |
admin4 | 122 | 19 | 3 | 7 |
admin3 | 116 | 18 | 3 | 17 |
admin2 | 121 | 19 | 3 | 78 |
I've put the response times that we are making significantly slower in red and those that we are making faster in green. So as you can see, the queries that we are speeding up are among the slower ones, and those that we are slowing down are among the faster ones.
For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 byte per document. With a hastable, you would either put data into memory and you could hardly avoid using one or more bytes per documents, or on disk but then the random-access pattern could be a problem if not everything fits into your filesystem cache.
If it's sparse (thus small) and the user has chosen to use docValues because it's going to be sorted/faceted/etc (thus it's likely 'hot' i.e. used) then I think it's reasonable to expect it'll be in the filesystem cache?
Nonetheless what you've done here is good. I think that an in-memory hash would be a nice alternative for those that elect for the trade-off. On a semi-related note, I wonder how well a 4-byte int FST -> long would perform.
Here are the times it take to run these queries on trunk (in milliseconds).
Ah; looks like a fair trade-off to me. Thanks for these details.
Bypassing whether one should supercede the other, but linking so they can be found more easily
Updated patch that:
- makes the code a bit more readable and adds comments
- avoids loading a slice for values when only docs with field are requested
- saves some monotonic lookups
Here is an updated result of the benchmark (still with a threshold of 5% for benchmarking purposes, even though the patch still has a threshold of 1%), computed exactly the same way as above. It makes the slowdown a bit more contained. Times are in ms.
Field | sort performance on a MatchAllDocsQuery | sort performance on a term query that matches 10% of docs | sort performance on a term query that matches 1% of docs | sort performance on a term query that matches docs that have the field |
---|---|---|---|---|
cc2 | 128→99 (-23%) | 21.8→23.8 (+9%) | 2.92→4.33 (+48%) | 6.84→13.0 (+90%) |
admin4 | 121→98 (-19%) | 21.4→21.1 (-1%) | 3.65→2.81 (-23%) | 8.36→16.6 (+98%) |
admin3 | 116→125 (+1%) | 20.6→20.0 (-3%) | 3.20→3.24 (+1%) | 18.9→19.4 (+8%) |
admin2 | 124→132 (+6%) | 21.5→20.6 (-4%) | 3.30→3.49 (+6%) | 8.58→8.64 (+1%) |
I think the change is good to go, but I know this can be controversial. Please let me know if you have concerns, otherwise I plan to commit it by the end of the week.
Commit 1712957 from jpountz in branch 'dev/trunk'
[ https://svn.apache.org/r1712957 ]
LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.
Commit 1712973 from jpountz in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1712973 ]
LUCENE-6863: Optimized storage requirements of doc values fields when less than 1% of documents have a value.
why this Improvement is not commit in lucene 7.x and 8.x?
Is there some negative impact for lucene?
It is in every version of Lucene since 5.4. I suspect what gave rise to this confusion is that the "Fix Version" has two versions. If it only had 5.4 then I think you wouldn't of asked? FWIW I always use one version when resolving my issues since I find doing otherwise to be confusing and redundant.
Here is a patch. It detects sparsity by checking if less than 10% documents have a value. When this happens, it stores doc IDs that have a value using a DirectMonotonicWriter and non-missing values as a regular numeric field (so values will still be compressed with table/gcd compression if applicable).
I first wanted to update the format javadocs but noticed that the low-level javadocs were very outdated (still documenting that sorted doc values were stored in a FST) so I decided to remove the low-level description in favour of the high-level one, which is much more interesting in my opinion.
This is just a first iteration, tests pass but maybe the heuristic needs to be better thought. I will also do some benchmarking.