[LUCENE-5748] SORTED_NUMERIC dv type

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.9, 6.0
Component/s: None
Labels: None

Lucene Fields: New

Description

Currently for Strings you have SORTED and SORTED_SET, capable of single and multiple values per document respectively.

For multi-numerics, there are only a few choices:

- Encode with NumericUtils into byte[]'s and store with SORTED_SET.
- Encode yourself per-document into BINARY.

Both of these techniques have problems:

SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or faceting counts: most of the bloat in the "terms dict" is compressed away, and it optimizes the case where the data is actually single-valued. But it falls apart performance-wise if you want to do more complex stuff like Solr's analytics component or Elasticsearch's aggregations: the ordinals just get in your way and cause additional work, dereferencing each one to a byte[] and then decoding that back to a number. Worst of all, any mathematical calculations are off because it discards frequency (deduplicates).
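To make the deduplication problem concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: a `TreeSet` stands in for the set semantics of SORTED_SET, and the class and method names (`DedupDemo`, `meanOfAll`, `meanOfDistinct`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class DedupDemo {
    /** Mean over every value of the document, duplicates included. */
    static double meanOfAll(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Mean over distinct values only, which is all a set-based encoding retains. */
    static double meanOfDistinct(long[] values) {
        TreeSet<Long> distinct = new TreeSet<>();
        for (long v : values) {
            distinct.add(v); // a repeated value is silently dropped here
        }
        return distinct.stream().mapToLong(Long::longValue).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        long[] docValues = {5, 5, 20};                 // one document, three values
        System.out.println(meanOfAll(docValues));      // prints 10.0
        System.out.println(meanOfDistinct(docValues)); // prints 12.5
    }
}
```

Sorting by min or max is unaffected by the lost duplicate, which is why the problem only shows up once you compute sums, averages, or other statistics.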

Using your own custom encoding in BINARY removes the unnecessary ordinal dereferencing, but you trade that for bad compression and slow access: you have no real choice but to do something like vInt within each byte[] for the doc, which means even basic sorting (e.g. max) is slow because it's not constant time. There is no chance for the codec to optimize things like dates with GCD compression, or to optimize the single-valued case, because it's just an opaque byte[].
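The access cost can be sketched as follows. This is an illustrative variable-length encoding written from scratch (7 payload bits per byte, high bit as a continuation flag), not the encoding of any particular Lucene class, and `VLongBinaryDemo` is a hypothetical name. Because each value occupies a different number of bytes, finding the max means decoding the whole array, even if the values were written in sorted order.

```java
import java.io.ByteArrayOutputStream;

final class VLongBinaryDemo {
    /** Packs all values of one document into a single opaque byte[]. */
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            // Emit 7 bits at a time; the high bit marks "more bytes follow".
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.write((int) v);
        }
        return out.toByteArray();
    }

    /** Finds the max by decoding every value: linear in the byte length, not O(1). */
    static long max(byte[] encoded) {
        long max = Long.MIN_VALUE; // returned as-is for an empty document
        int pos = 0;
        while (pos < encoded.length) {
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = encoded[pos++];
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] doc = encode(new long[] {3, 300, 70000});
        System.out.println(doc.length); // prints 6 (1 + 2 + 3 bytes)
        System.out.println(max(doc));   // prints 70000
    }
}
```

From the codec's point of view the result is just six bytes, so it has no way to see that the document holds numbers, let alone apply numeric compression to them.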

So I think it would be good to explore a simple long[] type that solves these problems.
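As a rough picture of what such a type offers, the sketch below models one document's values as a sorted long[] that keeps duplicates. It is plain Java under assumed names (`SortedLongsDemo`, `valueAt`, `count`) and is not the API added by the patch; it also assumes the document has at least one value.

```java
import java.util.Arrays;

final class SortedLongsDemo {
    private final long[] values; // sorted ascending, duplicates kept

    SortedLongsDemo(long... docValues) {
        values = docValues.clone();
        Arrays.sort(values);
    }

    int count() {
        return values.length;
    }

    long valueAt(int index) {
        return values[index];
    }

    /** Constant time: the smallest value is always first. */
    long min() {
        return values[0];
    }

    /** Constant time: the largest value is always last. */
    long max() {
        return values[values.length - 1];
    }

    /** Correct even with repeated values, because nothing was deduplicated. */
    long sum() {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        SortedLongsDemo doc = new SortedLongsDemo(5, 20, 5);
        System.out.println(doc.min()); // prints 5
        System.out.println(doc.max()); // prints 20
        System.out.println(doc.sum()); // prints 30
    }
}
```

Since the values stay numeric rather than being hidden in a byte[], a codec is also free to compress them as numbers and to special-case documents with a single value.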

Attachments

LUCENE-5748.patch (188 kB), Robert Muir, 12/Jun/14 03:34
LUCENE-5748.patch (105 kB), Robert Muir, 09/Jun/14 22:08

People

Assignee: Unassigned

Reporter: Robert Muir

Votes: 2

Watchers: 7

Dates

Created: 09/Jun/14 22:06

Updated: 28/Aug/22 14:09

Resolved: 12/Jun/14 20:42