• Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
    • Lucene Fields:


      Currently for Strings you have SORTED and SORTED_SET, capable of single and multiple values per document respectively.

      For multi-numerics, there are only a few choices:

      • encode with NumericUtils into byte[]'s and store with SORTED_SET.
      • encode yourself per-document into BINARY.

      Both of these techniques have problems:

      SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or faceting counts: most of the bloat in the "terms dict" is compressed away, and it optimizes the case where the data is actually single-valued. But it falls apart performance-wise if you want to do more complex stuff like Solr's analytics component or Elasticsearch's aggregations: the ordinals just get in your way and cause additional work, since each one has to be dereferenced to a byte[] and then decoded back to a number. Worst of all, any mathematical calculations are off because it discards frequency (deduplicates).
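To make the deduplication problem concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes a simple sortable 8-byte encoding with the sign bit flipped, as a stand-in for what NumericUtils would produce, and uses a TreeSet of byte[] to imitate the set semantics of SORTED_SET. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class SortedSetDedupSketch {
    // Sortable big-endian encoding: flipping the sign bit makes
    // unsigned byte order agree with signed long order.
    static byte[] encode(long v) {
        long bits = v ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    static long decode(byte[] b) {
        long bits = 0;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (b[i] & 0xFF);
        }
        return bits ^ Long.MIN_VALUE;
    }

    // Average computed after a round trip through a set of encoded terms:
    // duplicates collapse, and every value must be decoded from its byte[].
    static double averageViaSet(long[] values) {
        TreeSet<byte[]> terms = new TreeSet<>((a, b) -> Arrays.compareUnsigned(a, b));
        for (long v : values) {
            terms.add(encode(v));
        }
        double sum = 0;
        for (byte[] term : terms) {
            sum += decode(term);
        }
        return sum / terms.size();
    }

    // Average over the original values, duplicates included.
    static double trueAverage(long[] values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum / values.length;
    }
}
```

For a document holding the values 5, 5 and 10, the set-based average is computed over {5, 10} only, so it differs from the average over all three values.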

      Using your own custom encoding in BINARY removes the unnecessary ordinal dereferencing, but you trade that for bad compression and access: you have no real choice but to do something like vInt within each byte[] for the doc, which means even basic sorting (e.g. max) is slow as it's not constant time. There is no chance for the codec to optimize things like dates with GCD compression or optimize the single-valued case because it's just an opaque byte[].
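The access cost can be sketched as follows, again in plain Java with hypothetical names. The encoder packs a document's values as variable-length integers (7 payload bits per byte, high bit as a continuation flag), and finding the max has to decode every value in the byte[] because there is no way to jump to a particular one. This is an illustration of the approach, not the encoding any particular application uses.

```java
import java.io.ByteArrayOutputStream;

final class VIntBinarySketch {
    // Packs all values of one document into a single byte[].
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80)); // more bytes follow
                v >>>= 7;
            }
            out.write((int) v); // last byte, continuation bit clear
        }
        return out.toByteArray();
    }

    // Linear scan: every value is decoded just to find the largest one.
    // Returns Long.MIN_VALUE for an empty document.
    static long max(byte[] doc) {
        long max = Long.MIN_VALUE;
        int pos = 0;
        while (pos < doc.length) {
            long v = 0;
            int shift = 0;
            byte b;
            do {
                b = doc[pos++];
                v |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, v);
        }
        return max;
    }
}
```

The codec only sees the resulting bytes, so it cannot tell how many values a document has or whether they share a common divisor.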

      So I think it would be good to explore a simple long[] type that solves these problems.
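As a rough illustration of what such a type could offer, the sketch below keeps each document's values as a sorted long[] with duplicates preserved. The class and its method names are hypothetical and are not taken from the attached patch. It assumes at least one value per document and deltas that do not overflow a long.

```java
import java.util.Arrays;

final class SortedLongsSketch {
    private final long[] sorted;

    SortedLongsSketch(long... values) {
        sorted = values.clone();
        Arrays.sort(sorted); // duplicates are kept, unlike a set
    }

    int count() { return sorted.length; }

    long valueAt(int index) { return sorted[index]; }

    // Constant time, because the values are stored in sorted order.
    long min() { return sorted[0]; }

    long max() { return sorted[sorted.length - 1]; }

    // Frequencies survive, so the arithmetic is correct.
    double average() {
        double sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        return sum / sorted.length;
    }

    // GCD of the deltas from the minimum: values could then be stored
    // as small multiples of this divisor (useful for e.g. day-granular dates).
    long commonDivisor() {
        long g = 0;
        for (long v : sorted) {
            g = gcd(g, v - sorted[0]);
        }
        return g;
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }
}
```

Because the values are plain numbers rather than opaque bytes, a codec would be free to apply this kind of analysis itself, or to special-case documents with exactly one value.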


        1. LUCENE-5748.patch
          188 kB
          Robert Muir
        2. LUCENE-5748.patch
          105 kB
          Robert Muir



            • Assignee:
              rcmuir Robert Muir
            • Votes:
              2
            • Watchers:
              7


              • Created: