Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently for Strings you have SORTED and SORTED_SET, capable of single and multiple values per document respectively.

      For multi-numerics, there are only a few choices:

      • encode with NumericUtils into byte[]'s and store with SORTED_SET.
      • encode yourself per-document into BINARY.

      Both of these techniques have problems:

      SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or faceting counts: most of the bloat in the "terms dict" is compressed away, and it optimizes the case where the data is actually single-valued. But it falls apart performance-wise if you want to do more complex stuff like Solr's analytics component or Elasticsearch's aggregations: the ordinals just get in your way and cause additional work, dereferencing each to a byte[] and then decoding that back to a number. Worst of all, any mathematical calculations are off because it discards frequency (it deduplicates values).
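To make the frequency problem concrete, here is a minimal sketch in plain Java (no Lucene classes involved). The class and method names are hypothetical, and the `TreeSet` only stands in for the deduplicating behaviour of a set-based encoding; it is not how SORTED_SET is implemented.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class DedupDemo {
  /** Average over every value, duplicates included. */
  static double averageAll(long[] values) {
    return Arrays.stream(values).average().orElse(Double.NaN);
  }

  /** Average over distinct values only, as a deduplicating encoding would see them. */
  static double averageDistinct(long[] values) {
    TreeSet<Long> distinct = new TreeSet<>();
    for (long v : values) {
      distinct.add(v);
    }
    return distinct.stream().mapToLong(Long::longValue).average().orElse(Double.NaN);
  }

  public static void main(String[] args) {
    long[] values = {10, 10, 40};                // one document with a repeated value
    System.out.println(averageAll(values));      // prints 20.0
    System.out.println(averageDistinct(values)); // prints 25.0, the duplicate 10 is lost
  }
}
```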

      Using your own custom encoding in BINARY removes the unnecessary ordinal dereferencing, but you trade that for bad compression and access: you have no real choice but to do something like vInt within each byte[] for the doc, which means even basic sorting (e.g. max) is slow, as it's not constant time. There is no chance for the codec to optimize things like dates with GCD compression, or to optimize the single-valued case, because it's just an opaque byte[].
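As a rough illustration of what GCD compression buys for date-like values, the sketch below stores each value as a small quotient relative to a shared minimum and common divisor. This is an assumption-laden toy, not the codec's actual implementation: the class and method names are hypothetical, and it ignores overflow when subtracting the minimum.

```java
public class GcdDemo {
  /** Greatest common divisor via Euclid's algorithm; gcd(0, x) is x. */
  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return Math.abs(a);
  }

  /** Returns {min, gcd of all (value - min)}; the gcd is 0 if all values are equal. */
  static long[] minAndGcd(long[] values) {
    long min = Long.MAX_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
    }
    long g = 0;
    for (long v : values) {
      g = gcd(g, v - min);
    }
    return new long[] {min, g};
  }

  /** Replaces each value by its quotient (value - min) / gcd, which needs far fewer bits. */
  static long[] encode(long[] values, long min, long g) {
    long[] out = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = (g == 0) ? 0 : (values[i] - min) / g;
    }
    return out;
  }

  /** Restores the original value from its quotient. */
  static long decode(long quotient, long min, long g) {
    return min + quotient * g;
  }

  public static void main(String[] args) {
    long day = 86_400_000L; // milliseconds per day
    long[] dates = {16_000L * day, 16_003L * day, 16_010L * day};
    long[] mg = minAndGcd(dates);
    long[] packed = encode(dates, mg[0], mg[1]);
    System.out.println(java.util.Arrays.toString(packed)); // prints [0, 3, 10]
  }
}
```

A codec can only apply a trick like this when it sees the values as numbers; inside an opaque byte[] the structure is invisible to it.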

      So I think it would be good to explore a simple long[] type that solves these problems.

      1. LUCENE-5748.patch
        188 kB
        Robert Muir
      2. LUCENE-5748.patch
        105 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's a prototype patch: just the default codec impl and simple support for sorting on multiple numeric values.

        I haven't implemented simpletext, direct, memory, etc., or all the other stuff needed yet.

        Here's what I think is a minimal API:

        /**
         * A list of per-document numeric values, sorted 
         * according to {@link Long#compare(long, long)}.
         */
        public abstract class SortedNumericDocValues {
          /** 
           * Positions to the specified document 
           */
          public abstract void setDocument(int doc);
          
          /** 
           * Retrieve the value for the current document at the specified index. 
           * An index ranges from {@code 0} to {@code count()-1}. 
           */
          public abstract long valueAt(int index);
          
          /** 
           * Retrieves the count of values for the current document. 
           * This may be zero if a document has no values.
           */
          public abstract int count();
        }
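
To show how a consumer might use an API of this shape, here is a self-contained sketch. It declares a local copy of the abstract class so that it compiles without Lucene, and backs it with a hypothetical in-memory implementation (`inMemory`); the `max` and `sum` helpers are likewise illustrative and not part of the patch.

```java
import java.util.Arrays;

public class SortedNumericDemo {
  /** Local copy of the API sketched above, so this example compiles on its own. */
  abstract static class SortedNumericDocValues {
    public abstract void setDocument(int doc);
    public abstract long valueAt(int index);
    public abstract int count();
  }

  /** Hypothetical in-memory implementation: one sorted long[] per document. */
  static SortedNumericDocValues inMemory(long[][] perDoc) {
    final long[][] sorted = new long[perDoc.length][];
    for (int i = 0; i < perDoc.length; i++) {
      sorted[i] = perDoc[i].clone();
      Arrays.sort(sorted[i]); // duplicates are kept, only the order changes
    }
    return new SortedNumericDocValues() {
      private long[] current = new long[0];
      @Override public void setDocument(int doc) { current = sorted[doc]; }
      @Override public long valueAt(int index) { return current[index]; }
      @Override public int count() { return current.length; }
    };
  }

  /** Max is simply the last value because values are sorted; the document must have a value. */
  static long max(SortedNumericDocValues dv, int doc) {
    dv.setDocument(doc);
    return dv.valueAt(dv.count() - 1);
  }

  /** Sum over all values of a document, duplicates included; 0 if it has none. */
  static long sum(SortedNumericDocValues dv, int doc) {
    dv.setDocument(doc);
    long total = 0;
    for (int i = 0; i < dv.count(); i++) {
      total += dv.valueAt(i);
    }
    return total;
  }

  public static void main(String[] args) {
    SortedNumericDocValues dv = inMemory(new long[][] {{5, 3, 5}, {}, {7}});
    System.out.println(max(dv, 0)); // prints 5
    System.out.println(sum(dv, 0)); // prints 13
    System.out.println(sum(dv, 1)); // prints 0
  }
}
```

Because the values are sorted and addressable by index, min and max are single lookups, and duplicates survive, so sums and averages stay correct.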
        
        Adrien Grand added a comment -

        +1 I like it!

        Robert Muir added a comment -

        Updated patch with impls for all codecs (4.9, disk, memory, direct, simpletext, etc.), and with docs and tests. I think it's ready.

        Shai Erera added a comment -

        Reviewed the patch, looks good. Maybe add a message to some of the UOEs and IllegalStateExcs? I'm +1 to commit anyway.

        Robert Muir added a comment -

        Those exceptions match the exceptions of the code around them for consistency, i.e. they are consistent with what we do for all the other dv types. Why have a special exception message that is different just for this type?

        Moreover, they are impossible to hit. For example, a norms field just cannot be multi-valued, and SegmentReader checks that the codec is only "asked" for a field if it actually has that type listed in fieldinfos. If something needs to be changed here, can you open a separate issue, since it's unrelated to this patch?

        Shai Erera added a comment -

        As I said, +1 to commit anyway. I like exceptions w/ messages, but don't let it stop you from committing. The code is fine as it is already.

        Michael McCandless added a comment -

        +1

        Adrien Grand added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1602277 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1602277 ]

        LUCENE-5748: Add SORTED_NUMERIC docvalues type

        ASF subversion and git services added a comment -

        Commit 1602286 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1602286 ]

        LUCENE-5748: Add SORTED_NUMERIC docvalues type


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            2
            Watchers:
            7
