Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7460

Should SortedNumericDocValues expose a per-document random-access API?

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Sorted numerics used to expose a per-document random-access API so that accessing the median or max element would be cheap. The new SortedNumericDocValues still exposes the number of values a document has, but the only way to read values is to use

      {nextValue}

      , which forces to read all values in order to read the max value.

      For instance, SortedNumericSelector.MAX does the following in master (the important part is the for-loop):

          private void setValue() throws IOException {
            int count = in.docValueCount();
            for(int i=0;i<count;i++) {
              value = in.nextValue();
            }
          }
      
          @Override
          public int nextDoc() throws IOException {
            int docID = in.nextDoc();
            if (docID != NO_MORE_DOCS) {
              setValue();
            }
            return docID;
          }
      

      while it used to simply look up the value at index count-1 in 6.x:

          @Override
          public long get(int docID) {
            in.setDocument(docID);
            final int count = in.count();
            if (count == 0) {
              return 0; // missing
            } else {
              return in.valueAt(count-1);
            }
          }
      

      This could be a conscious decision since a sequential API gives more opportunities to the codec to compress efficiently, but on the other hand this API prevents sorting by max or median values to be efficient.

      On my end I have a preference for the random-access API.

        Issue Links

          Activity

          Hide
          dsmiley David Smiley added a comment -

          +1 to random access for within-document.

          Show
          dsmiley David Smiley added a comment - +1 to random access for within-document.
          Hide
          mikemccand Michael McCandless added a comment -

          I'm not convinced this is a good idea.

          E.g. Lucene's postings don't provide random access to each term occurrence within one document: you have to iterate them if you want to see them.

          And I think it's somewhat abusive to store such a huge number of numeric values in a single document that this API change would matter. For apps that really need a fast way to compute the min and max, they can index the min and max themselves?

          It's important we keep our search time APIs in check.

          Show
          mikemccand Michael McCandless added a comment - I'm not convinced this is a good idea. E.g. Lucene's postings don't provide random access to each term occurrence within one document: you have to iterate them if you want to see them. And I think it's somewhat abusive to store such a huge number of numeric values in a single document that this API change would matter. For apps that really need a fast way to compute the min and max, they can index the min and max themselves? It's important we keep our search time APIs in check.
          Hide
          jpountz Adrien Grand added a comment -

          Sorted numerics make it a bit hard to reason about to me since I am not very clear about the use-cases, but I guess that in some cases one would want to use the minimum value when sorting in ascending order and the max value when sorting in descending order, so having fast access to the maximum value too feels like an important feature. Of course users can index the min/max values directly but I think there is also some value in flexibility, eg. we do not require users to index edge n-grams to run prefix queries.

          That said I do not feel too strongly about it and mostly wanted to give some visibility to this change of our doc values API and discuss it. If you feel strongly about keeping the iterator API, I'm good with it.

          Show
          jpountz Adrien Grand added a comment - Sorted numerics make it a bit hard to reason about to me since I am not very clear about the use-cases, but I guess that in some cases one would want to use the minimum value when sorting in ascending order and the max value when sorting in descending order, so having fast access to the maximum value too feels like an important feature. Of course users can index the min/max values directly but I think there is also some value in flexibility, eg. we do not require users to index edge n-grams to run prefix queries. That said I do not feel too strongly about it and mostly wanted to give some visibility to this change of our doc values API and discuss it. If you feel strongly about keeping the iterator API, I'm good with it.

            People

            • Assignee:
              Unassigned
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development