Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      There are two sorted byte[] types with DocValues (BYTES_VAR_SORTED,
      BYTES_FIXED_SORTED), so you can index this type, but you can't yet
      sort by it.

      So I added a FieldComparator just like TermOrdValComparator, except it
      pulls from the doc values instead.

      There are some small diffs, eg with doc values there are never null
      values (see LUCENE-3504).

      1. LUCENE-3518.patch
        36 kB
        Michael McCandless
      2. LUCENE-3518.patch
        42 kB
        Michael McCandless
      3. LUCENE-3518.patch
        46 kB
        Michael McCandless

        Issue Links

          Activity

          Michael McCandless created issue -
          Hide
          Michael McCandless added a comment -

          Patch.

          Show
          Michael McCandless added a comment - Patch.
          Michael McCandless made changes -
          Field Original Value New Value
          Attachment LUCENE-3518.patch [ 12499013 ]
          Hide
          Michael McCandless added a comment -

          New patch: I also added TermValDocValuesComparator, so you can
          sort-by-string when you indexed the non-sorted byte[] doc values types
          (var/fixed X straight/deref). This is slower since it aways does a
          .compareTo on each ByteRef instead of using pre-sorted ords... but for
          some apps this is a good tradeoff. (This mirrors the two sort-by-Term
          field comparators we have for field cache).

          I also cleaned up how we handle segments that had no doc values; this
          isn't necessarily an error condition because it could be the current
          search doesn't hit any docs in this segment. I added methods in
          IndexDocValues to get a Source or SortedSource that always returns the
          "default" value for the type.

          I think this is ready!

          Show
          Michael McCandless added a comment - New patch: I also added TermValDocValuesComparator, so you can sort-by-string when you indexed the non-sorted byte[] doc values types (var/fixed X straight/deref). This is slower since it aways does a .compareTo on each ByteRef instead of using pre-sorted ords... but for some apps this is a good tradeoff. (This mirrors the two sort-by-Term field comparators we have for field cache). I also cleaned up how we handle segments that had no doc values; this isn't necessarily an error condition because it could be the current search doesn't hit any docs in this segment. I added methods in IndexDocValues to get a Source or SortedSource that always returns the "default" value for the type. I think this is ready!
          Michael McCandless made changes -
          Attachment LUCENE-3518.patch [ 12499853 ]
          Hide
          Simon Willnauer added a comment -

          mike, this looks good while I am not sure I understand all the comparators. I didn't look at it for too long but maybe you can add some comments how those readerGen arrays work?

          there is one nocommit left in the patch though. I think you should go ahead though!

          simon

          Show
          Simon Willnauer added a comment - mike, this looks good while I am not sure I understand all the comparators. I didn't look at it for too long but maybe you can add some comments how those readerGen arrays work? there is one nocommit left in the patch though. I think you should go ahead though! simon
          Hide
          Simon Willnauer added a comment -

          mike, I looked at it again and found this api addition. You return a PackedInts.Reader from getDocToOrd() on SortedSource

          /**
           * Returns the PackedInts.Reader impl that maps document to ord.
           */
          public PackedInts.Reader getDocToOrd() {
            return null;
          }
          
          

          this seems to be very specific to in-memory docvalues while we don't support direct source here. Can we maybe extract an interface from PackedInts.Reader & RandomAccessReaderIterator (those are already very similar) so we can simply use the reader interface for both on-disk and in-memory variants. then I think this API addition is fine. I don't like apis which are in-mem / on-disk only.

          Show
          Simon Willnauer added a comment - mike, I looked at it again and found this api addition. You return a PackedInts.Reader from getDocToOrd() on SortedSource /** * Returns the PackedInts.Reader impl that maps document to ord. */ public PackedInts.Reader getDocToOrd() { return null ; } this seems to be very specific to in-memory docvalues while we don't support direct source here. Can we maybe extract an interface from PackedInts.Reader & RandomAccessReaderIterator (those are already very similar) so we can simply use the reader interface for both on-disk and in-memory variants. then I think this API addition is fine. I don't like apis which are in-mem / on-disk only.
          Hide
          Michael McCandless added a comment -

          Thanks Simon, I'll fix the nocommit and add some comments.. the
          ord/BytesRef comparators are tricky...

          this seems to be very specific to in-memory docvalues while we don't support direct source here.

          Hmm, true; though I suspect sorting by a direct source will be rather
          slow. Still I agree we should support it for completeness.

          Can we maybe extract an interface from PackedInts.Reader & RandomAccessReaderIterator (those are already very similar) so we can simply use the reader interface for both on-disk and in-memory variants.

          OK I like your idea of using PackedInts.Reader for on-disk and
          in-memory variants; you shouldn't have to pull an enum from PackedInts
          if you just need the disk-based get.

          I'll open a new issue to do that refactoring first; I think we can
          remove the PackedInts.RandomAccessReaderIterator if we just let the
          Reader.get throw IOE? Lemme try to work up a patch.

          Show
          Michael McCandless added a comment - Thanks Simon, I'll fix the nocommit and add some comments.. the ord/BytesRef comparators are tricky... this seems to be very specific to in-memory docvalues while we don't support direct source here. Hmm, true; though I suspect sorting by a direct source will be rather slow. Still I agree we should support it for completeness. Can we maybe extract an interface from PackedInts.Reader & RandomAccessReaderIterator (those are already very similar) so we can simply use the reader interface for both on-disk and in-memory variants. OK I like your idea of using PackedInts.Reader for on-disk and in-memory variants; you shouldn't have to pull an enum from PackedInts if you just need the disk-based get. I'll open a new issue to do that refactoring first; I think we can remove the PackedInts.RandomAccessReaderIterator if we just let the Reader.get throw IOE? Lemme try to work up a patch.
          Michael McCandless made changes -
          Link This issue is blocked by LUCENE-3524 [ LUCENE-3524 ]
          Hide
          Michael McCandless added a comment -

          Patch, syncing to trunk, and making it possible to sort off a direct source (though I don't think this is that useful in practice!).

          I think it's ready.

          Show
          Michael McCandless added a comment - Patch, syncing to trunk, and making it possible to sort off a direct source (though I don't think this is that useful in practice!). I think it's ready.
          Michael McCandless made changes -
          Attachment LUCENE-3518.patch [ 12502784 ]
          Michael McCandless made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development