Lucene - Core
  1. Lucene - Core
  2. LUCENE-1478

Missing possibility to supply custom FieldParser when sorting search results

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective.

      ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField.

      I propose a change in the sort classes:
      Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor

      SortField(String field, int type, Object parser, boolean reverse)

      The parser is "object" because all current parsers have no super-interface. The ideal solution would be to have:

      SortField(String field, int type, FieldCache.Parser parser, boolean reverse)

      and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.getXXXX without parser.

      1. LUCENE-1478.patch
        37 kB
        Michael McCandless
      2. LUCENE-1478.patch
        34 kB
        Michael McCandless
      3. LUCENE-1478.patch
        32 kB
        Uwe Schindler
      4. LUCENE-1478.patch
        31 kB
        Uwe Schindler
      5. LUCENE-1478.patch
        31 kB
        Uwe Schindler
      6. LUCENE-1478-cleanup.patch
        3 kB
        Uwe Schindler
      7. LUCENE-1478-no-superinterface.patch
        13 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Yonik Seeley added a comment -

          Apologies, I meant to post in LUCENE-1483

          Show
          Yonik Seeley added a comment - Apologies, I meant to post in LUCENE-1483
          Hide
          Uwe Schindler added a comment -

          Ah, now I understand (by the way, I do not understand Solrs ExternalFileField in complete, too):

          With my workaround, I did not want to work around the problem of a changed external file. I propsed, not to implement a own FieldCache for the whole thing and just use the new SortField's parser to fill the standard FloatCache during searching/filtering. Just use the external (static) file and convert the the values from the index using the HashMap in ExternalFileField to the floats. This parser is a singleton for each ExternalFileField.

          But I think we should wait, until Yonik explains us the problem more detail.

          Show
          Uwe Schindler added a comment - Ah, now I understand (by the way, I do not understand Solrs ExternalFileField in complete, too): With my workaround, I did not want to work around the problem of a changed external file. I propsed, not to implement a own FieldCache for the whole thing and just use the new SortField's parser to fill the standard FloatCache during searching/filtering. Just use the external (static) file and convert the the values from the index using the HashMap in ExternalFileField to the floats. This parser is a singleton for each ExternalFileField. But I think we should wait, until Yonik explains us the problem more detail.
          Hide
          Michael McCandless added a comment -

          the parser should be a singleton for all field caches or have hashCode()/equals().

          I guess I don't understand your proposed workaround then – I thought
          you were explicitly proposing not using a singleton, so that you
          could force re-parsing of all values (even for old segments) when a
          new external file was being used.

          (Also likely my lack of understanding of Solr's ExternalFileField is
          adding to my confusion here – I haven't yet looked at it.)

          The Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no longer used readers/parsers are GC'ed.

          The WeakHashMap is keyed by the IndexReader, and then there's a normal
          HashMap (innerCache) keyed on field,Parser. So it's only when the
          IndexReader (SegmentReader) is no longer referenced elsewhere that the
          innerCache, in entirety, can be GCd. There's no reclaiming of stale
          entries in the innerCache.

          Show
          Michael McCandless added a comment - the parser should be a singleton for all field caches or have hashCode()/equals(). I guess I don't understand your proposed workaround then – I thought you were explicitly proposing not using a singleton, so that you could force re-parsing of all values (even for old segments) when a new external file was being used. (Also likely my lack of understanding of Solr's ExternalFileField is adding to my confusion here – I haven't yet looked at it.) The Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no longer used readers/parsers are GC'ed. The WeakHashMap is keyed by the IndexReader, and then there's a normal HashMap (innerCache) keyed on field,Parser. So it's only when the IndexReader (SegmentReader) is no longer referenced elsewhere that the innerCache, in entirety, can be GCd. There's no reclaiming of stale entries in the innerCache.
          Hide
          Uwe Schindler added a comment -

          By the way: The Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no longer used readers/parsers are GC'ed.

          Show
          Uwe Schindler added a comment - By the way: The Cache of FieldCache instances in FieldCacheImpl is a WeakHashMap, so unused FieldCache instances for no longer used readers/parsers are GC'ed.
          Hide
          Uwe Schindler added a comment -

          Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries).

          As noted before on Dec 9, 08, the parser should be a singleton for all field caches or have hashCode()/equals(). You create a FloatParser and supply it to SortField. When then sorting against this field, the new sort implementation would create a FieldCache for each segment. When one segment is reloaded, the FieldCache gets unused and a new one for the replacement segment is created (this is the same like with the standard field parser). The FieldCache is using (IndexReader,SortField.type,Parser) as key.

          If the FieldCache is used for CachingFilters, there is also no problem: The new search algorithm executes each filter's getDocIDSet() for each single SegmentReader.

          So the only problem is, that with the new search implementation, you cannot rely anymore on the fact, that for a MultiReader only one FieldCache exists and every filter's getDocIdSet() is executed only one time. So injecting a custom FieldCache into the cache for the whole MultiReader before search (by getting it) is not possible anymore.

          I tested the new search impl with trie fields, sorting works perfect using TrieUtils.getSortField()/TrieUtils.LONG_PARSER, no leaks (because FieldParser is singleton). I had to only modify the test case (Revision: 737079), because with an unoptimized index, the statistics for retrieving the number of visited terms did not work anymore (because the Filter was called more than once per search).

          The problem with stale entries, if the external file changes is another problem, that also happend before the new sort impl.

          It seems like LUCENE-831, which would expose / allow custom control over FieldCache's caching impl, would help here too.

          Yes!

          Show
          Uwe Schindler added a comment - Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries). As noted before on Dec 9, 08, the parser should be a singleton for all field caches or have hashCode()/equals(). You create a FloatParser and supply it to SortField. When then sorting against this field, the new sort implementation would create a FieldCache for each segment. When one segment is reloaded, the FieldCache gets unused and a new one for the replacement segment is created (this is the same like with the standard field parser). The FieldCache is using (IndexReader,SortField.type,Parser) as key. If the FieldCache is used for CachingFilters, there is also no problem: The new search algorithm executes each filter's getDocIDSet() for each single SegmentReader. So the only problem is, that with the new search implementation, you cannot rely anymore on the fact, that for a MultiReader only one FieldCache exists and every filter's getDocIdSet() is executed only one time. So injecting a custom FieldCache into the cache for the whole MultiReader before search (by getting it) is not possible anymore. I tested the new search impl with trie fields, sorting works perfect using TrieUtils.getSortField()/TrieUtils.LONG_PARSER, no leaks (because FieldParser is singleton). I had to only modify the test case (Revision: 737079), because with an unoptimized index, the statistics for retrieving the number of visited terms did not work anymore (because the Filter was called more than once per search). The problem with stale entries, if the external file changes is another problem, that also happend before the new sort impl. It seems like LUCENE-831 , which would expose / allow custom control over FieldCache's caching impl, would help here too. Yes!
          Hide
          Michael McCandless added a comment -

          Write a FloatParser that maps the the uniqueKey to a float value using the external file.

          Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries).

          It seems like LUCENE-831, which would expose / allow custom control over FieldCache's caching impl, would help here too.

          Show
          Michael McCandless added a comment - Write a FloatParser that maps the the uniqueKey to a float value using the external file. Uwe, would that result in a memory leak? Ie, a single long-lived segment would accumulate multiple entries for each new XXXParser instance used during sorting? (Unless there's logic to evict the "stale" entries). It seems like LUCENE-831 , which would expose / allow custom control over FieldCache's caching impl, would help here too.
          Hide
          Michael McCandless added a comment -

          Yonik, why was the failure so intermittent?

          So it sounds like Solr was relying on Lucene loading the entire
          float[] for all docs in the MultiSegmentReader, when only some
          segments were new? (And so it was LUCENE-1483 that caused the
          failure).

          Lucene implicitly assumes that a FieldCache's arrays do not change for
          a given segment; this is normally safe since the arrays are derived
          from the postings in the field (which are write once).

          But it sounds like Solr changed that assumption, and the values in the
          (Solr-subclass of) FieldCache's arrays are now derived from something
          external, which is no longer write once.

          How do you plan to fix it with Solr? It seems like, since you are
          maintaining a private cache, you could forcefully evict entries from
          the cache for all SegmentReaders whenever the external file has
          changed (or a new MultiSegmentReader had been opened)?

          Show
          Michael McCandless added a comment - Yonik, why was the failure so intermittent? So it sounds like Solr was relying on Lucene loading the entire float[] for all docs in the MultiSegmentReader, when only some segments were new? (And so it was LUCENE-1483 that caused the failure). Lucene implicitly assumes that a FieldCache's arrays do not change for a given segment; this is normally safe since the arrays are derived from the postings in the field (which are write once). But it sounds like Solr changed that assumption, and the values in the (Solr-subclass of) FieldCache's arrays are now derived from something external, which is no longer write once. How do you plan to fix it with Solr? It seems like, since you are maintaining a private cache, you could forcefully evict entries from the cache for all SegmentReaders whenever the external file has changed (or a new MultiSegmentReader had been opened)?
          Hide
          Uwe Schindler added a comment -

          After reading your comment several times and looking into ExternalFileField, the problem can maybe fixed using this issue:

          You problem may come from the fact, that you instantated this own FieldCache implementation, keyed it with the parent MultiReader and the new Sort implementation does not use it.

          Write a FloatParser that maps the the uniqueKey to a float value using the external file. If you want to use it with the new MultiReaderHitCollector sorting (LUCENE-1483), create the SortField using the parser interface parameter from this issue. When the field caches for searching are now rebuild (even only for one of them in a Multi(Segment)Reader)), the new Lucene search API will re-read the changed segments and build several new FieldCaches (standard ones) using the suplied parser. There is no longer the need for a extra FieldCache implementation with this issue.

          Is this your problem? Maybe this is why you wrote the comment here.

          Show
          Uwe Schindler added a comment - After reading your comment several times and looking into ExternalFileField, the problem can maybe fixed using this issue: You problem may come from the fact, that you instantated this own FieldCache implementation, keyed it with the parent MultiReader and the new Sort implementation does not use it. Write a FloatParser that maps the the uniqueKey to a float value using the external file. If you want to use it with the new MultiReaderHitCollector sorting ( LUCENE-1483 ), create the SortField using the parser interface parameter from this issue. When the field caches for searching are now rebuild (even only for one of them in a Multi(Segment)Reader)), the new Lucene search API will re-read the changed segments and build several new FieldCaches (standard ones) using the suplied parser. There is no longer the need for a extra FieldCache implementation with this issue. Is this your problem? Maybe this is why you wrote the comment here.
          Hide
          Uwe Schindler added a comment -

          I am not really sure, how this patch can can change this. Maybe you have more a problem with the new MultiReaderHitCollector patch (LUCENE-1483) - your comment looks like this? The current patch does not change anything, as long as you use the old SortField constructors.

          Can you explain a little bit more, maybe a test?

          Show
          Uwe Schindler added a comment - I am not really sure, how this patch can can change this. Maybe you have more a problem with the new MultiReaderHitCollector patch ( LUCENE-1483 ) - your comment looks like this? The current patch does not change anything, as long as you use the old SortField constructors. Can you explain a little bit more, maybe a test?
          Hide
          Yonik Seeley added a comment -

          I tracked down how this patch was causing Solr failures:

          ExternalFileField in Solr maps from a uniqueKey to a float value from a separate file.
          There is a cache that is essentially keyed by (IndexReader,field) that gives back a float[].

          Any change in the index used to cause all values to be updated (cache miss because the MultiReader was a different instance). Now, since it's called segment-at-a-time, only new segments are reloaded from the file, leaving older segments with stale values.

          It's certainly in the very gray area... but perhaps Solr won't be the only one affected by this - maybe apps that implement security filters, etc?

          Show
          Yonik Seeley added a comment - I tracked down how this patch was causing Solr failures: ExternalFileField in Solr maps from a uniqueKey to a float value from a separate file. There is a cache that is essentially keyed by (IndexReader,field) that gives back a float[]. Any change in the index used to cause all values to be updated (cache miss because the MultiReader was a different instance). Now, since it's called segment-at-a-time, only new segments are reloaded from the file, leaving older segments with stale values. It's certainly in the very gray area... but perhaps Solr won't be the only one affected by this - maybe apps that implement security filters, etc?
          Hide
          Uwe Schindler added a comment - - edited

          Just a note:
          For the FieldCache it is also important, that the parser is a singleton or implements hashCode() and equals(). If not, each call to sort with another SortField using a different parser instance (but from the same class) would create a new FieldCache. This is why I said, that SortComparators and Parsers should generally be made static final members (and so singletons) like I have done it in TrieUtils. With that you can be sure, that all SortFields hit the same cache entry when looking up using FieldCacheImpl.Entry. The implementation of hashCode and equals for parsers is the other variant, but it does not make really sense (as long as parsers and comparators do not have an instance-specific state).

          Show
          Uwe Schindler added a comment - - edited Just a note: For the FieldCache it is also important, that the parser is a singleton or implements hashCode() and equals(). If not, each call to sort with another SortField using a different parser instance (but from the same class) would create a new FieldCache. This is why I said, that SortComparators and Parsers should generally be made static final members (and so singletons) like I have done it in TrieUtils. With that you can be sure, that all SortFields hit the same cache entry when looking up using FieldCacheImpl.Entry. The implementation of hashCode and equals for parsers is the other variant, but it does not make really sense (as long as parsers and comparators do not have an instance-specific state).
          Hide
          Michael McCandless added a comment -

          No problem – Committed revision 724552. Thanks!

          Show
          Michael McCandless added a comment - No problem – Committed revision 724552. Thanks!
          Hide
          Uwe Schindler added a comment -

          Hi Mike,
          sorry, after looking a second time into the new SortField ctors, I chaged two cosmetic things:

          • The ctor for parser assigns this.type=type and later calls the init method with the member variable type. The init method assigns so the meber to the member agian. Cleaner is just call the initFieldType() method in the correct instanceof clause hit.
          • Moved the initFieldType() behind all ctors.

          But this is only cosmetic

          Show
          Uwe Schindler added a comment - Hi Mike, sorry, after looking a second time into the new SortField ctors, I chaged two cosmetic things: The ctor for parser assigns this.type=type and later calls the init method with the member variable type. The init method assigns so the meber to the member agian. Cleaner is just call the initFieldType() method in the correct instanceof clause hit. Moved the initFieldType() behind all ctors. But this is only cosmetic
          Hide
          Michael McCandless added a comment -

          Committed revision 724484.

          Thanks Uwe!

          Show
          Michael McCandless added a comment - Committed revision 724484. Thanks Uwe!
          Hide
          Uwe Schindler added a comment -

          Hi Mike,
          all is ok. The extra check is better than before! I think its ready for commit.
          Thanks for the discussions!

          Show
          Uwe Schindler added a comment - Hi Mike, all is ok. The extra check is better than before! I think its ready for commit. Thanks for the discussions!
          Hide
          Michael McCandless added a comment -

          New patch attached:

          Maybe you should add the parser here, too.

          OK done.

          The default Object equals/hashcode is enough for singletons.

          OK I updated javadoc to add "unless a singleton is always used".

          The additional null check is OK but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too.

          Yeah I added that because the javadoc had previously said field could
          be null for DOC/SCORE sort type. OK I added that check for all ctors
          (added initFieldType private utility method).

          Show
          Michael McCandless added a comment - New patch attached: Maybe you should add the parser here, too. OK done. The default Object equals/hashcode is enough for singletons. OK I updated javadoc to add "unless a singleton is always used". The additional null check is OK but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too. Yeah I added that because the javadoc had previously said field could be null for DOC/SCORE sort type. OK I added that check for all ctors (added initFieldType private utility method).
          Hide
          Uwe Schindler added a comment - - edited

          Hi Mike,

          patch looks good, checked each change of you with TortoiseMerge. All tests pass including Trie ones.

          The only comments: You added this java docs to hashCode() and equals() in the patch of LUCENE-1481. Maybe you should add the parser here, too.

            /** Returns a hash code value for this object.  If a
             *  {@link SortComparatorSource} was provided, it must
             *  properly implement hashCode. */
          

          But on the other hand side: If the parser and/or comparator are static singletons (like it is done by the TrieUtils factories) they are not needed to implement equals and hashcode. The default Object equals/hashcode is enough for singletons. And I think most parsers and comparators are singletons. A short not should be enough.

          The additional null check is OK but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too.

          Show
          Uwe Schindler added a comment - - edited Hi Mike, patch looks good, checked each change of you with TortoiseMerge. All tests pass including Trie ones. The only comments: You added this java docs to hashCode() and equals() in the patch of LUCENE-1481 . Maybe you should add the parser here, too. /** Returns a hash code value for this object. If a * {@link SortComparatorSource} was provided, it must * properly implement hashCode. */ But on the other hand side: If the parser and/or comparator are static singletons (like it is done by the TrieUtils factories) they are not needed to implement equals and hashcode. The default Object equals/hashcode is enough for singletons. And I think most parsers and comparators are singletons. A short not should be enough. The additional null check is OK but in my opinion not needed, because field!=null when not one of the special RELEVANCE/DOCORDER sorts. For consistency we may add the check to the other ctors, too.
          Hide
          Michael McCandless added a comment -

          OK I made a few small changes to the patch: added CHANGES entry, touched up javadocs, and added null check for field in the new SortField ctors. I think it's ready to commit!

          Uwe can you look over my changes? Thanks.

          Show
          Michael McCandless added a comment - OK I made a few small changes to the patch: added CHANGES entry, touched up javadocs, and added null check for field in the new SortField ctors. I think it's ready to commit! Uwe can you look over my changes? Thanks.
          Hide
          Uwe Schindler added a comment -

          As LUCENE-1481 is committed, here the updated patch with SortField.hashCode() and equals()

          Show
          Uwe Schindler added a comment - As LUCENE-1481 is committed, here the updated patch with SortField.hashCode() and equals()
          Hide
          Uwe Schindler added a comment -

          New patch that integrates Mike's comments. This version still uses a super-interface. If we want to remove it, we can create 6*2 new constructors (for each parser with and without reverse) and use Object for the internal member variable holding the parser. In FieldSortedHitQueue, the parser is stored in the CacheEntry's custom member which is also Object..., so the supertype is only used in SortField.

          12 new constructors are (in my eyes) very bad and from the JavaDoc side very heavy. The only advantage of them is:

          • removing the super-interface
          • the if-queue of instanceof tests can be removed because of stronger typing
          Show
          Uwe Schindler added a comment - New patch that integrates Mike's comments. This version still uses a super-interface. If we want to remove it, we can create 6*2 new constructors (for each parser with and without reverse) and use Object for the internal member variable holding the parser. In FieldSortedHitQueue, the parser is stored in the CacheEntry's custom member which is also Object..., so the supertype is only used in SortField. 12 new constructors are (in my eyes) very bad and from the JavaDoc side very heavy. The only advantage of them is: removing the super-interface the if-queue of instanceof tests can be removed because of stronger typing
          Hide
          Michael McCandless added a comment -

          The problem is, that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE.

          Whoa, nice catch! I think it's fine to include the fix here (it was not manifesting as a bug w/o your changes).

          The patch also provides a test case in "TestSort" that uses a custom parser, that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting.

          Awesome, I like the new test!

          Additionally the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException, if not correct

          I like this stronger typing. Could you change the message thrown exception to detail the type that was given and what parser was provided (better transparency)?

          SortField could have separate constructors for each parser type without type parameter

          That is a good idea, I think. The two args are redundant & therefore a source of confusion/error. It is annoying how we keep having to "multiply by N" so many places in Lucene that want to switch on the different builtin types (LUCENE-831 has this too).

          Changing this would require some more work in the FieldSortedHitQueue, as copying the SortField would require more work

          Can we just SortField.clone() instead of "new SortField(lots-of-tricky-args)" in FieldSortedHitQueue?

          this patch also extends contrib's TrieUtils and test cases to support a static trie-tolong/double-parser and a SortField factory to handle trie-encoded fields very simple using the long fieldcache

          Great!

          Show
          Michael McCandless added a comment - The problem is, that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE. Whoa, nice catch! I think it's fine to include the fix here (it was not manifesting as a bug w/o your changes). The patch also provides a test case in "TestSort" that uses a custom parser, that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting. Awesome, I like the new test! Additionally the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException, if not correct I like this stronger typing. Could you change the message thrown exception to detail the type that was given and what parser was provided (better transparency)? SortField could have separate constructors for each parser type without type parameter That is a good idea, I think. The two args are redundant & therefore a source of confusion/error. It is annoying how we keep having to "multiply by N" so many places in Lucene that want to switch on the different builtin types ( LUCENE-831 has this too). Changing this would require some more work in the FieldSortedHitQueue, as copying the SortField would require more work Can we just SortField.clone() instead of "new SortField(lots-of-tricky-args)" in FieldSortedHitQueue? this patch also extends contrib's TrieUtils and test cases to support a static trie-tolong/double-parser and a SortField factory to handle trie-encoded fields very simple using the long fieldcache Great!
          Hide
          Uwe Schindler added a comment -

          I forget to mention, this patch also extends contrib's TrieUtils and test cases to support a static trie-tolong/double-parser and a SortField factory to handle trie-encoded fields very simple using the long fieldcache (for sorting, there need to be no difference between long/double, it will work for all 3 trie encodings and long is the simpliest to handle an compare) - for which this patch was mainly invented. The double parser instance is only supplied for other usages of FieldCache from TrieUtils.

          Show
          Uwe Schindler added a comment - I forget to mention, this patch also extends contrib's TrieUtils and test cases to support a static trie-tolong/double-parser and a SortField factory to handle trie-encoded fields very simple using the long fieldcache (for sorting, there need to be no difference between long/double, it will work for all 3 trie encodings and long is the simpliest to handle an compare) - for which this patch was mainly invented. The double parser instance is only supplied for other usages of FieldCache from TrieUtils.
          Hide
          Uwe Schindler added a comment -

          Here is the patch using the superinterface for all field parsers. The patch also provides a test case in "TestSort" that uses a custom parser, that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting.

          Also fixed is the bug with the sortType() in comparator as mentioned before, without this, the test case for bytes would not pass.

          Additionally the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException, if not correct (e.g. using a LongParser with SortField.BYTE). During inventing this patch I had another idea to fix the issue with not to have a new super-interface: SortField could have separate constructors for each parser type without type parameter (as type is forced by parser instance – we could also remove the type parameter in the current patch, as the type maybe set by instanceof operator). Changing this would require some more work in the FieldSortedHitQueue, as copying the SortField would require more work (but could be fixed better as is is currently: a new SortField instance in the constructor is only needed, if SortField.AUTO or CUSTOM was used, in all other cases, the SortField instance can just be directly used).

          Just give some comments!

          Show
          Uwe Schindler added a comment - Here is the patch using the superinterface for all field parsers. The patch also provides a test case in "TestSort" that uses a custom parser, that maps a simple char value from 'A' to 'J' to a byte, short, int, long, float, double with some crazy algorithm to test sorting. Also fixed is the bug with the sortType() in comparator as mentioned before, without this, the test case for bytes would not pass. Additionally the new SortField constructor checks the sort type and corresponding parser for consistency and throws IllegalArgumentException, if not correct (e.g. using a LongParser with SortField.BYTE). During inventing this patch I had another idea to fix the issue with not to have a new super-interface: SortField could have separate constructors for each parser type without type parameter (as type is forced by parser instance – we could also remove the type parameter in the current patch, as the type maybe set by instanceof operator). Changing this would require some more work in the FieldSortedHitQueue, as copying the SortField would require more work (but could be fixed better as is is currently: a new SortField instance in the constructor is only needed, if SortField.AUTO or CUSTOM was used, in all other cases, the SortField instance can just be directly used). Just give some comments!
          Hide
          Uwe Schindler added a comment -

          I found a hidden bug in FieldSortedHitQueue that materialized when writing a TestCase for a SortField.BYTE sorting with custom parser (it took me a long time to find out whats happening). The problem is, that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE. The effect of this was, that my new code crahsed when generating the new SortField in the constructor because of the incorrect Parser/SortField type. The other problem was, that on later calling the cache again, it would return another hit, because the type was different. The problem is, that this does not affect correctness of the previous implementation, but may coud lead to errors, like with my new impl.

          I fix this bug (its just one line) in my next patch (I will supply it shortly). Is it OK, or should I open a separate bug report?

          Show
          Uwe Schindler added a comment - I found a hidden bug in FieldSortedHitQueue that materialized when writing a TestCase for a SortField.BYTE sorting with custom parser (it took me a long time to find out whats happening). The problem is, that the comparator for byte fields returned SortField.INT in sortType() instead of SortField.BYTE. The effect of this was, that my new code crahsed when generating the new SortField in the constructor because of the incorrect Parser/SortField type. The other problem was, that on later calling the cache again, it would return another hit, because the type was different. The problem is, that this does not affect correctness of the previous implementation, but may coud lead to errors, like with my new impl. I fix this bug (its just one line) in my next patch (I will supply it shortly). Is it OK, or should I open a separate bug report?
          Hide
          Michael McCandless added a comment -

          Patch looks good, thanks Uwe! Back compat looks preserved; while some
          APIs (FieldSortedHitQueue.getCachedComparator) were changed, they are
          package private.

          Back-compat tests ("ant test-tag") pass as well.

          For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)].

          It looks like this didn't make it into the patch – could you add it?

          Actually, adding a core test case would also be good. It could be
          something silly, eg that parses ints but negates them, and then assert
          that this yields the same result as the default IntParser with
          reverse=true (assuming no ties).

          If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache)

          I agree, I would like to at least get some minimal static typing into
          the API (Object is not ideal) even if it's simply a "marker" interface
          If you're sure this can be done, such that all changes to
          FieldCache/ExtendedFieldCache remain back compatibitle, then let's do
          it? And I think I do now agree: this can be done w/o breaking back
          compat. The only affected public methods should be your new SortField
          methods, which is fine (no public methods take "Object parser" as far
          as I can tell).

          Show
          Michael McCandless added a comment - Patch looks good, thanks Uwe! Back compat looks preserved; while some APIs (FieldSortedHitQueue.getCachedComparator) were changed, they are package private. Back-compat tests ("ant test-tag") pass as well. For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)] . It looks like this didn't make it into the patch – could you add it? Actually, adding a core test case would also be good. It could be something silly, eg that parses ints but negates them, and then assert that this yields the same result as the default IntParser with reverse=true (assuming no ties). If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache) I agree, I would like to at least get some minimal static typing into the API (Object is not ideal) even if it's simply a "marker" interface If you're sure this can be done, such that all changes to FieldCache/ExtendedFieldCache remain back compatibitle, then let's do it? And I think I do now agree: this can be done w/o breaking back compat. The only affected public methods should be your new SortField methods, which is fine (no public methods take "Object parser" as far as I can tell).
          Hide
          Uwe Schindler added a comment -

          Attached is a patch that implements the first variant (without super interface for all FieldParsers). All current tests pass. A special test case for a custum field parser was not implemented.

          For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)].

          A good test case would be to store some dates in ISO format in a field and then sort it as longs after parsing using SimpleDateFormat. This would be another typical use case (sorting by date, but not using SortField.STRING to minimize memory usage).

          If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache)

          Show
          Uwe Schindler added a comment - Attached is a patch that implements the first variant (without super interface for all FieldParsers). All current tests pass. A special test case for a custum field parser was not implemented. For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decoded the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)] . A good test case would be to store some dates in ISO format in a field and then sort it as longs after parsing using SimpleDateFormat. This would be another typical use case (sorting by date, but not using SortField.STRING to minimize memory usage). If you like my patch, we could also discuss about using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor would be affected and some casts, and of course: the superinterface in all declarations inside FieldCache, ExtendedFieldCache)

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development