Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940) and if possible I want to move it to core before release of 2.9.
      Before this can be done, there are some things to think about:

      1. There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how should they be called in core? I would suggest to leave it as it is. On the other hand, if this keeps our only numeric query implementation, we could call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here are problems). Same for the TokenStreams and Filters.
      2. Maybe the pairs of classes for indexing and searching should be moved into one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to pass int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add a L to a long, it suddenly instantiates a int version of range query, hitting no results and so on. Same with other types. Maybe accept java.lang.Number as parameter (because nullable for half-open bounds) and one enum for the type.
      3. TrieUtils move into o.a.l.util? or document or?
      4. Move TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else?
      5. If we rename the classes, should Solr stay with Trie (because there are different impls)?
      6. Maybe add a subclass of AbstractField, that automatically creates these TokenStreams and omits norms/tf per default for easier addition to Document instances?
      1. LUCENE-1673.patch
        127 kB
        Uwe Schindler
      2. LUCENE-1673.patch
        119 kB
        Uwe Schindler
      3. LUCENE-1673.patch
        119 kB
        Uwe Schindler
      4. LUCENE-1673.patch
        116 kB
        Uwe Schindler
      5. LUCENE-1673.patch
        103 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -
          • Committed addition to core in revision 786470
          • Committed remove from contrib in revision 786474

          Thanks for all your help! I will now open another issue for the remaining (optional) tasks

          Show
          Uwe Schindler added a comment - Committed addition to core in revision 786470 Committed remove from contrib in revision 786474 Thanks for all your help! I will now open another issue for the remaining (optional) tasks
          Hide
          Michael McCandless added a comment -

          Latest patch looks good Uwe! We can separately tweak the javadocs...

          Show
          Michael McCandless added a comment - Latest patch looks good Uwe! We can separately tweak the javadocs...
          Hide
          Uwe Schindler added a comment -

          Final patch version with updated javadocs. I will commit in a day or two
          When committing, I will also remove TrieRange from contrib/search (not included in patch).

          If you want to make javadocs updates, feel free to post an updated patch or do it after I committed.

          After that I will do some work for NumericField and NumericSortField as well as moving the parsers to FieldCache and make the plain-text-number parsers public there, too.

          Show
          Uwe Schindler added a comment - Final patch version with updated javadocs. I will commit in a day or two When committing, I will also remove TrieRange from contrib/search (not included in patch). If you want to make javadocs updates, feel free to post an updated patch or do it after I committed. After that I will do some work for NumericField and NumericSortField as well as moving the parsers to FieldCache and make the plain-text-number parsers public there, too.
          Hide
          Uwe Schindler added a comment -

          Here some intermediate update...

          Show
          Uwe Schindler added a comment - Here some intermediate update...
          Hide
          Michael McCandless added a comment -

          Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works.

          I don't think this is "convenient" enough.

          If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs

          +1

          Long (medium?) term I'd love to get to this point; I think it'd make
          Lucene quite a bit more consumable. But we shouldn't sacrifice
          consumability today on the hope for that future nirvana.

          You already have a nice starting point here... is that something you
          could donate?

          I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!)

          How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible.

          I completely agree: you're not supposed to get the original doc back.
          And the fact that Lucene's API now "pretends" you do, is wrong. We all
          agree to that, and that we need to fix Lucene.

          But, as things now stand, it's not yet fixed, so until it's fixed, I
          don't like intentionally making it worse.

          It'd be great to simply stop returning Document from IndexReader.
          Wanna make a patch? I don't think the new sheriff'd hold 2.9 for this
          though

          "hey how come I didn't get a NumericField back on my doc?

          Perhaps a good reason to not add a NumericField.

          I think NumericField (when building your doc) is still valuable, even
          if we can't return NumericField when you retrieve the doc.

          OK... since adding the bit to the stored fields is controversial, I
          think for 2.9, we should only add NumericField at indexing (document
          creation) time. So, we don't store a new bit in stored fields file
          and the index format is unchanged.

          Show
          Michael McCandless added a comment - Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works. I don't think this is "convenient" enough. If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs +1 Long (medium?) term I'd love to get to this point; I think it'd make Lucene quite a bit more consumable. But we shouldn't sacrifice consumability today on the hope for that future nirvana. You already have a nice starting point here... is that something you could donate? I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!) How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible. I completely agree: you're not supposed to get the original doc back. And the fact that Lucene's API now "pretends" you do, is wrong. We all agree to that, and that we need to fix Lucene. But, as things now stand, it's not yet fixed, so until it's fixed, I don't like intentionally making it worse. It'd be great to simply stop returning Document from IndexReader. Wanna make a patch? I don't think the new sheriff'd hold 2.9 for this though "hey how come I didn't get a NumericField back on my doc? Perhaps a good reason to not add a NumericField. I think NumericField (when building your doc) is still valuable, even if we can't return NumericField when you retrieve the doc. OK... since adding the bit to the stored fields is controversial, I think for 2.9, we should only add NumericField at indexing (document creation) time. So, we don't store a new bit in stored fields file and the index format is unchanged.
          Hide
          Michael McCandless added a comment -

          Note that LUCENE-1505 is open for cutting over contrib/spacial to NumericUtils....

          Show
          Michael McCandless added a comment - Note that LUCENE-1505 is open for cutting over contrib/spacial to NumericUtils....
          Hide
          Earwin Burrfoot added a comment -

          This is that baking in a specific implementation into the index format that I don't like.

          +many

          I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!)

          How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible.

          The less coupling various lucene components have on each other - the better. If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs. If it's hard to build - Lucene core is to blame, it's not extensible enough. From my experience, for that purporse it's okay as it is.

          Show
          Earwin Burrfoot added a comment - This is that baking in a specific implementation into the index format that I don't like. +many I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!) How on earth is it buggy? You're working with an inverted index, you aren't supposed to get original document from it in the first place. It's like saying a hash function is buggy because it is not reversible. The less coupling various lucene components have on each other - the better. If you'd like to have end-to-end experience for numeric fields, build something schema-like and put it in contribs. If it's hard to build - Lucene core is to blame, it's not extensible enough. From my experience, for that purporse it's okay as it is.
          Hide
          Yonik Seeley added a comment - - edited

          But we are already "baking in" the trie indexing format? That's what "moving trie to core" implies.

          Nah - no more than the porter stemmer or any other type of analysis is "baked in".
          I thought "move" meant "rename" (package and class name). Upgrading it's stability and how "core" it was.

          "hey how come I didn't get a NumericField back on my doc?

          Perhaps a good reason to not add a NumericField. It doesn't currently exist and is not necessary for Trie.
          Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works.

          The current Trie behavior works the same way everything else does in Lucene... changing that and encoding types into the index deserves it's own issue and discussion (and something big like that doesn't seem to belong in 2.9 which is winding down).

          Show
          Yonik Seeley added a comment - - edited But we are already "baking in" the trie indexing format? That's what "moving trie to core" implies. Nah - no more than the porter stemmer or any other type of analysis is "baked in". I thought "move" meant "rename" (package and class name). Upgrading it's stability and how "core" it was. "hey how come I didn't get a NumericField back on my doc? Perhaps a good reason to not add a NumericField. It doesn't currently exist and is not necessary for Trie. Want a convenience method for the user? TrieUtils.createDocumentField(...) , same as the sortField currently works. The current Trie behavior works the same way everything else does in Lucene... changing that and encoding types into the index deserves it's own issue and discussion (and something big like that doesn't seem to belong in 2.9 which is winding down).
          Hide
          Michael McCandless added a comment -

          We could easily add "numeric"; then FieldsReader would return a NumericField.

          This is that baking in a specific implementation into the index format that I don't like.

          But we are already "baking in" the trie indexing format? That's what
          "moving trie to core" implies. Lucene can now index numbers, well,
          and has committed to a certain approach (trie).

          The term dict of a numeric field is trie encoded, each doc field is
          indexed under a series of trie encoded tokens (w/ different
          precisions), etc.

          Sure, in the future we may find improvements to how Lucene indexes
          numbers, by why choose to be buggy today ("hey how come I didn't get a
          NumericField back on my doc?") for this possible future that may or
          may not come? If/when that future arrives, we can improve the index
          format at that point rather than intentionally create buggy code
          today?

          I do agree that retrieving a doc is already "buggy", in that various
          things are lost from your index time doc (a well known issue at this
          point!), but I don't think we should intentionally make that behavior
          even more buggy, if we can help it...

          Show
          Michael McCandless added a comment - We could easily add "numeric"; then FieldsReader would return a NumericField. This is that baking in a specific implementation into the index format that I don't like. But we are already "baking in" the trie indexing format? That's what "moving trie to core" implies. Lucene can now index numbers, well, and has committed to a certain approach (trie). The term dict of a numeric field is trie encoded, each doc field is indexed under a series of trie encoded tokens (w/ different precisions), etc. Sure, in the future we may find improvements to how Lucene indexes numbers, by why choose to be buggy today ("hey how come I didn't get a NumericField back on my doc?") for this possible future that may or may not come? If/when that future arrives, we can improve the index format at that point rather than intentionally create buggy code today? I do agree that retrieving a doc is already "buggy", in that various things are lost from your index time doc (a well known issue at this point!), but I don't think we should intentionally make that behavior even more buggy, if we can help it...
          Hide
          Michael McCandless added a comment -

          With a NumericTermQuery you would only hit the document exactly on the same millisecond.

          Couldn't I quantize my times (say, by day), then index using
          NumericField, then use NumericRangeQuery for certain cases and
          NumericTermQuery if I want to precisely match one day?

          Or for non-Date fields it might be apply too... maybe I'm searching
          for a hard drive, so I drill down by range (size >= 1.0 TB) and then
          when there's only a few sizes in the set we offer up the exact choices
          (1.0, 1.5, 2.0 TB) and when user clicks on one of those we use a
          NumericTermQuery.

          Because of this different use cases, in my opinion, DateTools has its usage.

          OK I agree, we should leave DateTools un-deprecated. If/when we offer
          easier integration for Dates w/ Numeric*, we can reconsider
          deprecation at that point.

          Show
          Michael McCandless added a comment - With a NumericTermQuery you would only hit the document exactly on the same millisecond. Couldn't I quantize my times (say, by day), then index using NumericField, then use NumericRangeQuery for certain cases and NumericTermQuery if I want to precisely match one day? Or for non-Date fields it might be apply too... maybe I'm searching for a hard drive, so I drill down by range (size >= 1.0 TB) and then when there's only a few sizes in the set we offer up the exact choices (1.0, 1.5, 2.0 TB) and when user clicks on one of those we use a NumericTermQuery. Because of this different use cases, in my opinion, DateTools has its usage. OK I agree, we should leave DateTools un-deprecated. If/when we offer easier integration for Dates w/ Numeric*, we can reconsider deprecation at that point.
          Hide
          Yonik Seeley added a comment -

          We could easily add "numeric"; then FieldsReader would return a NumericField.

          This is that baking in a specific implementation into the index format that I don't like.
          There will be changes to Trie*, there will be other implementations of numerics by both us and other users. We don't need to strongly couple core indexing and the types of fields... they aren't coupled now except when the generic format of the index changes (like omitNorms, omitTf, indexed, etc).

          Show
          Yonik Seeley added a comment - We could easily add "numeric"; then FieldsReader would return a NumericField. This is that baking in a specific implementation into the index format that I don't like. There will be changes to Trie*, there will be other implementations of numerics by both us and other users. We don't need to strongly couple core indexing and the types of fields... they aren't coupled now except when the generic format of the index changes (like omitNorms, omitTf, indexed, etc).
          Hide
          Uwe Schindler added a comment -

          With a NumericTermQuery you would only hit the document exactly on the same millisecond.

          The good thing behind DateTools is, that you can index the date value as a term with some fixed precision, like months. Because of this, you can simply find specific month using one TermQuery. For Ranges, NumericRangeQuery is in most cases better (not with month resolution).

          With NumericRangeQuery it is hard to hit exactly one month using only one term, because the month boundaries in epoch milliseconds is not exactly a 2^n value. In my opinion:

          • DateTools is good to index very coarse dates, months, years out of a java.util.Date/Calendar. E.g. days where a room (document) is free in hotel. Users then can use term queries and ask is there any free room on a specific date, for a date range, it is not bad to use a conventional RangeQuery (only few terms affected).
          • Use NumericRangeQuery if you want to query any date range (even downto the millisecond). The important thing is: the lower precision terms are not at common date boundaries.

          Because of this different use cases, in my opinion, DateTools has its usage.

          Show
          Uwe Schindler added a comment - With a NumericTermQuery you would only hit the document exactly on the same millisecond. The good thing behind DateTools is, that you can index the date value as a term with some fixed precision, like months. Because of this, you can simply find specific month using one TermQuery. For Ranges, NumericRangeQuery is in most cases better (not with month resolution). With NumericRangeQuery it is hard to hit exactly one month using only one term, because the month boundaries in epoch milliseconds is not exactly a 2^n value. In my opinion: DateTools is good to index very coarse dates, months, years out of a java.util.Date/Calendar. E.g. days where a room (document) is free in hotel. Users then can use term queries and ask is there any free room on a specific date, for a date range, it is not bad to use a conventional RangeQuery (only few terms affected). Use NumericRangeQuery if you want to query any date range (even downto the millisecond). The important thing is: the lower precision terms are not at common date boundaries. Because of this different use cases, in my opinion, DateTools has its usage.
          Hide
          Michael McCandless added a comment -

          I only wanted to hear one more voice about DateTools, because for index size and so on, it may still be good to only index dates in date-granularity. With this, you can use a simple TermQuery to retrieve all docs for that day, with NumericRangeQuery you must create a NumericRangeQuery.newLongRange() on the unix ts from 0:00 on the day to 0:00 on the following day exclusive.

          Couldn't we have a NumericTermQuery for such cases? You have the full precision term in the index...

          Show
          Michael McCandless added a comment - I only wanted to hear one more voice about DateTools, because for index size and so on, it may still be good to only index dates in date-granularity. With this, you can use a simple TermQuery to retrieve all docs for that day, with NumericRangeQuery you must create a NumericRangeQuery.newLongRange() on the unix ts from 0:00 on the day to 0:00 on the following day exclusive. Couldn't we have a NumericTermQuery for such cases? You have the full precision term in the index...
          Hide
          Uwe Schindler added a comment -

          Actually, this need not be a limitation; FieldsWriter already writes bits recording details for each stored field (binary, tokenized, compressed (deprecated)). We could easily add "numeric"; then FieldsReader would return a NumericField

          With all problems in FieldsReader like need to have a LazyNumericField and so on... I cam around this two month ago when fixing this omitTf things there... But it may be an idea.

          I will do some changes to the current patch and fix javadocs and add these package.html parts. The SortField and FieldCache parts are done directly after this issue.

          I only wanted to hear one more voice about DateTools, because for index size and so on, it may still be good to only index dates in date-granularity. With this, you can use a simple TermQuery to retrieve all docs for that day, with NumericRangeQuery you must create a NumericRangeQuery.newLongRange() on the unix ts from 0:00 on the day to 0:00 on the following day exclusive.

          Show
          Uwe Schindler added a comment - Actually, this need not be a limitation; FieldsWriter already writes bits recording details for each stored field (binary, tokenized, compressed (deprecated)). We could easily add "numeric"; then FieldsReader would return a NumericField With all problems in FieldsReader like need to have a LazyNumericField and so on... I cam around this two month ago when fixing this omitTf things there... But it may be an idea. I will do some changes to the current patch and fix javadocs and add these package.html parts. The SortField and FieldCache parts are done directly after this issue. I only wanted to hear one more voice about DateTools, because for index size and so on, it may still be good to only index dates in date-granularity. With this, you can use a simple TermQuery to retrieve all docs for that day, with NumericRangeQuery you must create a NumericRangeQuery.newLongRange() on the unix ts from 0:00 on the day to 0:00 on the following day exclusive.
          Hide
          Michael McCandless added a comment -

          NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field.

          Actually, this need not be a limitation; FieldsWriter already writes bits recording details for each stored field (binary, tokenized, compressed (deprecated)). We could easily add "numeric"; then FieldsReader would return a NumericField.

          Show
          Michael McCandless added a comment - NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field. Actually, this need not be a limitation; FieldsWriter already writes bits recording details for each stored field (binary, tokenized, compressed (deprecated)). We could easily add "numeric"; then FieldsReader would return a NumericField.
          Hide
          Michael McCandless added a comment -

          I think deprecating DateTools makes sense, though, we should add simple code fragments showing the migration.

          NumericRangeQuery's javadocs are great, but I'd like to crispen it up, by decoupling "how you use it" from "how it's implemented". EG lead right off with a "for the impatient, this is how it's used", and then a separate section detailing how it works, what precisionStep means (and tradeoffs of high/low values for it), the reference to the full paper, etc.

          But we can iterate on the javadocs in the separate issue, too.

          Show
          Michael McCandless added a comment - I think deprecating DateTools makes sense, though, we should add simple code fragments showing the migration. NumericRangeQuery's javadocs are great, but I'd like to crispen it up, by decoupling "how you use it" from "how it's implemented". EG lead right off with a "for the impatient, this is how it's used", and then a separate section detailing how it works, what precisionStep means (and tradeoffs of high/low values for it), the reference to the full paper, etc. But we can iterate on the javadocs in the separate issue, too.
          Hide
          Uwe Schindler added a comment -

          What do you think about deprecating DateTools? I am not really sure. Maybe we should leave it (in contrast to NumberTools), but let the notice there, that it may be better to use NumericRangeQuery with the unix timestamp.

          The Javadocs are almost central, the entry point (linked from everywhere) is NumericRangeQuery. I only wanted to add a short note to package.html in analysis and search.

          Show
          Uwe Schindler added a comment - What do you think about deprecating DateTools? I am not really sure. Maybe we should leave it (in contrast to NumberTools), but let the notice there, that it may be better to use NumericRangeQuery with the unix timestamp. The Javadocs are almost central, the entry point (linked from everywhere) is NumericRangeQuery. I only wanted to add a short note to package.html in analysis and search.
          Hide
          Michael McCandless added a comment -

          Patch looks good Uwe! The only thing I think is missing is a single
          javadoc that shows the full usage of Numeric*, with code fragments.
          But, I think that should wait until we resolve the followon issues,
          here.

          I think the same, I should first resolve this and open some more issues

          Agreed, though I think some of these (NumericField, NumericSortField)
          are important to do for 2.9. Maybe others (adding support for the
          missing numeric types (byte & short)) can wait.

          Let's wrap this one up and move onto the next ones

          Show
          Michael McCandless added a comment - Patch looks good Uwe! The only thing I think is missing is a single javadoc that shows the full usage of Numeric*, with code fragments. But, I think that should wait until we resolve the followon issues, here. I think the same, I should first resolve this and open some more issues Agreed, though I think some of these (NumericField, NumericSortField) are important to do for 2.9. Maybe others (adding support for the missing numeric types (byte & short)) can wait. Let's wrap this one up and move onto the next ones
          Hide
          Uwe Schindler added a comment -

          re: NumericField - it wouldn't have back-compat issues, so it could be added any time - no need to link it to this issue or to rush it.

          I think the same, I should first resolve this and open some more issues

          Show
          Uwe Schindler added a comment - re: NumericField - it wouldn't have back-compat issues, so it could be added any time - no need to link it to this issue or to rush it. I think the same, I should first resolve this and open some more issues
          Hide
          Yonik Seeley added a comment -

          re: NumericField - it wouldn't have back-compat issues, so it could be added any time - no need to link it to this issue or to rush it.

          Show
          Yonik Seeley added a comment - re: NumericField - it wouldn't have back-compat issues, so it could be added any time - no need to link it to this issue or to rush it.
          Hide
          Uwe Schindler added a comment -

          Updated patch:

          • Remove ShiftAttribute
          • Rename TrieUtils and corresponding test to NumericUtils
          • Deprecate NumberTools (and added TODO hint to NumberUtils in spatial)
          • I was so free to also deprecate DateTools. maybe we keep this, but add a hint to better index the date as a unix timestamp. Any ideas?

          Still missing:

          • NumericField, I sleep one night more about it
          • [unneeded SortField factory, parsers] -> extra issue, maybe after 3.0
          Show
          Uwe Schindler added a comment - Updated patch: Remove ShiftAttribute Rename TrieUtils and corresponding test to NumericUtils Deprecate NumberTools (and added TODO hint to NumberUtils in spatial) I was so free to also deprecate DateTools. maybe we keep this, but add a hint to better index the date as a unix timestamp. Any ideas? Still missing: NumericField, I sleep one night more about it [unneeded SortField factory, parsers] -> extra issue, maybe after 3.0
          Hide
          Michael McCandless added a comment -

          OK let's open a new issue for how to best integrate/default SortField
          and FieldCache.

          Nevertheless, I would like to remove emphasis from NumericUtils (which is in realyity a helper class).

          +1

          For bytes, TrieRange is not very interesting, for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speedup searching, but limit index size a little bit.

          Well, a RangeQuery on a "plain text" byte or short field requires
          sneakiness (knowing that you must zero-pad; keeping
          document.NumberUtils around); I think it's best if NumericXXX in
          Lucene handles all of java's native numeric types. And you want a
          byte[] or short[] out of FieldCache (to not waste RAM having to
          upgrade to an int[]).

          We can do this under the (a?) new issue too...

          The SortField factory is then the only parts really needed in NumericUtils, but not really. The parser is a singleton, works for all trie fields and could also live somewhere else or nowhere at all, if the Parsers all stay in FieldCache.

          (Under a new issue, but...) I'm not really a fan of leaving the parser
          in FieldCache and expecting a user to "know" to create the SortField
          with that parser. NumericSortField would make it much more consumable
          to "direct" Lucene users.

          Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order.

          We can do this and deprecate the old one, but I added a note to Javadocs (see patch). I would do this outside of this issue.

          OK.

          One benefit of a rename is it's a reminder to users on upgrading to
          consider whether they should in fact switch to NumericRangeQuery.

          How about oal.util.NumericUtils instead of TrieUtils?

          That was my first idea, too. What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really goo NumberUtils from Solr (Yonik said, it was written at a very early stage, and is not effective with UTF-8 encoding and the TermEnum posioning with the term prefixes). It would be a index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all.

          +1 on both (if we can add byte/short to trie*); we should do this
          before 2.9 since we can still change locallucene's format. Maybe open
          a new issue for that, too? We're forking off new 2.9 issues left and
          right here!!

          I think, I remove the ShiftAttribute in complete, its really useless. Maybe, I add a getShift() method to NumericUtils, that returns the shift value of a Token/String. See java-dev mailing with Yonik.

          OK

          Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)?

          Field is final and so I must extend AbstractField. But some methods of Document return Field and not AbstractField.

          Can we just un-final Field?

          NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field.

          Maybe we should move this after the index-specific schemas and so on. Or document, that it can be only used for indexing.

          True, but we already have such "challenges" between index vs search
          time Document; documenting it it seems fine.

          By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index.

          I think this is great! I like that you return NumericTokenStream

          The only good thing of NumericField would be the possibility to automatically disable TF and Norms per default when indexing.

          Consumability (good defaults)! (And also not having to know that you
          must go and get a tokenStream from NumericUtils).

          Show
          Michael McCandless added a comment - OK let's open a new issue for how to best integrate/default SortField and FieldCache. Nevertheless, I would like to remove emphasis from NumericUtils (which is in realyity a helper class). +1 For bytes, TrieRange is not very interesting, for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speedup searching, but limit index size a little bit. Well, a RangeQuery on a "plain text" byte or short field requires sneakiness (knowing that you must zero-pad; keeping document.NumberUtils around); I think it's best if NumericXXX in Lucene handles all of java's native numeric types. And you want a byte[] or short[] out of FieldCache (to not waste RAM having to upgrade to an int[]). We can do this under the (a?) new issue too... The SortField factory is then the only parts really needed in NumericUtils, but not really. The parser is a singleton, works for all trie fields and could also live somewhere else or nowhere at all, if the Parsers all stay in FieldCache. (Under a new issue, but...) I'm not really a fan of leaving the parser in FieldCache and expecting a user to "know" to create the SortField with that parser. NumericSortField would make it much more consumable to "direct" Lucene users. Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order. We can do this and deprecate the old one, but I added a note to Javadocs (see patch). I would do this outside of this issue. OK. One benefit of a rename is it's a reminder to users on upgrading to consider whether they should in fact switch to NumericRangeQuery. How about oal.util.NumericUtils instead of TrieUtils? That was my first idea, too. What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really goo NumberUtils from Solr (Yonik said, it was written at a very early stage, and is not effective with UTF-8 encoding and the TermEnum posioning with the term prefixes). It would be a index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all. +1 on both (if we can add byte/short to trie*); we should do this before 2.9 since we can still change locallucene's format. Maybe open a new issue for that, too? We're forking off new 2.9 issues left and right here!! I think, I remove the ShiftAttribute in complete, its really useless. Maybe, I add a getShift() method to NumericUtils, that returns the shift value of a Token/String. See java-dev mailing with Yonik. OK Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)? Field is final and so I must extend AbstractField. But some methods of Document return Field and not AbstractField. Can we just un-final Field? NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field. Maybe we should move this after the index-specific schemas and so on. Or document, that it can be only used for indexing. True, but we already have such "challenges" between index vs search time Document; documenting it it seems fine. By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index. I think this is great! I like that you return NumericTokenStream The only good thing of NumericField would be the possibility to automatically disable TF and Norms per default when indexing. Consumability (good defaults)! (And also not having to know that you must go and get a tokenStream from NumericUtils).
          Hide
          Uwe Schindler added a comment -

          I think, I remove the ShiftAttribute in complete, its really useless. Maybe, I add a getShift() method to NumericUtils, that returns the shift value of a Token/String. See java-dev mailing with Yonik.

          Show
          Uwe Schindler added a comment - I think, I remove the ShiftAttribute in complete, its really useless. Maybe, I add a getShift() method to NumericUtils, that returns the shift value of a Token/String. See java-dev mailing with Yonik.
          Hide
          Uwe Schindler added a comment -

          Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)?

          The problem currently is:

          • Field is final and so I must extend AbstractField. But some methods of Document return Field and not AbstractField.
          • NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field.

          Maybe we should move this after the index-specific schemas and so on. Or document, that it can be only used for indexing.

          By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index.

          The only good thing of NumericField would be the possibility to automatically disable TF and Norms per default when indexing.

          Show
          Uwe Schindler added a comment - Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)? The problem currently is: Field is final and so I must extend AbstractField. But some methods of Document return Field and not AbstractField. NumericField would only work for indexing, but when retrieving from index (stored fields), it would change to Field. Maybe we should move this after the index-specific schemas and so on. Or document, that it can be only used for indexing. By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index. The only good thing of NumericField would be the possibility to automatically disable TF and Norms per default when indexing.
          Hide
          Uwe Schindler added a comment -

          This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser.

          Mike: This will apply to int/long/float/double as well right? How would you
          do this (require a parser for only numeric sorts) back-compatibly? EG,
          the others (String, DOC, etc.) don't require a parser.

          Yonik: Allow passing parser==null to get the default?

          We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser?

          A factory method TrieUtils.getSortField() could also return the right SortField.

          I want to move this into a new issue after, I will open one.

          Nevertheless, I would like to remove emphasis from NumericUtils (which is in realyity a helper class). So I want to make the current human-readable numeric parsers public and also add the trie parsers to FieldCache.

          The SortField factory is then the only parts really needed in NumericUtils, but not really. The parser is a singleton, works for all trie fields and could also live somewhere else or nowhere at all, if the Parsers all stay in FieldCache.

          Should we support byte/short for trie indexed fields as well? (Since SortField, FieldCache support these numeric types too...).

          For bytes, TrieRange is not very interesting, for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speedup searching, but limit index size a little bit.

          Could we change ShiftAttribute -> NumericShiftAttribute?

          No problem, I do this. There is also missing the link from the TokenStream in the javadocs to this, see also my reply in java-dev to Grants mail.

          Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order.

          We can do this and deprecate the old one, but I added a note to Javadocs (see patch). I would do this outside of this issue.

          How about oal.util.NumericUtils instead of TrieUtils?

          That was my first idea, too. What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really goo NumberUtils from Solr (Yonik said, it was written at a very early stage, and is not effective with UTF-8 encoding and the TermEnum posioning with the term prefixes). It would be a index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all.

          Show
          Uwe Schindler added a comment - This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. Mike: This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. Yonik: Allow passing parser==null to get the default? We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? A factory method TrieUtils.getSortField() could also return the right SortField. I want to move this into a new issue after, I will open one. Nevertheless, I would like to remove emphasis from NumericUtils (which is in realyity a helper class). So I want to make the current human-readable numeric parsers public and also add the trie parsers to FieldCache. The SortField factory is then the only parts really needed in NumericUtils, but not really. The parser is a singleton, works for all trie fields and could also live somewhere else or nowhere at all, if the Parsers all stay in FieldCache. Should we support byte/short for trie indexed fields as well? (Since SortField, FieldCache support these numeric types too...). For bytes, TrieRange is not very interesting, for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speedup searching, but limit index size a little bit. Could we change ShiftAttribute -> NumericShiftAttribute? No problem, I do this. There is also missing the link from the TokenStream in the javadocs to this, see also my reply in java-dev to Grants mail. Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order. We can do this and deprecate the old one, but I added a note to Javadocs (see patch). I would do this outside of this issue. How about oal.util.NumericUtils instead of TrieUtils? That was my first idea, too. What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really goo NumberUtils from Solr (Yonik said, it was written at a very early stage, and is not effective with UTF-8 encoding and the TermEnum posioning with the term prefixes). It would be a index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all.
          Hide
          Yonik Seeley added a comment -

          This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser.

          Allow passing parser==null to get the default?

          We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser?

          A factory method TrieUtils.getSortField() could also return the right SortField.

          Show
          Yonik Seeley added a comment - This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. Allow passing parser==null to get the default? We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? A factory method TrieUtils.getSortField() could also return the right SortField.
          Hide
          Michael McCandless added a comment -

          The only open point is the name of TrieUtils, any idea for package and/or name?

          I think NumericUtils? (I'd like the naming to be consistent w/
          NumericRangeQuery, NumericTokenStream, since it's very much a public
          API, ie users must interact directly with it to get their SortField
          (maybe) and FieldCache parser).

          Leaving it util seems OK, since it's used by analysis & searching.

          Show
          Michael McCandless added a comment - The only open point is the name of TrieUtils, any idea for package and/or name? I think NumericUtils? (I'd like the naming to be consistent w/ NumericRangeQuery, NumericTokenStream, since it's very much a public API, ie users must interact directly with it to get their SortField (maybe) and FieldCache parser). Leaving it util seems OK, since it's used by analysis & searching.
          Hide
          Michael McCandless added a comment -

          So one using new code must always specify the parser when using SortField.INT (SortField.AUTO is already deprectaed so no problem).

          This will apply to int/long/float/double as well right? How would you
          do this (require a parser for only numeric sorts) back-compatibly? EG,
          the others (String, DOC, etc.) don't require a parser.

          We could alternatively make NumericSortField (subclassing SortField),
          that just uses the right parser?

          Did you think about / decide against making a NumericField (that'd set
          the right tokenStream itself)?

          Other questions/comments:

          • Could we change ShiftAttribute -> NumericShiftAttribute?
          • How about oal.util.NumericUtils instead of TrieUtils?
          • Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to
            make it clear that its range checking is by Term sort order.
          • Should we support byte/short for trie indexed fields as well?
            (Since SortField, FieldCache support these numeric types too...).
          Show
          Michael McCandless added a comment - So one using new code must always specify the parser when using SortField.INT (SortField.AUTO is already deprectaed so no problem). This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)? Other questions/comments: Could we change ShiftAttribute -> NumericShiftAttribute? How about oal.util.NumericUtils instead of TrieUtils? Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order. Should we support byte/short for trie indexed fields as well? (Since SortField, FieldCache support these numeric types too...).
          Hide
          Uwe Schindler added a comment -

          Updated patch:

          • now with extended JavaDocs
          • additional tests for float/doubles
          • additional tests for equals/hashcode
          • changes.txt
          • lot of reformatting

          The only open point is the name of TrieUtils, any idea for package and/or name?

          Changes to FieldCache and SortField to always require a parser (see discussion with Yonik), which is a new case to be openend after this.

          Show
          Uwe Schindler added a comment - Updated patch: now with extended JavaDocs additional tests for float/doubles additional tests for equals/hashcode changes.txt lot of reformatting The only open point is the name of TrieUtils, any idea for package and/or name? Changes to FieldCache and SortField to always require a parser (see discussion with Yonik), which is a new case to be openend after this.
          Hide
          Uwe Schindler added a comment - - edited

          Here is a first patch, just for initial review of the API.

          • Only 5 new classes in core (NumericRangeQuery, NumericRangeFilter [each with factory for data types], NumericTokenStream [with one ctor using precStep, various setValue() methods], ShiftAttribute, TrieUtils)
          • TrieUtils still by that name, because a new name is not yet thought about (and how to deprecate NumberUtils from solr in contrib/spatial, deprecate o.a.l.document.NumberTools)
          • All tests rewritten & pass
          • Javadocs still incomplete and need to be rewritten (add package.html from contrib to search's/analysis' package.html and so on)

          You may ask, why I added these static factories to NumericRangeQuery/Filter, one could directly instantiate the class using any subclass of java.lang.Number? Because type safety to prevent people from doing things like new NumericRangeQuery(field,precStep,new Long(val1),new Float(val2)) which may lead to undefined behaviour. The second problem is missing type safety with auto-boxing in Java 5.

          Show
          Uwe Schindler added a comment - - edited Here is a first patch, just for initial review of the API. Only 5 new classes in core (NumericRangeQuery, NumericRangeFilter [each with factory for data types] , NumericTokenStream [with one ctor using precStep, various setValue() methods] , ShiftAttribute, TrieUtils) TrieUtils still by that name, because a new name is not yet thought about (and how to deprecate NumberUtils from solr in contrib/spatial, deprecate o.a.l.document.NumberTools) All tests rewritten & pass Javadocs still incomplete and need to be rewritten (add package.html from contrib to search's/analysis' package.html and so on) You may ask, why I added these static factories to NumericRangeQuery/Filter, one could directly instantiate the class using any subclass of java.lang.Number? Because type safety to prevent people from doing things like new NumericRangeQuery(field,precStep,new Long(val1),new Float(val2)) which may lead to undefined behaviour. The second problem is missing type safety with auto-boxing in Java 5.
          Hide
          Yonik Seeley added a comment -

          But on the other hand SortField.INT is also strongly linked to the plain text encoding of these tokens.

          Right - I agree that's not good, and SortField.INT can be misleading.

          Because of this, if we stay with SortField.INT and so on, I would tend to make the according Parser/FieldCache a required arg of SortField, defaulting to the current parsers for the deprecated backwards-compatibility.

          That makes sense. I think it also makes sense (in addition) to keep the factory-like method like TrieUtils.getSortField() that instantiates the right SortField for the user based on the trie params given (like precisionStep and friends).

          Show
          Yonik Seeley added a comment - But on the other hand SortField.INT is also strongly linked to the plain text encoding of these tokens. Right - I agree that's not good, and SortField.INT can be misleading. Because of this, if we stay with SortField.INT and so on, I would tend to make the according Parser/FieldCache a required arg of SortField, defaulting to the current parsers for the deprecated backwards-compatibility. That makes sense. I think it also makes sense (in addition) to keep the factory-like method like TrieUtils.getSortField() that instantiates the right SortField for the user based on the trie params given (like precisionStep and friends).
          Hide
          Uwe Schindler added a comment -

          use SortField.PREFIX_ENCODED_INT for the trie ones

          This needlessly couples the Trie stuff strongly to the SortField stuff. Something along the lines of the current TrieUtils.getIntSortField(fname, reverse) seems preferable.

          add TrieUtils.XxxParser to FieldCache (but accessible)

          The Trie parsers belong in the Trie class.

          re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*.

          I didn't understand that sentence.

          But on the other hand SortField.INT is also strongly linked to the plain text encoding of these tokens. My proposal was to unlink the index encoding of numeric data types from the sorting/field cache code and its constants. So it should not make a difference if you encoded the long using Integer.toString() or TrieUtils, in both cases, sorting code is identical, only the parser is different.

          Because of this, if we stay with SortField.INT and so on, I would tend to make the according Parser/FieldCache a required arg of SortField, defaulting to the current parsers for the deprecated backwards-compatibility.

          So one using new code must always specify the parser when using SortField.INT (SortField.AUTO is already deprectaed so no problem). The same with FieldCache: always specify the parser when getting an instance. For that the current default parsers should be made public accessible.

          As far as what package it makes sense to go in... what about an analysis.numeric package

          TrieUtils is used in analysis and searching, this is why I tend to util. The NumericTokenStream is in analysis (in my not-yet-realeased patch), ShiftAttribute in analysis.tokenattributes and TrieRangeQuery/Filter in search.

          As a general comment, moving TrieRange to core should be moving it to the core and perhaps renaming the classes if we can think of a better name. Some of the other stuff belongs in a different issue.

          I think this is correct. I will post a patch soon, that leaves TrieUtils alive.

          Show
          Uwe Schindler added a comment - use SortField.PREFIX_ENCODED_INT for the trie ones This needlessly couples the Trie stuff strongly to the SortField stuff. Something along the lines of the current TrieUtils.getIntSortField(fname, reverse) seems preferable. add TrieUtils.XxxParser to FieldCache (but accessible) The Trie parsers belong in the Trie class. re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*. I didn't understand that sentence. But on the other hand SortField.INT is also strongly linked to the plain text encoding of these tokens. My proposal was to unlink the index encoding of numeric data types from the sorting/field cache code and its constants. So it should not make a difference if you encoded the long using Integer.toString() or TrieUtils, in both cases, sorting code is identical, only the parser is different. Because of this, if we stay with SortField.INT and so on, I would tend to make the according Parser/FieldCache a required arg of SortField, defaulting to the current parsers for the deprecated backwards-compatibility. So one using new code must always specify the parser when using SortField.INT (SortField.AUTO is already deprectaed so no problem). The same with FieldCache: always specify the parser when getting an instance. For that the current default parsers should be made public accessible. As far as what package it makes sense to go in... what about an analysis.numeric package TrieUtils is used in analysis and searching, this is why I tend to util. The NumericTokenStream is in analysis (in my not-yet-realeased patch), ShiftAttribute in analysis.tokenattributes and TrieRangeQuery/Filter in search. As a general comment, moving TrieRange to core should be moving it to the core and perhaps renaming the classes if we can think of a better name. Some of the other stuff belongs in a different issue. I think this is correct. I will post a patch soon, that leaves TrieUtils alive.
          Hide
          Yonik Seeley added a comment -

          use SortField.PREFIX_ENCODED_INT for the trie ones

          This needlessly couples the Trie stuff strongly to the SortField stuff. Something along the lines of the current TrieUtils.getIntSortField(fname, reverse) seems preferable.

          add TrieUtils.XxxParser to FieldCache (but accessible)

          The Trie parsers belong in the Trie class.

          re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*.

          I didn't understand that sentence.

          As far as what package it makes sense to go in... what about an analysis.numeric package?

          As a general comment, moving TrieRange to core should be moving it to the core and perhaps renaming the classes if we can think of a better name. Some of the other stuff belongs in a different issue.

          Show
          Yonik Seeley added a comment - use SortField.PREFIX_ENCODED_INT for the trie ones This needlessly couples the Trie stuff strongly to the SortField stuff. Something along the lines of the current TrieUtils.getIntSortField(fname, reverse) seems preferable. add TrieUtils.XxxParser to FieldCache (but accessible) The Trie parsers belong in the Trie class. re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*. I didn't understand that sentence. As far as what package it makes sense to go in... what about an analysis.numeric package? As a general comment, moving TrieRange to core should be moving it to the core and perhaps renaming the classes if we can think of a better name. Some of the other stuff belongs in a different issue.
          Hide
          Uwe Schindler added a comment - - edited

          Here my own thoughts:

          But the latter can be refactored, to SortField.TRIE_XXX (not good name, as TRIE no longer used) and the parser instances can be added to FieldCache.

          From the API of FieldCache and sorting, in my opinion, it was always not a good idea, to link the encoding in index, to the impl everywhere.

          • deprecate SortField.INT and use SortField.PLAIN_TEXT_INT instead and so on
          • use SortField.PREFIX_ENCODED_INT for the trie ones (better name, this is the internal encoding name from TrieUtils)
          • the default parsers (private) in FieldCache renaming to also PlainText* (but accessible)
          • add TrieUtils.XxxParser to FieldCache (but accessible)
          • re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*.

          I know these are hard changes, but we had a lot of productivity in the past here (thanks Shai, Jason, Michael), so there are a lot of new APIs that are very much decoupled from the underlying encoding. This would again rename a lot of internal parts. But because of deprecation, this could be done in-line with Shai's and Michael's and Jason's changes here.

          (sorry for lots of edits in the mailing list, this is much productive thoughts)

          Show
          Uwe Schindler added a comment - - edited Here my own thoughts: But the latter can be refactored, to SortField.TRIE_XXX (not good name, as TRIE no longer used) and the parser instances can be added to FieldCache. From the API of FieldCache and sorting, in my opinion, it was always not a good idea, to link the encoding in index, to the impl everywhere. deprecate SortField.INT and use SortField.PLAIN_TEXT_INT instead and so on use SortField.PREFIX_ENCODED_INT for the trie ones (better name, this is the internal encoding name from TrieUtils) the default parsers (private) in FieldCache renaming to also PlainText* (but accessible) add TrieUtils.XxxParser to FieldCache (but accessible) re-use INT (and so on) in Sort and cache code, where the data type is meant (we already have this in lots of code around), but where the encoding is meant use PLAIN_TEXT_* vs. PREFIX_ENCODED_*. I know these are hard changes, but we had a lot of productivity in the past here (thanks Shai, Jason, Michael), so there are a lot of new APIs that are very much decoupled from the underlying encoding. This would again rename a lot of internal parts. But because of deprecation, this could be done in-line with Shai's and Michael's and Jason's changes here. (sorry for lots of edits in the mailing list, this is much productive thoughts)
          Hide
          Uwe Schindler added a comment -

          I am currently preparing a first patch for NumericRangeQuery-to-core.

          The class NumericUtils (former TrieUtils) should be in o.a.l.util or o.a.l.document? At the moment, the public part of this class is only interesting to retrieve Parsers or SortField instances. But the latter can be refactored, to SortField.TRIE_XXX (not good name, as TRIE no longer used) and the parser instances can be added to FieldCache.

          For indexing or querying it is not required for end users, one can use NumericTokenStream and NumericRangeQuery for all his needs.

          So NumberUtils is more internal than before.

          Any thoughts?

          Show
          Uwe Schindler added a comment - I am currently preparing a first patch for NumericRangeQuery-to-core. The class NumericUtils (former TrieUtils) should be in o.a.l.util or o.a.l.document? At the moment, the public part of this class is only interesting to retrieve Parsers or SortField instances. But the latter can be refactored, to SortField.TRIE_XXX (not good name, as TRIE no longer used) and the parser instances can be added to FieldCache. For indexing or querying it is not required for end users, one can use NumericTokenStream and NumericRangeQuery for all his needs. So NumberUtils is more internal than before. Any thoughts?
          Hide
          Michael McCandless added a comment -

          NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on.

          Could we also do this for a "term range"? Then, we could have a single RangeQuery that rewrites to the right impl based on what kind of range you are doing?

          (And in fact it could fold in FieldCacheRangeFilter too).

          Show
          Michael McCandless added a comment - NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on. Could we also do this for a "term range"? Then, we could have a single RangeQuery that rewrites to the right impl based on what kind of range you are doing? (And in fact it could fold in FieldCacheRangeFilter too).
          Hide
          Michael McCandless added a comment -

          In Solr there are three different impls:

          Trie (of course)
          Text-only numbers (do not work with range queries, but can be used for sorting etc.)
          A binary encoding (also used by LocalLucene at the moment), that is sortable. This can be used for RangeQueries, but sorting is slow (because they have no parser, and at the time it was implemented, SortField had no parser support)

          Ahh OK, this is just Solr's pre-existing numeric field support. (I
          had thought you meant Solr had a different impl for Trie).

          The problem, because of backwards compatibility they need to be preserved (possibility to read old indexes).

          This is indeed quite a challenge. Actually is there anything in Trie
          that encodes which version of the format is indexed in a given
          segment? (So that if we do every change the indexed format, we can
          bump a version somewhere to keep back compat).

          Maybe we use a static factory instead of same Ctor. By this the name is different, but it just creates the correct instance of always the same class: NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on. Same for the TokenStreams (and the Field?)

          That sounds like a good approach?

          > When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER
          > to your SortField

          Or add new SortField types.

          The problem with all this: For old indexes, we need some backwards compatibility. Ideally we would just create numeric fields in the new way and reuse e.g. SortField.INT for this. But this cannot be done. Or even, replace the FieldCache parsers by the trie ones. But this cannot be done at the moment.

          I wonder if we could handle this by adding a setting in FieldInfo?
          Ie, to record that "this numeric field was indexed as a trie". Then,
          when we need to get the parser for SortField.INT, we'd check the
          FieldInfo to see which parser to use. This could also handle
          back-compat, ie if we change the trie format being written we'd change
          the setting and segment merging would gradually uprade previously
          indexed fields.

          > I'd also like to rename RangeQuery to something else, with this
          > change. EG TermRangeQuery... to emphasize that you use it for
          > non-numbers. The javadocs of TermRangeQuery should point to
          > Int/LongRangeQuery as strongly preferred for numeric ranges.

          Cool. For the others, too (FieldCacheRangeQuery).

          Yes.

          Show
          Michael McCandless added a comment - In Solr there are three different impls: Trie (of course) Text-only numbers (do not work with range queries, but can be used for sorting etc.) A binary encoding (also used by LocalLucene at the moment), that is sortable. This can be used for RangeQueries, but sorting is slow (because they have no parser, and at the time it was implemented, SortField had no parser support) Ahh OK, this is just Solr's pre-existing numeric field support. (I had thought you meant Solr had a different impl for Trie). The problem, because of backwards compatibility they need to be preserved (possibility to read old indexes). This is indeed quite a challenge. Actually is there anything in Trie that encodes which version of the format is indexed in a given segment? (So that if we do every change the indexed format, we can bump a version somewhere to keep back compat). Maybe we use a static factory instead of same Ctor. By this the name is different, but it just creates the correct instance of always the same class: NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on. Same for the TokenStreams (and the Field?) That sounds like a good approach? > When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER > to your SortField Or add new SortField types. The problem with all this: For old indexes, we need some backwards compatibility. Ideally we would just create numeric fields in the new way and reuse e.g. SortField.INT for this. But this cannot be done. Or even, replace the FieldCache parsers by the trie ones. But this cannot be done at the moment. I wonder if we could handle this by adding a setting in FieldInfo? Ie, to record that "this numeric field was indexed as a trie". Then, when we need to get the parser for SortField.INT, we'd check the FieldInfo to see which parser to use. This could also handle back-compat, ie if we change the trie format being written we'd change the setting and segment merging would gradually uprade previously indexed fields. > I'd also like to rename RangeQuery to something else, with this > change. EG TermRangeQuery... to emphasize that you use it for > non-numbers. The javadocs of TermRangeQuery should point to > Int/LongRangeQuery as strongly preferred for numeric ranges. Cool. For the others, too (FieldCacheRangeQuery). Yes.
          Hide
          Uwe Schindler added a comment -

          (Aside: I just noticed the code fragment in the javadocs for
          LongTrieTokenStream won't compile, because the setValue method is not
          available for TokenStream; the stream should be defined as
          LongTrieTokenStream, I think?; same with IntTrieTokenStream)

          I fixed this Thanks!

          If we rename the classes, should Solr stay with Trie (because there are different impls)?

          Well, Solr should decide

          But: why are there different impls for Solr?

          I only added this here, to know, that Solr already started to implement this. In Solr there are three different impls:

          • Trie (of course)
          • Text-only numbers (do not work with range queries, but can be used for sorting etc.)
          • A binary encoding (also used by LocalLucene at the moment), that is sortable. This can be used for RangeQueries, but sorting is slow (because they have no parser, and at the time it was implemented, SortField had no parser support)

          The problem, because of backwards compatibility they need to be preserved (possibility to read old indexes).

          I think separate classes for int, long, float, double is better.

          Two more... The problem, all these classes have exact the same impl internally and this is code duplication and hard to maintain. Maybe we use a static factory instead of same Ctor. By this the name is different, but it just creates the correct instance of always the same class: NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on. Same for the TokenStreams (and the Field?)

          Ideally, one would simply use, say, LongNumericField (subclass of
          AbstractField) at indexing time, Lucene would "remember" this
          in the index (this is obviously missing today), and then when you
          sort, retrieve value, and create queries from QueryParser, all these
          places would "know" that this is a trie field and simply do the right
          thing, by default.

          For that we need the type information in the index and for that the new Field/Document classes. Hopefully Michael will get this working soonly.

          When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER
          to your SortField

          Or add new SortField types.

          The problem with all this: For old indexes, we need some backwards compatibility. Ideally we would just create numeric fields in the new way and reuse e.g. SortField.INT for this. But this cannot be done. Or even, replace the FieldCache parsers by the trie ones. But this cannot be done at the moment.

          I'd also like to rename RangeQuery to something else, with this
          change. EG TermRangeQuery... to emphasize that you use it for
          non-numbers. The javadocs of TermRangeQuery should point to
          Int/LongRangeQuery as strongly preferred for numeric ranges.

          Cool. For the others, too (FieldCacheRangeQuery).

          There is a lot more to decide, I will keep this issue open a little bit before starting to work to collect ideas!

          Show
          Uwe Schindler added a comment - (Aside: I just noticed the code fragment in the javadocs for LongTrieTokenStream won't compile, because the setValue method is not available for TokenStream; the stream should be defined as LongTrieTokenStream, I think?; same with IntTrieTokenStream) I fixed this Thanks! If we rename the classes, should Solr stay with Trie (because there are different impls)? Well, Solr should decide But: why are there different impls for Solr? I only added this here, to know, that Solr already started to implement this. In Solr there are three different impls: Trie (of course) Text-only numbers (do not work with range queries, but can be used for sorting etc.) A binary encoding (also used by LocalLucene at the moment), that is sortable. This can be used for RangeQueries, but sorting is slow (because they have no parser, and at the time it was implemented, SortField had no parser support) The problem, because of backwards compatibility they need to be preserved (possibility to read old indexes). I think separate classes for int, long, float, double is better. Two more... The problem, all these classes have exact the same impl internally and this is code duplication and hard to maintain. Maybe we use a static factory instead of same Ctor. By this the name is different, but it just creates the correct instance of always the same class: NumericRangeQuery.newFloatRange(Float a, Float b, precisionStep) and so on. Same for the TokenStreams (and the Field?) Ideally, one would simply use, say, LongNumericField (subclass of AbstractField) at indexing time, Lucene would "remember" this in the index (this is obviously missing today), and then when you sort, retrieve value, and create queries from QueryParser, all these places would "know" that this is a trie field and simply do the right thing, by default. For that we need the type information in the index and for that the new Field/Document classes. Hopefully Michael will get this working soonly. When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER to your SortField Or add new SortField types. The problem with all this: For old indexes, we need some backwards compatibility. Ideally we would just create numeric fields in the new way and reuse e.g. SortField.INT for this. But this cannot be done. Or even, replace the FieldCache parsers by the trie ones. But this cannot be done at the moment. I'd also like to rename RangeQuery to something else, with this change. EG TermRangeQuery... to emphasize that you use it for non-numbers. The javadocs of TermRangeQuery should point to Int/LongRangeQuery as strongly preferred for numeric ranges. Cool. For the others, too (FieldCacheRangeQuery). There is a lot more to decide, I will keep this issue open a little bit before starting to work to collect ideas!
          Hide
          Michael McCandless added a comment -

          I want to move it to core before release of 2.9

          +1!

          There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how should they be called in core?

          I prefer to not use "trie" in the names (package and classes)... that
          term very much describes what's under-the-hood in these classes (how
          they are implemented), whereas I think [generally] names should
          describe how the class is intended to be used. So I prefer
          "Long[Numeric]RangeQuery" over "LongTrieRangeQuery".

          I'd also like to rename RangeQuery to something else, with this
          change. EG TermRangeQuery... to emphasize that you use it for
          non-numbers. The javadocs of TermRangeQuery should point to
          Int/LongRangeQuery as strongly preferred for numeric ranges.

          Maybe the pairs of classes for indexing and searching should be moved into one class

          I think separate classes for int, long, float, double is better.

          TrieUtils move into o.a.l.util? or document or?

          Maybe document?

          Move TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes?

          That sounds good?

          If we rename the classes, should Solr stay with Trie (because there are different impls)?

          Well, Solr should decide

          But: why are there different impls for Solr?

          Maybe add a subclass of AbstractField, that automatically creates these TokenStreams and omits norms/tf per default for easier addition to Document instances?

          +1

          For a numeric field where one will sort or do range filtering, Trie*
          ought to be the default. But, unfortunately, the steps needed to make
          use of Trie* are numerous:

          • Add your field to your doc with the LongTrieTokenStream
          • When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER
            to your SortField
          • When you want to filter by range, instantiate
            LongTrieRangeFilter. You'll have to subclass QueryParser to do
            this for the right fields.
          • When you want to display values, you must also pass the trie parser
            when populating the FieldCache

          Ideally, one would simply use, say, LongNumericField (subclass of
          AbstractField) at indexing time, Lucene would "remember" this
          in the index (this is obviously missing today), and then when you
          sort, retrieve value, and create queries from QueryParser, all these
          places would "know" that this is a trie field and simply do the right
          thing, by default.

          (Aside: I just noticed the code fragment in the javadocs for
          LongTrieTokenStream won't compile, because the setValue method is not
          available for TokenStream; the stream should be defined as
          LongTrieTokenStream, I think?; same with IntTrieTokenStream)

          Show
          Michael McCandless added a comment - I want to move it to core before release of 2.9 +1! There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how should they be called in core? I prefer to not use "trie" in the names (package and classes)... that term very much describes what's under-the-hood in these classes (how they are implemented), whereas I think [generally] names should describe how the class is intended to be used. So I prefer "Long [Numeric] RangeQuery" over "LongTrieRangeQuery". I'd also like to rename RangeQuery to something else, with this change. EG TermRangeQuery... to emphasize that you use it for non-numbers. The javadocs of TermRangeQuery should point to Int/LongRangeQuery as strongly preferred for numeric ranges. Maybe the pairs of classes for indexing and searching should be moved into one class I think separate classes for int, long, float, double is better. TrieUtils move into o.a.l.util? or document or? Maybe document? Move TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? That sounds good? If we rename the classes, should Solr stay with Trie (because there are different impls)? Well, Solr should decide But: why are there different impls for Solr? Maybe add a subclass of AbstractField, that automatically creates these TokenStreams and omits norms/tf per default for easier addition to Document instances? +1 For a numeric field where one will sort or do range filtering, Trie* ought to be the default. But, unfortunately, the steps needed to make use of Trie* are numerous: Add your field to your doc with the LongTrieTokenStream When you want to sort, pass the TrieUtils.FIELD_CACHE_LONG_PARSER to your SortField When you want to filter by range, instantiate LongTrieRangeFilter. You'll have to subclass QueryParser to do this for the right fields. When you want to display values, you must also pass the trie parser when populating the FieldCache Ideally, one would simply use, say, LongNumericField (subclass of AbstractField) at indexing time, Lucene would "remember" this in the index (this is obviously missing today), and then when you sort, retrieve value, and create queries from QueryParser, all these places would "know" that this is a trie field and simply do the right thing, by default. (Aside: I just noticed the code fragment in the javadocs for LongTrieTokenStream won't compile, because the setValue method is not available for TokenStream; the stream should be defined as LongTrieTokenStream, I think?; same with IntTrieTokenStream)
          Hide
          Paul Elschot added a comment -

          From a bit of a distance:

          You could consider putting everything in o.a.l.trie .

          I'd prefer to have explicit class names containing Long, Int etc, and also containing Trie.

          I don't know the details of the tokenizing, but AbstractTrieField sounds just right.

          Show
          Paul Elschot added a comment - From a bit of a distance: You could consider putting everything in o.a.l.trie . I'd prefer to have explicit class names containing Long, Int etc, and also containing Trie. I don't know the details of the tokenizing, but AbstractTrieField sounds just right.
          Hide
          Earwin Burrfoot added a comment -

          Sudden thought. Leave it in contribs, you won't be bound by any other back-compat policies besides common sense.

          Show
          Earwin Burrfoot added a comment - Sudden thought. Leave it in contribs, you won't be bound by any other back-compat policies besides common sense.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development