Lucene - Core
  1. Lucene - Core
  2. LUCENE-2354

Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      After LUCENE-2302, we should use TermToBytesRefAttribute to index using NumericTokenStream. This also should convert the whole NumericUtils to use BytesRef when converting numerics.

      1. LUCENE-2354.patch
        71 kB
        Uwe Schindler
      2. LUCENE-2354.patch
        66 kB
        Uwe Schindler
      3. LUCENE-2354.patch
        38 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Here a first preview patch.

          NumericUtils still contains lots of unused String-based methods, I think we should remove them, the class is expert-only and also experimental. Backwards compatibility is broken even with those backwards layers (as the split functions were changed to use BytesRefs. Also these backwards methods are simply slow now (as the byte[] is copied to char[] and vice-versa).

          The new NumericTokenStream now uses a special NumericTermAttribute, so possibly Filters coming later have access to shift value and so on. This attribute also implements the TermToBytesRefAttribute for the indexer. Please note: This attribute is a hack and does not support copyTo/clone/...., so you cannot put away tokens (which is not needed), but its still possible to add further attributes to numeric tokens (which is why the attribute is there).

          The NumericTokenStream backwards test was removed, because the new stream does no longer contain a TermAttribute, so the test always fails.

          TODO: A better inline-hashCode generation for the numeric-to-BytesRef transformation

          Show
          Uwe Schindler added a comment - Here a first preview patch. NumericUtils still contains lots of unused String-based methods, I think we should remove them, the class is expert-only and also experimental. Backwards compatibility is broken even with those backwards layers (as the split functions were changed to use BytesRefs. Also these backwards methods are simply slow now (as the byte[] is copied to char[] and vice-versa). The new NumericTokenStream now uses a special NumericTermAttribute, so possibly Filters coming later have access to shift value and so on. This attribute also implements the TermToBytesRefAttribute for the indexer. Please note: This attribute is a hack and does not support copyTo/clone/...., so you cannot put away tokens (which is not needed), but its still possible to add further attributes to numeric tokens (which is why the attribute is there). The NumericTokenStream backwards test was removed, because the new stream does no longer contain a TermAttribute, so the test always fails. TODO: A better inline-hashCode generation for the numeric-to-BytesRef transformation
          Hide
          Michael McCandless added a comment -

          NumericUtils still contains lots of unused String-based methods, I think we should remove them

          +1

          Patch looks good!

          NumericFields are the first thing to index terms directly as byte[]
          (ie not first going through char[]) in flex. But the encoding is
          unchanged right? (Ie only using 7 bits per byte, same as trunk).

          And you cutover to BytesRef TermsEnum API too – great. Presumably
          search perf would improve but only a tiny bit since NRQ visits so few
          terms?

          Show
          Michael McCandless added a comment - NumericUtils still contains lots of unused String-based methods, I think we should remove them +1 Patch looks good! NumericFields are the first thing to index terms directly as byte[] (ie not first going through char[]) in flex. But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk). And you cutover to BytesRef TermsEnum API too – great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?
          Hide
          Uwe Schindler added a comment - - edited

          But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk).

          Yes. And i think we should keep it for now using 7 bit. Problems start when the sort order of terms is needed (which is the case for NRQ). As default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits?

          By the way, the recently added backwards test checks that an old index with NumericField behaves as before! This is why I added a new zip file to TestBackwardCompatibility.

          And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?

          I dont think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is more clean. In principle we could remove the whole char[]/String based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as no longer applies then, too).

          Show
          Uwe Schindler added a comment - - edited But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk). Yes. And i think we should keep it for now using 7 bit. Problems start when the sort order of terms is needed (which is the case for NRQ). As default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits? By the way, the recently added backwards test checks that an old index with NumericField behaves as before! This is why I added a new zip file to TestBackwardCompatibility. And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms? I dont think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is more clean. In principle we could remove the whole char[]/String based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as no longer applies then, too).
          Hide
          Uwe Schindler added a comment -

          Will work here the next days and rewrite the tests.

          One problem: Solr at the moment uses the deprecated string api for building a TermQuery. This should be replaced by a NRQ with upper==lower(inclusive), as this disables scoring, which is wrong for Numeric fields.

          Show
          Uwe Schindler added a comment - Will work here the next days and rewrite the tests. One problem: Solr at the moment uses the deprecated string api for building a TermQuery. This should be replaced by a NRQ with upper==lower(inclusive), as this disables scoring, which is wrong for Numeric fields.
          Hide
          Uwe Schindler added a comment -

          Here updated patch with cleaned up NumericUtils (no String methods anymore). For now I just commented them out, if we want to reactivate parts of them. Before release the methods should be removed.

          I changed all tests (and deactivated tests in backwards) using those String methods. Also rewrote the CartesianShapeFilter in contrib/spatial to use flex API (optimized for the one-term-case without OpenBitSet allocation). Also changed spatial tests to use NumericField.

          Show
          Uwe Schindler added a comment - Here updated patch with cleaned up NumericUtils (no String methods anymore). For now I just commented them out, if we want to reactivate parts of them. Before release the methods should be removed. I changed all tests (and deactivated tests in backwards) using those String methods. Also rewrote the CartesianShapeFilter in contrib/spatial to use flex API (optimized for the one-term-case without OpenBitSet allocation). Also changed spatial tests to use NumericField.
          Hide
          Michael McCandless added a comment -

          Patch looks good Uwe!

          Show
          Michael McCandless added a comment - Patch looks good Uwe!
          Hide
          Uwe Schindler added a comment -

          Updated patch with lots of javadocs cleanups and new getPrefixCodedXxxShift() methods. Also optimized some methods.

          I will commit this tomorrow!

          Show
          Uwe Schindler added a comment - Updated patch with lots of javadocs cleanups and new getPrefixCodedXxxShift() methods. Also optimized some methods. I will commit this tomorrow!
          Hide
          Uwe Schindler added a comment -

          Committed revision: 930821

          Show
          Uwe Schindler added a comment - Committed revision: 930821

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development