Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      According to the thread on java-dev (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to include my fast numerical range query implementation in Lucene contrib-queries.

      I implemented (based on RangeFilter) another approach for faster
      RangeQueries, based on longs stored in the index in a special format.

      The idea is to store the longs in the index at different precisions and to
      partition the query range in such a way that the outer boundaries are
      searched using terms of the highest precision, while the center of the
      range is covered by terms of lower precision. The implementation stores the
      longs at 8 different precisions (using a class called TrieUtils). It also
      supports doubles, using the IEEE 754 floating-point "double format" bit
      layout with some bit mappings to make them binary sortable. The approach is
      used in rather big indexes; query times are well below 100 ms even on
      low-performance desktop computers, for very large ranges on indexes with
      500,000 docs.
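
      As a side note, one common way to make doubles binary sortable via the IEEE 754 bit layout looks roughly like the sketch below; this only illustrates the general technique and is not necessarily the exact bit mapping used by TrieUtils:

        public static long doubleToSortableLong(double value) {
          long bits = Double.doubleToLongBits(value);
          if (bits < 0) {
            // negative doubles: flip all bits so larger magnitudes sort lower
            bits = ~bits;
          } else {
            // positive doubles (and +0.0): set the sign bit so they sort above all negatives
            bits |= 0x8000000000000000L;
          }
          // interpret the result as an unsigned 64-bit value when comparing or encoding terms
          return bits;
        }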

      I called this RangeQuery variant and index format "TrieRange" query because
      the idea resembles the well-known trie structures (it is not identical to a
      real trie, but the algorithms are related).

      1. LUCENE-1470.patch
        32 kB
        Uwe Schindler
      2. LUCENE-1470.patch
        39 kB
        Uwe Schindler
      3. LUCENE-1470.patch
        40 kB
        Uwe Schindler
      4. LUCENE-1470.patch
        48 kB
        Uwe Schindler
      5. LUCENE-1470.patch
        51 kB
        Uwe Schindler
      6. LUCENE-1470.patch
        56 kB
        Uwe Schindler
      7. LUCENE-1470.patch
        56 kB
        Uwe Schindler
      8. fixbuild-LUCENE-1470.patch
        0.7 kB
        Uwe Schindler
      9. fixbuild-LUCENE-1470.patch
        2 kB
        Uwe Schindler
      10. LUCENE-1470-readme.patch
        2 kB
        Uwe Schindler
      11. TrieUtils.java
        3 kB
        Yonik Seeley
      12. TrieUtils.java
        8 kB
        Uwe Schindler
      13. TrieUtils.java
        9 kB
        Uwe Schindler
      14. TrieRangeFilter.java
        11 kB
        Uwe Schindler
      15. TrieUtils.java
        11 kB
        Uwe Schindler
      16. TrieUtils.java
        13 kB
        Yonik Seeley
      17. trie.zip
        9 kB
        Uwe Schindler
      18. LUCENE-1470-revamp.patch
        115 kB
        Uwe Schindler
      19. LUCENE-1470-revamp.patch
        149 kB
        Uwe Schindler
      20. LUCENE-1470-revamp.patch
        153 kB
        Uwe Schindler
      21. LUCENE-1470-apichange.patch
        8 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          First version of a patch for contrib-queries. It includes my classes for the package o.a.l.search.trie. The TrieRangeFilter was refactored into a separate class (in contrast to my original implementation). The class is Java 1.4 compatible (like contrib-queries). The JavaDocs were updated and a general information page for the package was added.

          A first test case for the trie-encoded values was added; the filter and query tests still need to be written.

          Uwe Schindler added a comment -

          See also the discussion in LUCENE-1461, as it inspired me to contribute my code to contrib.

          Uwe Schindler added a comment -

          fixed patch

          Earwin Burrfoot added a comment (edited) -

          Lucene (at least version 2.3.2) cannot index Strings containing the character 0xFFFF, as it is used internally as an EOS marker.
          I've implemented a binary encoding similar to yours, but after hitting this issue with some converted Date, I switched to a base-2^15 encoding, leaving the high bit always zero.

            public String toLucene(Long n) {
              if (n == null) {
                return null;
              }

              // shift into unsigned order so the encoded chars sort correctly
              long u = n + Long.MAX_VALUE + 1;

              // four chars of 15 bits each plus one char for the remaining 4 bits
              return new String(new char[]{
                  (char) ((u & 0xFFFE000000000000L) >> 49),
                  (char) ((u & 0x0001FFFC00000000L) >> 34),
                  (char) ((u & 0x00000003FFF80000L) >> 19),
                  (char) ((u & 0x000000000007FFF0L) >> 4),
                  (char) ((u & 0x000000000000000FL) << 11),
              });
            }

            public Long fromLucene(String string) {
              if (string == null) {
                return null;
              }

              // reassemble the five chars and undo the unsigned shift
              return (((long) string.charAt(0)) << 49 |
                      ((long) string.charAt(1)) << 34 |
                      ((long) string.charAt(2)) << 19 |
                      ((long) string.charAt(3)) << 4 |
                      ((long) string.charAt(4)) >> 11
              ) - Long.MAX_VALUE - 1;
            }
          
          Earwin Burrfoot added a comment -

          I read your code more closely. Silly me, I assumed you went with base-2^16.

          Uwe Schindler added a comment -

          No problem, I chose the two-chars-per-byte approach for two reasons:

          a) It is simpler to generate lower-precision terms (for 1 to 8 bytes of precision): you just remove two chars per byte. In base 2^15 you only get 4 precisions plus some extra bits, and the number of terms per precision in my special TrieRangeFilter would not be 256 values but 32768, making it slower. Another possibility would be to cast each byte to one char (plus an offset), but see the next point:
          b) Java sometimes has strange string comparisons, and I did not want to run into incompatibilities with String.compareTo().

          But one other point you noted in the other JIRA issue hurts me: I have not tried sorting results on my combined, prefixed field for a long time, and you are right, it is not possible if all the different precisions are in the same field. An earlier version of TrieRangeQuery used a suffix appended to the field name, automatically added by TrieUtils.addXXXXTrieDocumentField. I removed this about two years ago, but have never tried to sort a numerical field since then. I think I have to write a bug report for my own project. The problem is that if this is included in contrib and I make my project depend on this contrib package, the index format of panFMP may change, which is not good.

          Just one question for the other developers: why is it not possible to sort by a field with more than one term per document, if you restrict FieldCache to use only the first term added to the document as the sort key?

          Earwin Burrfoot added a comment -

          In base 2^15 you only get 4 precisions plus some extra bits

          We have slightly different approaches: yours is universal, mine requires hand-tuning. I have an abstract FastRangeFilter which I extend, specifying functions that lower the precision before encoding, so I can have any required number of type/field-dependent precision levels. For ease of use it is extended by FastDateRangeFilter, which accepts an array of date parts, like {HOUR_OF_DAY, DAY_OF_MONTH}.
          That allows me to exploit known statistical properties of my requests/data; for example, most date ranges are rolling day/week/month/3-month windows, or salaries, which tend to cluster around certain values.

          Java sometimes has strange string comparisons, and I did not want to run into incompatibilities with String.compareTo()

          The fact that Java strings are UCS-2 (UTF-16 minus the special non-16-bit characters) is written into the Java language specification, so you can trust String.compareTo() - anything that blows up there is not Java(tm). Problems usually come from within libraries.

          I have not tried sorting results on my combined, prefixed field for a long time, and you are right, it is not possible if all the different precisions are in the same field.

          Actually, right now I'm stealing, er, borrowing your idea of storing the various precisions in the same field. I have my own custom field cache, so the broken sort can be fixed easily. Having everything in the same field allows you to do exactly one pass through TermEnum/TermDocs, and that is what I'm going to exploit.

          Uwe Schindler added a comment -

          You are right, my intention was to make it universal by casting everything to an unsigned long. On the sorting problem:
          With standard Lucene you can only sort if there is one term per field per doc. I am currently testing my code after modifying it to use a more compact version of the encoding:

          a) I no longer encode half bytes; I encode a full byte per char. I still have 8 precisions, but the code gets simpler and terms are 8 chars long after encoding. After reading the docs of String.compareTo() again, I understood the "Java way" of binary sort order and verified it with my test cases.

          b) The full precision is stored under the user-given document field name; all 7 lower precisions are prefixed (as before) and put into a "helper" field with the suffix "#trie" appended to the name. This helper field is only indexed, not stored, and is only used by TrieRangeFilter for the trie algorithm.

          For my project panFMP this is not such a big problem: I can inform the "few" users (I know all of them) and ask them to rebuild their indexes (a tool is available for that). But the encoding format of this contrib package should be fixed and discussed before it is released!
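
          For illustration, a minimal sketch of the two-field layout described in (b). The encode() helper below only stands in for the patch's trie encoding; the precision marker char and the hex rendering are illustrative assumptions, not the actual patch code:

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            public class TrieFieldLayoutSketch {

              public static Document makeDoc(String name, long value) {
                Document doc = new Document();
                // full precision under the original field name: exactly one term per doc, so sorting keeps working
                doc.add(new Field(name, encode(value, 8), Field.Store.NO, Field.Index.NOT_ANALYZED));
                // the 7 lower precisions go into the indexed-only helper field "<name>#trie"
                for (int precision = 1; precision <= 7; precision++) {
                  doc.add(new Field(name + "#trie", encode(value, precision),
                      Field.Store.NO, Field.Index.NOT_ANALYZED));
                }
                return doc;
              }

              // stand-in encoding: a precision marker char followed by the 'precision' most
              // significant bytes of the sign-flipped value as fixed-length lowercase hex
              static String encode(long value, int precision) {
                long u = value ^ 0x8000000000000000L; // flip the sign bit for unsigned ordering
                StringBuffer sb = new StringBuffer();
                sb.append((char) ('a' + precision));
                for (int i = 0; i < precision; i++) {
                  int b = (int) ((u >>> (56 - 8 * i)) & 0xFF);
                  if (b < 0x10) sb.append('0');
                  sb.append(Integer.toHexString(b));
                }
                return sb.toString();
              }
            }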

          Earwin Burrfoot added a comment -

          But the encoding format of this contrib package should be fixed and discussed before it is released!

          Make it pluggable? I keep my encoding separate from the range filters and the precision loss; you can specify which fields to use, etc. Anyway, people can always copy your code and tweak it to their specific needs. For contribs, IMHO, the idea is much more valuable than the actual implementation details.

          Michael McCandless added a comment -

          > Why is it not possible, to sort by a field with more than one
          > term/doc, if you would restrict this to only use the first added term
          > to the document as sort key in FieldCache?

          This was discussed, but not resolved, in LUCENE-1372.

          If we wanted to allow "use the first term" as a policy, we'd have to
          fix how FieldCacheImpl computes the StringIndex: currently it gets a
          TermEnum, walks through all terms, for each term walks through all
          docs, and records docID->termNumber as well as termNumber->termText
          arrays, un-inverting the field. I guess we could get a TermPositions,
          instead, and only pay attention to position=0 terms, and then only
          increment t if there were any docs with position=0 occurrences of the
          term.

          Show
          Michael McCandless added a comment - > Why is it not possible, to sort by a field with more than one > term/doc, if you would restrict this to only use the first added term > to the document as sort key in FieldCache? This was discussed, but not resolved, in LUCENE-1372 . If we wanted to allow "use the first term" as a policy, we'd have to fix how FieldCacheImpl computes the StringIndex: currently it gets a TermEnum, walks through all terms, for each term walks through all docs, and records docID->termNumber as well as termNumber->termText arrays, un-inverting the field. I guess we could get a TermPositions, instead, and only pay attention to position=0 terms, and then only increment t if there were any docs with position=0 occurrences of the term.
          Hide
          Michael McCandless added a comment -

          It seems like there are two things you might want to make "pluggable":

          • Which hierarchical ranges to create & logic that creates them for
            a given value. For certain types (dates) I think there's one
            obvious set of ranges (year, quarter, month, day, etc.). For
            something like "price" it's probably more app dependent, eg if you
            present certain filters/facets to the user.
          • What label is attached to each range, and the corresponding decoder
            to map that label back into its upper/lower values.

          So we could provide good defaults for eg Date. The labels would
          ideally be things like 2007, 2008 and "Jan 2008", "Feb 2008", etc. I
          think? Hmm – except for the need to sort them "properly".

          We really should change Lucene's terms so that they are opaque objects
          (as KS/Lucy is doing). This way you could store a Term like "Feb
          2008" in the index, along with details like "this is a generated
          ranged datetime term" and the numeric date details ie 02/2008, and
          you'd just have to export a compareTo to Lucene.

          Uwe Schindler added a comment -

          An updated version of the patch. The format of the encoded terms is incompatible with the previous patch.

          Changed:

          • Each byte of the 8-byte long is encoded as 1 char (see the comments above) -> more compact.
          • Each byte is shifted by 0x30 when converted to a char (e.g. the value 0x05 is stored as the char 0x35); see the sketch below.
          • The lower-precision terms are stored in a separate field (the original field name + "#trie"), which makes it possible to sort on the original field. This helper field is only indexed. The prefix for the lower precisions starts at the char 0x20; by separating the prefix and data ranges, the helper field and the original field could later be merged easily, because the lower precisions would be listed before the higher precisions and the full precision in the ordered term list.
          • Fixed a small issue where a term was checked twice (between two ranges).
          • Added a test for TrieRangeQuery with a fully bounded and a half-open range.

          I will say something about Mike's suggestions tomorrow, it is too late now!
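
          For illustration, here is a minimal sketch of the one-char-per-byte encoding described above (this is not the patch code: the lower-precision prefixes are omitted, and the sign-bit flip used to make signed longs sort as unsigned values is an assumption):

            static String longToTrieString(long value) {
              // flip the sign bit so signed longs compare correctly as unsigned values
              long u = value ^ 0x8000000000000000L;
              char[] chars = new char[8];
              for (int i = 0; i < 8; i++) {
                // most significant byte first; adding 0x30 turns e.g. the byte 0x05 into the char 0x35
                chars[i] = (char) (((u >>> (56 - 8 * i)) & 0xFFL) + 0x30);
              }
              return new String(chars);
            }

            static long trieStringToLong(String s) {
              long u = 0L;
              for (int i = 0; i < 8; i++) {
                u = (u << 8) | (s.charAt(i) - 0x30L);
              }
              return u ^ 0x8000000000000000L; // undo the sign-bit flip
            }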

          Uwe Schindler added a comment -

          I wanted to go to bed, but it is the same every time: I cannot sleep.

          The discussions about TrieRangeQuery are the same as two years ago, when I prepared the original paper cited in the documentation with my mentor/colleague Michael Diepenbroek (the second author on the paper). Here are my comments on the suggestions from Mike McCandless and Earwin, after looking through my notes:

          When looking at the range-splitting algorithm for the first time, the ideas of Mike and Earwin are very natural: you are both interested in dates, and both of you have the same idea of how to divide them into parts with different "precisions". For dates it is rather simple: if you have a range between March 1999 and August 2006, the two precisions "full years" and "months" are very easy to understand when creating the ranges. You would have three range queries: Mar 1999 ... Dec 1999, 2000 ... 2005, Jan 2006 ... Aug 2006. You have to visit 10+6+8=24 terms. Simple. The maximum number of terms would be 12+12+numberOfTotalPossibleYears. This is easy to understand and very effective.

          My approach does the same in principle, but the values are simply divided into bytes. There is no problem storing the number of milliseconds since 1970-01-01 at different precisions. The total range of this long is very, very large (I do not want to calculate now how many years BC the value -2^63 ms would be). If you have a range like the one above, you get e.g. longs like 0x0000323113422134 TO 0x0000325113422174. You can see that the front is identical; the center of the range may only need a precision of 12 digits. As said, the long is divided into 8 bytes, so only the ends of the range are searched at the highest precision, the center at lower ones. As the lowest precision does not change (it would only change if start and end were thousands of years apart), my algorithm does not use it; it only goes down to the precision that is actually needed. The if-clause "(length==1 || minShort.compareTo(maxShort)>=0)" is responsible for that: if the lower and upper ends of the next lower precision range are empty or inverted, it is not calculated further.

          Because of this shortcut it does not make a difference whether the term values are spread far apart in the index or all clustered in one place (the algorithm works equally well whether your index contains prices between $1 and $100 or doubles giving ages of millions of years on which you want to do a range select). Independent of the selected range, the number of terms to be visited is limited by my algorithm to 3825: at most 255 terms at the lower bound and 255 at the upper bound for each of the seven finer precisions, plus at most 255 terms at the coarsest precision covering the center of the range, i.e. 255 * (2*7 + 1) = 3825. This is a hard limit, and it can only be reached if you had an index with 2^64-2 documents containing 2^64-2 different long values and a range that selects all of them (only in this case would you have to visit 3825 terms!).

          You can experiment with the new test case in the latest patch (change some numbers) and uncomment the System.out.println() that shows the number of terms seen. You will see that even with large indexes the number of visited terms is mostly around 400. Only very seldom are more terms visited (in our 500,000-doc index I have seen only one case with 750 terms, which can happen if you have a non-homogeneous distribution of values with a local maximum near one of the range bounds). Important: TrieRangeQuery only makes sense if your index has a minimum of about 500 docs; if there are fewer, you mostly have to visit all terms anyway. Just play with the test case and do some statistics!

          Because of all this, I see no real need to give the user a customized way to convert values into terms and to calculate lower precisions of them. The above code performs well (you tested it yesterday with our index) even for doubles, which are very seldom distributed homogeneously across the whole long range when stored as an IEEE 754 bit pattern. So TrieRangeQuery in its current version is effective for longs, doubles, and Dates (as they are longs). The only problem would be fixed-point numbers with many digits (e.g. BigDecimals), which cannot simply be cast to longs. For these types of numbers, TrieUtils could be extended to generate longer strings.

          On space usage: you are right, if you only want ranges on year/month, a long is too much. But keep in mind that the length of terms is not so important for index size; what matters is the number of distinct terms in the TermEnum. So I see no problem with storing integers, doubles, and dates as 64-bit values in the index - it is only a little extra term length with "unused bits".

          Paul Elschot added a comment (edited) -

          Independent of the selected range, the number of terms to be visited is limited by my algorithm to 3825.

          The code uses a trie factor of 256, or 8 bits in a long of 64 bits.
          Would it be possible to use lower values for this trie factor, like 16 (4 bits) or even 4 (2 bits)?
          In such cases the (rough) maximum number of terms for a single-ended range becomes smaller:
          (256-1) * (64/8) = 255 * 8 = 2040
          (16-1) * (64/4) = 15 * 16 = 240
          (4-1) * (64/2) = 3 * 32 = 96
          This reduction comes at the cost of doubling or quadrupling the number of indexed terms in the lower-precision field.

          The number of characters in the lower-precision terms is not really relevant for the term index, because terms are indexed with common prefixes. Therefore in these cases one could just use one character to encode the 4 bits or 2 bits.

          So the question is would it be possible to specify the trie factor when building and using the index?

          Uwe Schindler added a comment -

          I think this would easily be possible by refactoring some code parts and providing special implementations of the encoder and of incrementTrieCoded/decrementTrieCoded.
          One possibility to make this configurable would be an instance of TrieUtils that takes the trie factor as a constructor argument. The user would then encode/decode all values using that instance. TrieRangeFilter would also get an instance of the encoder and use it to calculate the prefix terms and so on.

          The number of characters in the lower precision terms is not really relevant in the term index, because terms are indexed with common prefixes. Therefore in these cases one could just use a character to encode the 4 bits or 2 bits.

          So the question is would it be possible to specify the trie factor when building and using the index?

          Yes, that is good for jumping to the correct starting term in the range query. The problem with smaller trie factors would be that for each precision step (e.g. 4 bits, 2 bits) you need one full char in the encoded form. Term length is not a problem as such, but I think the common prefixes could not be used as effectively (a lot of terms with low-cardinality chars at the beginning).

          Uwe Schindler added a comment -

          In such cases the (rough) maximum number of terms for a single ended range becomes smaller:

          Are you sure? My algorithm is more effective on single-ended ranges (the number of terms is about half that of a two-ended range). When the lower bound is open, it is enough to use the lowest precision for the lower half of the range (this special case is handled in splitRange()). Higher-precision terms only occur at the upper bound. This is why I use 255 and not 256 in my formula for the maximum number of terms (a range covering all 256 terms of a higher precision can easily be reduced to the lower precision).

          Paul Elschot added a comment -

          I simply used the maximum number of terms at each level times the number of levels.
          For a trie factor of 2**b (using b bits), and a long of 64 bits, that is:
          (2**b - 1) * (64/b)
          The actual values for b = 8, b = 4 and b = 2 are shown above.

          Uwe Schindler added a comment -

          You are right, that is correct! I misunderstood your answer. The number of terms for single-ended ranges is about half that of a two-ended range.

          Paul Elschot added a comment -

          So going from 8 bits to 4 bits will reduce the number of filter terms by a factor of 8.5, at the cost of doubling the number of indexed terms. I'd like to see actual performance figures, but on the face of it I'd rather use 4 bits.

          Uwe Schindler added a comment -

          Just an update of the patch before I implement Paul's suggestions on the trie factor.

          The current patch has some performance improvements by avoiding seeking back and forth in IndexReader's TermEnum, which may also make better use of caching. The parts of the range whose terms come earlier in the TermEnum are processed first. The order is now:

          • Highest precision first (because the field name of the highest precision is not suffixed, its terms come earlier: "fieldname" < "fieldname#trie")
          • Then the lower precisions, starting with the lowest one. The prefix of the lowest precision is the smallest (0x20) and goes up to 0x27.

          When looking into SegmentTermEnum's code, I realized that IndexReader.terms(Term t) is faster than getting the complete TermEnum and then calling seekTo(Term). Why this difference? I first wanted to change my code to use only one TermEnum instance (as with the TermDocs) throughout the whole range-split process and seek the enum for each range, but I dropped that change.

          Further improvements in this patch are more comments and cleaner (and more elegant) code in TrieUtils (without using a StringBuffer).

          Michael McCandless added a comment -

          Sheesh, doesn't anyone sleep anymore?

          When looking into SegmentTermEnum's code, I realized that IndexReader.terms(Term t) is faster than only getting the complete TermEnum and then seekTo(Term).

          Did you mean TermEnum.skipTo? See LUCENE-1437.

          Michael McCandless added a comment -

          My approach does the same in principle, but the values are simply divided into bytes.

          I very much like that your approach requires no "application- or type-specific knowledge" of what might make good ranges. I.e., it is generic, so any type that can be cast to a long and back without losing information can make use of this.

          Uwe Schindler added a comment (edited) -

          Did you mean TermEnum.skipTo? See LUCENE-1437.

          Yes, exactly that; I wrote seekTo() but meant skipTo(). The patch in that issue is only a faster scanning approach; the best thing would be to make skipTo() use the same logic as SegmentInfoReader.terms, which seeks on disk and so on, but this comment belongs in the above issue.

          I am fine with calling IndexReader.terms(Term) to use the cache and the faster seeking. The cost of creating new TermEnum instances is less than that of doing disk reads.

          Paul Elschot added a comment -

          When the trie factor can be chosen, it might make sense to encode it in the trie field name, something like:

          fieldName#trieNN

          with NN the trie factor or its number of bits.

          Uwe Schindler added a comment (edited) -

          Here is a new version of the patch, supporting 8-bit, 4-bit, and 2-bit trie variants. The API is similar, but instead of using TrieUtils as a static class, you choose one of 3 singleton converters, e.g. TrieUtils.VARIANT_8BIT.convertSomething(). TrieRangeQuery and TrieRangeFilter both accept a parameter for choosing the variant. A default can be set with the static TrieRangeFilter.setDefaultTrieVariant(...) and read with the corresponding getter.

          On Paul's suggestion: you could put the trie variant into the field name, but the main field name, where the full-precision value is indexed/stored/whatever, has no suffix. This would make it inconsistent.

          In my opinion one should use the same trie variant when indexing values and when doing range queries; it is the same as choosing a different analyzer for indexing and searching. Maybe the static method for setting the default trie variant would be better hosted in TrieUtils, but that is open for discussion.

          Now it is time to do some performance tests with really big indexes, comparing the trie variants. Does anybody have a synthetic benchmarker that can easily be used for this? The test case is not so interesting (it uses RAMDirectory). I could reindex our PANGAEA index for real-world performance testing (doubles, dates).

          The test case (if you uncomment the printout of the number of terms in TrieRangeFilter) clearly shows the lower number of terms visited for 4-bit or 2-bit. Formulas are in the package description.

          In my opinion 4-bit is a good alternative (about 3 times the space for about 8.5x fewer terms), but the benefit of 2-bit is too low for about 6 times the space.

          Mark Harwood added a comment -

          A note of caution: when moving from Lucene 2.3 to 2.4 I noticed that, with my similar scheme for encoding information, I could no longer encode byte arrays containing bytes with values > 216.
          The changes (I think in LUCENE-510) introduced some code that modified the way the bytes were written/read and corrupted my encoding.

          Not sure if your proposed approach is prone to this or if anyone can cast further light on these encoding issues.

          Good to see this making its way into Lucene, Uwe.

          Uwe Schindler added a comment -

          I do not know exactly what you did in your code. Did you convert the byte[] arrays directly to a string, e.g. by using new String(byte[], charset), or without a charset (in which case Java uses the platform default, which may be UTF-8)? Then the byte 216 (0xd8) would be interpreted by new String() as part of a UTF-8/UTF-16 sequence and map to some unknown char. UnicodeUtils, on the other hand, encodes the Java chars to UTF-8 when storing them in the index, but does not like chars in the surrogate range starting at 0xd800 and replaces them with the replacement char for unknown characters. That is a modification, because a lone char >= 0xd800 is not valid (see the source of UnicodeUtils), and 0xd800 looks like 0xd8 (== 216).

          My code does not transform a byte[] directly to a string; it creates a standard 16-bit Java char from each byte plus an offset. As the trie code only produces bytes between 0 and 255 and adds an offset of 0x30, the range of chars is 0x30..0x12f. This range is Unicode safe and can easily be encoded to UTF-8. But I will write an explicit test case for that tomorrow!
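
          A tiny round-trip check along these lines (just an assumption of what such a test case could look like) would verify that every char in 0x30..0x12f, being a plain BMP character, survives a UTF-8 round trip unchanged:

            public class TrieCharRoundTripCheck {
              public static void main(String[] args) throws Exception {
                StringBuffer sb = new StringBuffer();
                for (char c = 0x30; c <= 0x12f; c++) {
                  sb.append(c);
                }
                String original = sb.toString();
                byte[] utf8 = original.getBytes("UTF-8");
                String decoded = new String(utf8, "UTF-8");
                if (!original.equals(decoded)) {
                  throw new AssertionError("UTF-8 round trip failed");
                }
                System.out.println(original.length() + " chars survived the round trip");
              }
            }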

          Maybe your problem was the mentioned misuse of directly generating strings from byte arrays with an unknown/incorrect charset. Keep me informed!

          By the way, Earwin Burrfoot: are you sure your 15-bit encoding works with the latest Lucene version? Maybe this affects you, too.

          Mark Harwood added a comment -

          Yep, I was simply using new String(byte[])

          Nadav Har'El added a comment -

          Hi, I just wanted to comment that a similar method was proposed in 2005 by several researchers at IBM (all of whom have moved to Yahoo by now, actually). See "Inverted Index Support for Numeric Search" by M. Fontoura, R. Lempel, R. Qi, and J. Zien:
          http://webcourse.cs.technion.ac.il/236620/Winter2006-2007/hw/WCFiles/paramsearch.pdf
          The paper contains some discussion on the tradeoffs between the number of layers, number of terms, index size, and so on, so it might contain some valuable ideas even if you don't use their scheme exactly.

          Uwe Schindler added a comment -

          Thank you, I did not know this paper. It is often rather hard to find all the papers on such subjects. Very nice work!

          Michael McCandless added a comment -

          Uwe, is this ready to be committed? It looks good, and the tests pass. Thanks!

          Uwe Schindler added a comment -

          It is almost complete. In my opinion the only remaining change is the handling of defaults, which I want to move directly into TrieUtils. Let me iterate one more patch and then we can commit. As I currently have no contributor status, it is simpler to iterate the patch enough before committing. I was waiting for additional comments before releasing a new patch version.

          I have not had time to do benchmarks yet, so testing the three trie variants for speed/disk IO/indexing speed is still open. My current benchmarks only cover the performance of my "old" trie code, not the new one with the more compact binary encoding. I asked for a good "benchmarking framework": is contrib/benchmark useful for that? For benchmarking you need to create rather big indexes, maybe containing random numeric values, so the benchmark may also need a lot of disk space (1 GB?). You can leave out stored fields and additional full-text fields, but the benchmark should at least also have a normal tokenized field, to measure the performance of combining trie queries with ordinary term queries (as in the paper Nadav mentioned).

          Do you think, it is good to directly include it into contrib-search? In my opinion, it is, but maybe others think different.

          Uwe Schindler added a comment -

          New version of the patch:

          • moved default trie variant handling to TrieUtils
          • added a method to TrieRangeFilter that returns the number of terms visited during getDocIdSet()
          • tests now print the number of visited terms; other test improvements

          Just one question: is it ok to have a static initializer in the test to prevent the test from generating the rather large index for each test?

          Uwe Schindler added a comment -

          From my point of view: the patch is now ready to commit, all tests pass.

          The last patch also improved the javadocs. Maybe a native speaker could please look over my javadocs - they are "Denglish" (as it is called in Germany)?

          Michael McCandless added a comment -

          I asked for a good "benchmarking framework", is contrib/benchmark useful for that?

          I think it is good. You should probably make your own DocMaker to create random number fields, or, make a big line file and use LineDocMaker.

          And then also create your own QueryMaker to create TrieRangeQuery's.

          One nice test to add, to assert correctness, would be to iterate N times, randomly picking lower/upper bounds, and compare the builtin RangeFilter vs the TrieRangeFilter, asserting that they pull the same docs?

          is it ok to have a static initializer in the test to prevent the test from generating the rather large index for each test?

          I think that's fine.
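
          A minimal sketch of such a randomized cross-check (assuming an IndexSearcher over the test index; the exact TrieRangeFilter constructor and the TrieUtils encoding method name are assumptions, not the patch API):

            // hypothetical test method: compare the builtin RangeFilter with TrieRangeFilter
            public void testRandomRangesAgainstRangeFilter() throws Exception {
              Random rnd = new Random(42);
              for (int i = 0; i < 50; i++) {
                long a = rnd.nextInt(100000), b = rnd.nextInt(100000);
                long lower = Math.min(a, b), upper = Math.max(a, b);
                Filter classic = new RangeFilter("value",
                    TrieUtils.VARIANT_8BIT.longToTrieCoded(lower),   // assumed encoder name
                    TrieUtils.VARIANT_8BIT.longToTrieCoded(upper), true, true);
                Filter trie = new TrieRangeFilter("value", new Long(lower), new Long(upper));  // assumed c'tor
                int classicHits = searcher.search(new ConstantScoreQuery(classic), null, 1).totalHits;
                int trieHits = searcher.search(new ConstantScoreQuery(trie), null, 1).totalHits;
                assertEquals("same hit count for [" + lower + ".." + upper + "]", classicHits, trieHits);
              }
            }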

          Uwe Schindler added a comment -

          I have some first "benchmarks" on the three trie implementations. They were done using three indexes with the same content, but encoded in 8 bit, 4 bit and 2 bit, containing real-world data in 575,000 documents of the PANGAEA repository (see www.pangaea.de, as mentioned before). The same range queries were executed after some warmup, and the time was measured until TopDocs returned the first 10 documents.

          The indexes each contain 13 numeric, trie-encoded fields (doubles and dates). Index size (including the "normal" fields) was:

          • 8bit: 4.8 GiB
          • 4bit: 5.1 GiB
          • 2bit: 5.7 GiB

          So the additional 13*8*575,000 trie terms (of longer size) took 300 MB going from 8 to 4 bit. Going from 4 to 2 bit, the additional 13*16*575,000 terms took 600 MB (on top of the step from 8 to 4 bit, so it is linear in the number of additional terms).

          The performance impact was negligible (or rather, the standard deviation was bigger than the performance gain), so I still think 8 bit is a good choice, and its index size is the smallest.

          My idea why this is so: using fewer bits increases the number of terms in the index (I do not know how much this decreases performance) and needs more TermEnum seeks (for each trie precision, two seek operations are needed). These TermEnum seeks are faster for the 8 bit trie variant, because the two-char prefix can be used extensively (first char = precision, second char = first byte of the numeric value). With 4 bit and 2 bit you only get 4 or 2 bits in addition to the precision marker, so seeks in the TermEnum are slower.

          In comparison, a ConstantScoreRangeQuery on the full-precision trie-coded values took about 10-20 times longer. Here are the numbers:
          For PANGAEA, all queries are half-open (our problem is that the data sets themselves have bounding boxes - minlatitude, maxlatitude, minlongitude, maxlongitude - and the user searches for a bounding box; hits are data sets that intersect the search bounding box). So for a complete lat/lon constraint you have 4 half-open ranges (with the current trie implementation this is not a problem, because two ANDed half-open filters need just half the number of terms of a full range; the only performance impact is ANDing the two OpenBitSets). Four such half-open ranges ANDed together take the following times on the given index after some warmup, including fetching the first 10 documents:

          ConstantScoreRange: 1500-4000 ms
          TrieRangeQuery 8bit: 40ms-100ms
          TrieRangeQuery 4bit: 30ms-80ms
          TrieRangeQuery 2bit: 20ms-80ms

          The old RangeQuery was not tested; it hits the MaxClauseCount constraint, and if that is raised to Integer.MAX_VALUE it takes approx. 1 minute to finish. The numbers are for pure range queries; if you additionally add some search terms, the time goes up a little. Only TrieRangeQuery was used (so rewrite to constant score); no docs were filtered in the classical way (term queries + filtering of results).

          More benchmark results will follow when I finish the benchmarker, but these are real-world examples. The search engine machine was a Sun X4600 with 16 Opteron cores, 64 bit Java 1.5, Sun Java System Web Server, -Xmx 8192M (our current configuration). On lower-cost machines like desktop PCs, the ConstantScoreRangeQuery takes much more time, but TrieRangeQuery takes approximately the same time, as disk IO seeking is more the limiting factor. Processor speed and number of processors are not the limiting factor (if only one query runs at a time).
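
          A minimal sketch of the bounding-box case described above, ANDing four half-open filters on the OpenBitSet level (the TrieRangeFilter constructor with nullable bounds and the OpenBitSet return type of getDocIdSet() are assumptions here, not confirmed API):

            // hypothetical: search box [minLat..maxLat] x [minLon..maxLon] intersected with dataset boxes
            OpenBitSet bits = (OpenBitSet) new TrieRangeFilter("minLatitude", null, new Double(maxLat)).getDocIdSet(reader);
            bits.and((OpenBitSet) new TrieRangeFilter("maxLatitude", new Double(minLat), null).getDocIdSet(reader));
            bits.and((OpenBitSet) new TrieRangeFilter("minLongitude", null, new Double(maxLon)).getDocIdSet(reader));
            bits.and((OpenBitSet) new TrieRangeFilter("maxLongitude", new Double(minLon), null).getDocIdSet(reader));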

          Uwe Schindler added a comment - - edited

          Just one note: the queries in the previous comment were run on the same index with typical "user queries"; the hit counts and ranges were rather large (e.g. queries with about 120,000 total hits, so the typical case of water around Africa, like Mike McCandless used in his first tests). They were not synthetic; they were done on the real, running production system (I only have two Sun X4600 machines..., so tests are mostly done on the live system, a bit like open-heart surgery).

          Paul Elschot added a comment -

          Uwe, thanks very much for the size and performance info. The space/time tradeoff between 8 and 4 bits is clear, and it's great to have that flexibility.

          Uwe Schindler added a comment -

          New patch; I think this is really ready to commit. It includes Mike's suggestion to compare the result counts of RangeQuery and TrieRangeQuery with random values. It also contains a further test of the result count of ranges on an index containing values with distance 1. This is to detect errors where the code generating the split range fails to correctly join the range parts to each other.

          While changing my own project panFMP to use the new contrib package, I needed a function to read stored, trie-encoded values for re-indexing with another trie variant. The new static methods in TrieUtils choose the variant for decoding based on the length of the encoded string. These static methods can be used to easily decode stored fields that use the trie encoding without knowing the variant. For range queries, the encoding cannot be auto-detected (which is clear).

          Further work may be an optimized sort algorithm for the trie-encoded fields. The current FieldCache cannot handle them as longs, only as Strings, which is memory intensive. Having a FieldCache implementation for longs that supports this encoding would be good. I do not want to do this with the current inflexible FieldCache implementation, so I am waiting for LUCENE-831. Does anybody have an idea how to plug the correct FieldCache impl for this encoding into current Lucene, respecting the TrieUtils variant?
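
          A minimal sketch of decoding a stored, trie-encoded value without knowing the variant (the field name is illustrative; the method name trieCodedToLongAuto is taken from a later comment in this thread):

            Document doc = searcher.doc(docId);
            // the variant (8/4/2 bit) is detected from the length of the encoded string
            long value = TrieUtils.trieCodedToLongAuto(doc.get("timestamp"));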

          Michael McCandless added a comment -

          Those results are nice, thanks Uwe! The open-heart surgery part was spooky though

          Index size (including the "normal" fields) was:

          As a baseline, how large is an index with just the normal fields?

          I'll commit shortly. This is a great addition!

          Uwe Schindler added a comment -

          I am sorry - there is one javadoc bug in the new static method: it has one @link too many after @throws.

          Uwe Schindler added a comment -

          As a baseline, how large is an index with just the normal fields?

          This is not so simple to check; I do not have such an index available, and it takes a long time to reindex. The 3 variants were generated last night on the X4600 monster machine, which took 2 hours (in parallel).

          But as the step from 8 to 4 bits took about 300 MB more space and doubled the number of trie terms, the index without the numeric fields (if it is still linear) should be about 4.5 GiB.

          Sometimes, when changing program code on "live" systems that users are working on (sometimes you cannot avoid it), I really feel like I am doing open-heart surgery.

          Michael McCandless added a comment -

          But as the step from 8 to 4 bits was about 300 MBytes space more and the step doubled the number of terms, the index without the numeric fields (if its still linear) should be about 4.5 GiB

          OK thanks!

          Michael McCandless added a comment -

          Committed revision 723031.

          Thanks Uwe!

          Michael McCandless added a comment -

          Uwe, can't you use ExtendedFieldCache.getLongs, passing in your own LongParser?

          Uwe Schindler added a comment -

          The question is how to plug in the whole FieldCache code from contrib. I see there is only one default ExtendedFieldCache, named ExtendedFieldCacheImpl, but how do I implement a custom field cache and tell the searchers to use it whenever they see a trie-encoded field (e.g. when Sort.AUTO is used)? This is the one thing I do not understand how to do.

          A very simple LongParser may be:

            new LongParser() {
                public long parseLong(String value) {
                  return TrieUtils.VARIANT_8BIT.trieCodedToLong(value);
                }
            };
          

          But what to do with it for sorting search results...

          Michael McCandless added a comment -

          I think if you call ExtendedFieldCache.EXT_DEFAULT.getLongs, first, with your parser, and then sort by that field as a long, it will just use what's already populated into the cache? (It is sort of messy though; ideally there would be some way to record the LongParser you'd like used for a given field).

          Uwe Schindler added a comment -

          This is messy and not easy to understand for a beginner. I was thinking of supplying a class in the contrib package o.a.l.search.trie that extends SortField or Sort or whatever else, to mark a field explicitly as "trie-encoded" (which could also use the automatic trie variant detection in TrieUtils.trieCodedToLongAuto()), and that can be used when calling Searcher.search(). And if TrieUtils were not in contrib, the AUTO parser could be extended to also try TrieUtils for decoding the field value (as it also throws NumberFormatException, it would fit in wonderfully).

          Another idea was to override ExtendedFieldCacheImpl and set it in the (non-final) static member ExtendedFieldCache.EXT_DEFAULT (which would be possible), but ExtendedFieldCacheImpl is package-protected. The whole sort thing is a bit messy; this is why I was hoping that LUCENE-831 may fix this problem in the future - maybe we should move this discussion to that issue.

          For a start, I could simply supply a static parser instance in TrieUtils, e.g. a LongParser named TrieUtils.TRIE_FIELD_CACHE_PARSER, that uses auto-detection or the concrete trie variant.

          Uwe Schindler added a comment -

          I think if you call ExtendedFieldCache.EXT_DEFAULT.getLongs, first, with your parser, and then sort by that field as a long, it will just use what's already populated into the cache?

          I think this cannot work. The cache is keyed by FieldCacheImpl.Entry, which contains the parser to use. If the search code later calls getLongs() with the default parser, it gets a new array of longs, not the one previously parsed with the specific parser, as there is no cache hit (the Entry is not equal). Am I missing something here?

          Uwe Schindler added a comment - - edited

          Hi Mike,
          the last Hudson build failed. It seems it did not like that I used "LuceneTestCase" instead of the standard JUnit TestCase in one of the tests. I copied the source of this test from some other class. Maybe I cannot use LuceneTestCase in contrib. But this is not a problem; it works identically with the default JUnit class.

          Uwe Schindler added a comment -

          Sorry for yet another patch: when looking into the test again, I noticed I had missed a test for the automatic encoding detection by string length (TrieUtils.trieCodedToXxxAuto()).
          The attached patch fixes the Hudson build and adds this test.

          Michael McCandless added a comment - - edited

          Hmm – I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files.

          OK this seems to fix it:

          Index: contrib/contrib-build.xml
          ===================================================================
          --- contrib/contrib-build.xml	(revision 723145)
          +++ contrib/contrib-build.xml	(working copy)
          @@ -61,7 +61,7 @@
             </target>
           
             
          -  <target name="init" depends="common.init,build-lucene"/>
          +  <target name="init" depends="common.init,build-lucene,build-lucene-tests"/>
             <target name="compile-test" depends="init" if="contrib.has.tests">
               <antcall target="common.compile-test" inheritRefs="true" />
             </target>
          

          I'll commit that, and the fix to the test case. Thanks Uwe!

          Michael McCandless added a comment -

          Hmm - I would prefer that contrib tests subclass LiaTestCase

          Woops, I meant LuceneTestCase. Time sharing is not working very well in my brain this morning...

          Michael McCandless added a comment -

          Committed revision 723287.

          Michael McCandless added a comment -

          I think, this cannot work. The Cache is keyed by FieldCacheImpl.Entry containing the parser to use.

          Sigh, you are correct. How would you fix FieldCache?

          I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.

          Uwe Schindler added a comment -

          Thanks, then I would also change TestTrieRangeQuery to also use LuceneTestCase, just for completeness.

          Sigh, you are correct. How would you fix FieldCache?

          I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you would create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is Object because the parsers have no common super-interface. The ideal solution would be to have:

          SortField(String field, int type, FieldCache.Parser parser, boolean reverse)

          where FieldCache.Parser is a super-interface (empty, more like a marker interface) of all other parsers (like LongParser...).
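
          A minimal usage sketch of the proposed constructor (it does not exist yet at this point; the field name is illustrative, and the parser is the anonymous LongParser shown earlier in this thread):

            LongParser parser = new LongParser() {
              public long parseLong(String value) {
                return TrieUtils.VARIANT_8BIT.trieCodedToLong(value);
              }
            };
            // sort on a trie-encoded field as a long, using the proposed parser-aware constructor
            Sort sort = new Sort(new SortField("timestamp", SortField.LONG, parser, false));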

          I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.

          The problem with the extra field is that it works well for longs or doubles (with some extra work), but dates would still stay Strings, or you would use Date.getTime() as a long. But this is not very elegant and needs more fields and terms. I prefer a clean solution.

          Michael McCandless added a comment -

          Thanks, then I would also change TestTrieRangeQuery to also use LuceneTestCase, just for completeness.

          OK done.

          would fix FieldCache by giving in SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "object" bcause all parsers have no super-interface.

          This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right?

          but Dates still keep as String, or you use Date.getTime() as long

          Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to long directly from a TrieUtil encoded Date field, right?

          Uwe Schindler added a comment -

          Yes, I will open an issue! Maybe I will create a first patch after looking into the problem.

          This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right?

          I am not sure, but I think it's better to leave it as it is now. On the other hand, if we just add a "marker" super-interface, it should be backwards compatible, because the super-interface is new and existing code would only use the existing interfaces. No new methods are added by the super-interface, so code would be source and binary compatible (as it only references the existing interfaces). I think we had this discussion some time ago in another issue (Fieldable???), but that was a different problem.
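
          A minimal sketch of the marker super-interface idea (names are illustrative, not the actual FieldCache source):

            public interface Parser {                       // empty marker interface, no methods
            }
            public interface LongParser extends Parser {
              long parseLong(String value);                 // unchanged signature, so existing code keeps compiling
            }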

          Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to long directly from a TrieUtil encoded Date field, right?

          Correct. As soon as this works, I would simply add, as an "extra bonus", o.a.l.search.trie.TrieSortField, which automatically supplies a correct parser for easy usage. Date, Double and Long trie fields can always be sorted as longs without knowing their actual meaning (because the trie format was designed that way).

          Currently my code would just sort the trie-encoded fields using SortField.STRING, but this is resource expensive (but I have no example currently running, as it was not needed for panFMP/PANGAEA and other projects).

          Uwe Schindler added a comment -

          Hi Mike,

          I opened issue LUCENE-1478 and attached a first patch.

          About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt. Can you add it there or should I write a small patch? I think it should at least be mentioned there for what it is for, but the JavaDocs are much more informative and the corresponding paper / code credits are cited there.

          Thank you very much for helping to get this into Lucene!

          Michael McCandless added a comment -

          Thank you very much for helping to get this into Lucene!

          You're welcome! But that was the easy part. Thank you for creating it & getting it into Lucene!

          About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt.

          I agree – can you create a patch? Thanks.

          Uwe Schindler added a comment -

          Here the readme changes.

          Michael McCandless added a comment -

          Committed revision 723701.

          Thanks Uwe!

          Yonik Seeley added a comment - - edited

          Attaching a completely untested prototype TrieUtils.java.

          some discussion:
          http://www.lucidimagination.com/search/document/d62c0fd21d88f880

          Features:

          • same encode/decode code works for any variant... no 2,4,8 bit specific instances
          • decouples "slicing" of the value into different precisions from encoding the slice to a String, allowing the most efficient String encoding to be used for every precision variant.
          • 7 bit char encoding to optimize for UTF8 index storage
          • right justified to allow lucene to prefix compress efficiently
          • separates creation of sortableBits from trie encoding of those bits to avoid so many methods
          • allows indexing into multiple fields, or all in the same field
          • much smaller code should be much easier to understand
          • left out "Date" support - the average Java developer understands how to go from a Date to a long (unlike double, etc).
          • relatively trivial to add 32 bit (int/float) support and reuse code like addIndexedFields (which is just an agnostic helper method).
          Uwe Schindler added a comment -

          Thanks for the code fragments; I can take this as a basis and will implement the TrieRangeFilter counterpart. I have some more ideas (e.g. the beginners' API is missing the possibility to store the full-precision value). And norms should be disabled by default (in the beginners' API). 32 bit support is also simple, as you noted.

          The problem now is how to make TrieRangeFilter generic enough to support all possible encodings and field names (with your code, it would also be possible to put each precision in a separate field). I am working on that; I want a clean API for the TrieRangeFilter!

          I would give both addIndexedFields() and addField() the same name (just overloads) with different options. I will prepare that!

          I agree, Date support is not needed. And if I added Date support again, Calendars should also be possible, etc. No need for that - Date should be deprecated by Sun!

          One question: the encoding is now different from NumberUtils again - NumberUtils tries to get the most out of each char, whereas this tries not to affect UTF-8 encoding and uses ASCII only? Would Solr use this encoding (7 bit chars) for numeric values in the future, or should both stay separate? This TrieUtils also makes it possible to not use trie coding, but to encode doubles/longs for other uses (and use the currently missing LongParser/SortField generators).

          Uwe Schindler added a comment - - edited

          Adding that as comment:

          > On Sat, Feb 7, 2009 at 12:29 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
          > > This is only a minimal optimization, suitable for very large indexes.
          > The
          > > problem is: if you have many terms in highest precission (a lot of
          > different
          > > double values), seeking is more costly if you jump from higher to lower
          > > precisions.
          >
          > That's my point... in very large indexes this should not result in any
          > difference at all on average because the terms would be no where near
          > each other.

          OK.

          I am preparing a new TrieRangeFilter implementation, just taking the String[] field names, the sortable longs and the precisionStep.

          And I think you are right. We could completely remove the "storing" API. If one wants to add stored fields, one can use NumberUtils and do this separately (add stored, not indexed fields). For TrieRangeFilter it is not needed.

          > As an example: in a very big index, one wants to independently collect
          > all documents that match "apple" and all documents that match "zebra",
          > which term you seek to first should not matter.

          OK, I agree

          Uwe Schindler added a comment -

          Reopening this issue. I will prepare a patch later (no time now, it is Saturday evening here in Germany), completely changing the current API and encoding - thanks, Yonik! I think the TrieRangeFilter will get smaller, too (OK, the c'tors may stay for beginners). The javadocs need to be updated, too.

          I will also add the missing LongParser methods and SortField factories, as the encoded fields can be stored in the ExtendedFieldCache (but there as real longs, not sortable longs) - as before.

          I will also add a 32 bit API.

          Thanks, Yonik!

          Yonik Seeley added a comment -

          for the beginners API there is missing the possibility to store the full-precision value

          They could simply store it in a different field, in whatever format they desire, right? It seems like TrieRange should be about range matching, not the format of stored fields.

          NumberUtils tries to get the most out of each char vs. this tries to not affect UTF-8 encoding and use ASCII only?

          NumberUtils in Solr was developed a long time ago, before Parser support in the FieldCache, etc (Lucene 1.4). I chose 14 bit numbers to minimize size in FieldCache using a StringIndex, and because I didn't understand Lucene prefix compression at the time

          If there are to be many in-memory representations, then using 14 bit chars might be better. Otherwise it seems like 7 bit might be preferable (better prefix compression, more predictable branches in the UTF8 encoder/decoder). Of course it's a trivial switch, so perhaps we should just try and benchmark it when everything else is done.

          As for TrieRangeFilter, I guess the most generic constructor would look like:

          TrieRangeFilter(int precisionStep, String[] fields, long lowerSortableBits, long upperSortableBits, boolean includeLower, boolean includeUpper)
          
          Uwe Schindler added a comment -

          NumberUtils in Solr was developed a long time ago, before Parser support in the FieldCache, etc (Lucene 1.4). I chose 14 bit numbers to minimize size in FieldCache using a StringIndex, and because I didn't understand Lucene prefix compression at the time

          The same here. By the way, the new SortField constructors taking a LongParser (LUCENE-1478) as a parameter make this very simple. You can then sort by the trie-encoded field (and do not need to index it separately). Just use the LongParser supplied by TrieUtils to sort; no String sorting needed. I only wanted to supply static parsers based on TrieUtils for that.

          Uwe Schindler added a comment -

          For TrieRangeFilter I now also need something similar to the current increment/decrementTrieCoded that adds 1 to a prefix-coded value (in principle, just add 1<<shift to the long and use it, something like that; see the sketch below). This is needed for matching the parts of the range against each other. More later.
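
          A minimal sketch of that helper (illustrative only, not necessarily the exact code that ends up in the patch):

            // increments the value by one unit at the given trie precision
            static long incrementAtShift(long value, int shift) {
              return value + (1L << shift);
            }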

          Uwe Schindler added a comment - - edited

          I modified the proposed TrieUtils a little bit. Maybe this class could become the new NumberUtils, with the extra option to trie-encode the values.

          • Added 32bit support
          • Merged the unsigned int/long handling into the prefixCode methods. For TrieRangeFilter the raw bits will not be needed (I need comparable signed ints/longs).
          • The conversions from doubles and floats were renamed and return the standard signed long/int: doubleToSortableLong and floatToSortableInt (see the sketch below); Date is removed (Date.getTime() can be used instead, as everybody knows).
          • Still missing are Long/IntParser for FieldCache and a SortField factory.

          It's still untested!

          Tomorrow (now it is time to go to bed) I will implement the TrieRangeFilter in two variants (one for 32 bit ints, another for 64 bit longs). The min/max values are the ints/longs. The trie coding is also done using the shift value and bit magic. The results of the range split are then encoded using TrieUtils.xxxToPrefixCode().
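
          A minimal sketch of the usual IEEE 754 trick behind a method like doubleToSortableLong (the method name is taken from the list above; the body shown is the common approach, not necessarily the exact patch code):

            public static long doubleToSortableLong(double val) {
              long bits = Double.doubleToLongBits(val);
              // flip the value bits of negative numbers (the sign bit already places
              // negatives before positives), so the resulting longs sort numerically
              if (bits < 0) bits ^= 0x7fffffffffffffffL;
              return bits;
            }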

          Yonik Seeley added a comment -

          Merged the unsigned int/long handling into prefixCode methods. For TrieRangeFilter the raw bits will not be needed (I need compareable signed ints/longs).

          Nice Uwe... we're really on the same wavelength now - I came to the exact same conclusion and made the exact same changes! It's much nicer having ints and longs as the exposed "interface" so if a<b in java then a will come before b in the term index.

          Uwe Schindler added a comment -

          A fixed TrieUtils (a small bug in the encoding routine caused an endless loop; the array index increment was missing).

          Uwe Schindler added a comment -

          A first test version of TrieRangeFilter. Some of the functionality is still missing (no hashCode, some dead parts, ...). But: it works. I modified the tests and got it running. Both variants (old and new) visit the same number of terms and return exactly the same results.

          For further testing it is now important to generate ranges with negative longs. It seems to work, but I need more tests. It was hell to get it running; I hate Sun for not having unsigned ints/longs in Java.

          TrieRangeQuery is obsolete now. You can wrap the filter in a ConstantScoreQuery or use TrieRangeFilter.asQuery() (which returns a nicer toString() output).

          I will now clean up everything and create a patch!
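
          A minimal usage sketch of the two wrapping options just mentioned (the TrieRangeFilter constructor arguments shown here are assumptions, not the final API):

            TrieRangeFilter filter = new TrieRangeFilter("value", new Long(lower), new Long(upper));
            Query q1 = new ConstantScoreQuery(filter);   // generic constant-score wrapping
            Query q2 = filter.asQuery();                 // same effect, nicer toString() output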

          Uwe Schindler added a comment -

          I forgot to mention:
          I modified TrieUtils to use a different SHIFT_START for ints and longs. This way a NumberFormatException can be thrown early if you try to decode a long using the int method or vice versa. I added these checks to fail early when old indexes using a different encoding are used, or when sorting would use the wrong encoding. TrieUtils still needs the FieldCache parsers, but that is a trivial addition.
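          A minimal sketch of what such a fail-early check could look like (the constant values, method name and decoding details are assumptions for illustration, not the committed code):

            // Illustrative only: the actual SHIFT_START offsets in TrieUtils may differ.
            public static final char SHIFT_START_LONG = (char) 0x20; // assumed marker offset for long-encoded terms
            public static final char SHIFT_START_INT  = (char) 0x60; // assumed marker offset for int-encoded terms

            public static long prefixCodedToLong(String prefixCoded) {
              int shift = prefixCoded.charAt(0) - SHIFT_START_LONG;
              if (shift < 0 || shift > 63) {
                // the term was encoded with the int variant (or is no trie term at all) -> fail early
                throw new NumberFormatException("Invalid shift value in prefix-coded long: " + prefixCoded);
              }
              long value = 0L;
              // ... decode the remaining chars of prefixCoded into 'value' (omitted in this sketch) ...
              return value << shift; // low 'shift' bits of lower-precision terms come back as zeros
            }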

          Uwe Schindler added a comment -

          Yonik:
          How would you handle a TrieRangeFilter for ints? I could create a separate class that looks identical (but uses the other TrieUtils methods and int instead of long), or combine both in one class (which is hard). The problem is that you cannot share the same code base, because masks, shifts, 31 vs. 63, and the other datatype are everywhere.
          If I do it, I will have two classes, TrieRangeFilter32 and TrieRangeFilter64, or alternatively TrieRangeFilterInt and TrieRangeFilterLong. What's your opinion?

          Uwe Schindler added a comment -

          Now the next step: A corrected TrieUtils again.

          • contains the FieldCache parsers and sort field factories for easy usage (a sketch follows below)
          • changes the SHIFT value for ints (0x80 was too high; the UTF-8 optimization needs chars < 0x80)

          Sorting tests now also work.
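          As a rough illustration of how such a parser could plug into the FieldCache for sorting (a sketch only; the exact FieldCache parser interfaces differ between Lucene versions, and prefixCodedToLong is the assumed decoding method name):

            // Sketch: a parser that decodes trie/prefix-coded terms back to plain longs.
            public static final FieldCache.LongParser LONG_PARSER = new FieldCache.LongParser() {
              public long parseLong(String value) {
                return TrieUtils.prefixCodedToLong(value); // assumed decoding method
              }
            };

            // Illustrative usage: fill the FieldCache for a trie-encoded field.
            long[] values = FieldCache.DEFAULT.getLongs(reader, "price", LONG_PARSER);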

          Yonik Seeley added a comment -

          Uploading a new version of TrieUtils.

          I've implemented my own range splitting logic (it will be interesting to see how it compares to yours, and if mine actually works! so many edge cases...)

          It separates the range splitting logic from the execution of anything, so we can use the same logic to generate queries that match the same range (no building an OpenBitSet first), or other things like printable queries for debugging.

          After reflection, the RangeBuilder interface could be simplified to have just a single addRange() method, with the implicit assumption that all ranges are ORed together.
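          A minimal sketch of what such a callback could look like (a hypothetical shape for illustration, not necessarily the API that was committed):

            // Hypothetical callback: the range splitter reports each generated sub-range,
            // and the builder implementation decides what to do with it (bit set, query, debug output, ...).
            public interface LongRangeBuilder {
              /**
               * Called once per sub-range. min and max are the inclusive plain-value bounds of the
               * sub-range; shift identifies the precision level it belongs to.
               */
              void addRange(long min, long max, int shift);
            }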

          Yonik Seeley added a comment -

          Since filters are a symbolic representation (like Query), it makes sense to have separate TrieRangeFilter32 and TrieRangeFilter64 classes.
          Hopefully most of the logic that is in common can be shared somehow.

          Yonik Seeley added a comment -

          I think the range splitting code I wrote is generic enough to automatically handle 32 bits too.
          I think it's just a matter of starting with a different shift?

          So for longs we have:

            public static Object buildRange(RangeBuilder builder, long ll, long uu, int precisionStep) {
              // figure out where to start shift at
              int shift = ((64-1)/precisionStep) * precisionStep;
              return buildRange(builder, ll, uu, shift, precisionStep);
            }
          

          And for ints, it would simply (hopefully) be:

            public static Object buildRange(RangeBuilder builder, int ll, int uu, int precisionStep) {
              // figure out where to start shift at
              int shift = ((32-1)/precisionStep) * precisionStep;
              return buildRange(builder, (long)ll, (long)uu, shift, precisionStep);
            }
          

          Does that sound right?

          Uwe Schindler added a comment -

          I am still "reading" your code, but the main parts look identical (including the break condition when no inner range is available -> this condition is very important and must be min>=max). The only difference between the two is the generation of the inner range bounds. I just rewrote my old code; did you do it completely from scratch?

          I would change RangeBuilder to be just an interface/abstract class with an addRange method and no Object return value. The ranges will always be ORed (ANDing makes no sense). The same idea came to me when I tried to unify the 32-bit and 64-bit variants in my TrieRangeFilter code.

          For performance reasons it is better to use the same TermEnum and TermDocs and only one OpenBitSet when executing the range. But this can easily be handled using the interface (the RangeBuilder initializes the OpenBitSet, TermDocs and TermEnum, like in my code; when the range has been built, the OpenBitSet is retrieved).

          Yonik Seeley added a comment -

          did you do it completely from scratch?

          Yep, took me a while... An interesting exercise and good news if it's so similar to yours. Hard code like this is actually fun too

          I would change RangeBuilder to be only a interface/abstract class with no Object return code that has a addRange method.

          Sounds fine - that means that RangeBuilder instances will always need to be stateful, but that seems flexible enough and simpler to understand than passing around Objects and casting.

          Uwe Schindler added a comment -

          I like your idea; I will also use a generic range splitter (as described before), but with my algorithm. I am really interested in how the two compare. You can test this if you take mine and enable the System.out in setBits (which is identical to your addRange()), which prints hexadecimals for each range.

          For nicer code, I would supply two interfaces to build ranges (int, long), because TrieUtils always needs to use int/long correctly and the user may be confused if he gets longs (even if the API is very much an expert API). But it can e.g. be used to create a BooleanQuery of classic range queries ORed together, as in the sketch below.
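          A minimal sketch of such a builder that turns every reported sub-range into an ORed clause of a BooleanQuery (the callback shape is the hypothetical one sketched above; the field name and longToPrefixCoded are illustrative assumptions):

            // Sketch: collect the sub-ranges reported by the splitter into one BooleanQuery.
            final BooleanQuery bq = new BooleanQuery(true); // true = disable coord, the clauses are pure ORs

            LongRangeBuilder queryBuilder = new LongRangeBuilder() {
              public void addRange(long min, long max, int shift) {
                // encode the bounds back to prefix-coded terms for a classic constant-score range (method name assumed)
                String lower = TrieUtils.longToPrefixCoded(min, shift);
                String upper = TrieUtils.longToPrefixCoded(max, shift);
                bq.add(new ConstantScoreRangeQuery("price", lower, upper, true, true),
                       BooleanClause.Occur.SHOULD);
              }
            };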

          Uwe Schindler added a comment -

          There may be one issue with signedness in your code and mine for the special case of a precisionStep of 1. In that case the outermost/innermost calculation of the range bounds can fail, but I am not sure; I want to write a test for it.

          I am currently testing only 8, 4, 2, as the results for those must be identical to my old code (I also printed a lot of System.outs in my old code to generate an exact match of everything). There are some special cases in range splitting that are very sensitive, and it also took me a very long time to reimplement it. But working with longs/ints is really nicer and cleaner than incrementing/decrementing string values, as I did before.

          It is now also possible to use a completely different encoding without changing the range splitting (e.g. to compare 7-bit chars to the 13/14-bit chars from NumberUtils).

          Uwe Schindler added a comment -

          Hi Yonik,

          I implemented the generic interface. I changed it a little bit (it now throws Exception, because otherwise building a range would require wrapping the IOExceptions of IndexReader in a RuntimeException, which is not so nice). There are now two copies of the same implementation, for 32 and 64 bit, and two interfaces (long, int).

          TrieRangeFilter is now an implementation using the LongRangeBuilder. The test I used is also there (just replace the files in contrib). TrieUtilsTest is currently not working; TrieRangeQuery is obsolete.

          I think I will create a helper that does the TermEnum/TermDocs iteration and reuse it in both LongTrieRangeFilter and IntTrieRangeFilter (maybe they both get the same superclass containing the methods usable by both).

          Yonik Seeley added a comment -

          Oh I see.... one difference between our code is that you start at the lowest shift and I start at the highest. Counting up has a nice effect of getting rid of the calculation of what shift to start at... I just had a harder time thinking about the recursion in that direction.

          Anyway, it's all looking great!

          Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4,
          0x300-0x4ff could be matched with 3-4 at a shift of 8, or 30-4f at a shift of 4, or 300-4ff at a shift of 0.

          Uwe Schindler added a comment -

          I spent the whole day preparing the final version, including all javadocs, but then I overwrote the wrong file and the whole of today's work (regarding trie) is gone. Give me one more day and I will redo everything. Since yesterday I have changed a lot in the trie code. Your code was a little bit better when the range bound of one precision sat exactly on the range's start or end (in this case that precision could be left out; in your code this is the booleans needUpper and needLower). I implemented this similarly.

          I also extended the interface a little bit, but this is work I have to redo, so it now takes longer. Most of the work is writing documentation and javadocs. If everything had worked OK (and I had not overwritten/updated svn in the wrong way), I would be finished now.

          Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4,
          0x300-0x4ff could be matched with 3-4 at a shift of 8, or 30-4f at a shift of 4, or 300-4ff at a shift of 0.

          Finding the most efficient precision is sometimes hard, but the optimization above with needUpper/needLower is really good in some cases (depends on the range). I will think about it.

          Uwe

          Uwe Schindler added a comment -

          Do we have test code that tests that the most efficient precision is used (as opposed to just the right bits matching)? I.e., for a precisionStep of 4,
          0x300-0x4ff could be matched with 3-4 at a shift of 8, or 30-4f at a shift of 4, or 300-4ff at a shift of 0.

          I misunderstood your note. My last optimizations (now restored again in my svn) handle exactly this. There are two cases:

          • if the left range of a precision has ((min & mask) == 0L) [it starts exactly at the beginning of the next lower-precision range], I leave out the left range for the current precision and directly use the next lower precision
          • if the right range of a precision has ((max & mask) == mask) [it ends exactly at the end of the next lower-precision range], I do the same for the right one (see the sketch below)
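          A small standalone sketch illustrating these two checks for Yonik's example range 0x300..0x4ff with precisionStep=4 (the variable names are illustrative, not the committed code):

            public class BoundaryCheckDemo {
              public static void main(String[] args) {
                long min = 0x300L, max = 0x4ffL;
                int precisionStep = 4;
                // check the two precision levels below the final one (shift 8)
                for (int shift = 0; shift < 8; shift += precisionStep) {
                  long mask = ((1L << precisionStep) - 1L) << shift; // bits handled at this level
                  boolean needLower = (min & mask) != 0L;   // false at shift 0 and 4 for 0x300
                  boolean needUpper = (max & mask) != mask; // false at shift 0 and 4 for 0x4ff
                  System.out.println("shift=" + shift + " needLower=" + needLower + " needUpper=" + needUpper);
                }
                // Neither boundary range is needed below shift 8, so the whole query collapses
                // to the single low-precision range 0x3..0x4 at shift 8.
              }
            }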

          My new code is now cleaner and easier to understand (there were some other unneeded extra shifts/ands in it). I also merged the 64-bit and 32-bit range splitting and wrap the RangeBuilder classes accordingly (now abstract, with two different range-collecting possibilities).

          The old code was not aware of this, sometimes leading to left/right precisions that are not needed. I will add test code in TestTrieUtils that tests the range split (without an index). The problem here is how to test this effectively. I could generate some examples and check the resulting range bounds using a custom XxxRangeBuilder in the test that collects the ranges into a List and compares this list. Do you have another possibility to test this without a lot of manually checked example ranges?

          I have now restored all my changes (and made an extra backup). I will now rewrite the new javadocs of TrieUtils and package.html. After that I will post a patch with the final API. The extra and new tests will be added later; first I want to fix and document the API.

          Yonik Seeley added a comment -

          I could generate some examples and test for the resulting range bounds using a custom XxxRangeBuilder in the test, that collects the ranges into a List and compares this list).

          +1

          I think that, in conjunction with some random testing, should be sufficient.

          Uwe Schindler added a comment -

          One of the trie filter tests also checks that the range is completely matched (when using an index with incrementing values). An additional test in TestTrieUtils can check that there are no overlapping ranges. This can easily be tested using an OpenBitSet in which a RangeBuilder sets the bits; if it hits a bit twice, the test fails. I am writing this down here so as not to forget it later, when writing this test.
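          A minimal sketch of such an overlap check (the callback shape is the hypothetical one sketched earlier; OpenBitSet is org.apache.lucene.util.OpenBitSet):

            // Sketch: every value covered by a reported sub-range sets a bit; a bit hit twice means overlap.
            final OpenBitSet covered = new OpenBitSet(0x1000); // sized for a small test value space (0..0xFFF)

            LongRangeBuilder overlapChecker = new LongRangeBuilder() {
              public void addRange(long min, long max, int shift) {
                for (long val = min; val <= max; val++) {
                  if (covered.get(val)) {
                    throw new AssertionError("value covered twice: " + val + " (shift=" + shift + ")");
                  }
                  covered.set(val);
                }
              }
            };
            // After splitting a test range with this builder, covered.cardinality() should equal the range size.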

          Uwe Schindler added a comment -

          This is the first patch for the revamp of TrieRange. It contains the complete API, a complete API documentation and some tests.

          Still missing is:

          • TestIntTrieRangeFilter
          • TestTrieUtils (the old one is still there but commented out)
          • Tests as described in previous comments (in TestTrieUtils): No range overlap, most optimized rangeSplit

          Yonik: Can you look over it and say whether this is what you would like to have for full flexibility and for Solr (which needs this full flexibility)?

          All others: Is something missing in the API (like shortcuts), any comments?

          If everything is OK, I will commit the patch after adding the important tests.

          Uwe Schindler added a comment -

          Typos and wrong English in the docs can be fixed after the commit; merging additional patches is simpler without all those small changes. The important things are the API and the functionality.

          Uwe

          Uwe Schindler added a comment -

          This is (hopefully) my last patch. It contains all tests and the final API (with some modifications).

          The split range test is a bit ugly, but it just tests whether the algorithm works exactly as it should (it currently accepts no other order of addRange() calls when splitting the range) - and it tests only one "reference" range.

          Please give some comments, Yonik!

          Yonik Seeley added a comment -

          Looking forward to the latest incarnation Uwe, but I'm traveling through the rest of the week... I'll definitely check it out at some point, but I liked the previous ones so you should go ahead and commit if you feel it's ready.

          Uwe Schindler added a comment -

          Thanks for the answer,

          My time is limited at the moment, too. I will commit Friday or the weekend!

          Uwe

          Uwe Schindler added a comment -

          New patch with:

          • some optimizations in splitRange (it now works even better and is more understandable again)
          • splitRange now also works correctly for precisionStep=1
          • new tests for splitRange (special cases are also tested, e.g. Long.MIN_VALUE..Long.MAX_VALUE can easily be done with only one range at the lowest precision)

          I will commit soon.

          Uwe Schindler added a comment -

          Committed revision #744207

          Uwe Schindler added a comment -

          I forgot: Thanks Yonik for the good ideas and discussions about API and help with coding this new trie implementation!

          Yonik Seeley added a comment -

          OK, I got a chance to further check things out and do some manual testing to ensure that the most efficient forms are always used. Everything looks good!

          Uwe Schindler added a comment - - edited

          Cool. So you like the new code? Are you happy with the API and how it works; is it good to use in Solr? If yes, that would be great and I am happy that it is of use.

          There seem to be no problems: I converted my own code to use the new TrieRange API and I like it. An 8-bit precisionStep works well for me; my index with 13 numeric trie fields and 600,000 docs works fine, with no performance differences, and queries run amazingly fast. Index size is almost identical.

          I hope to get some synthetic performance numbers in the next few days. Do you have some code for the performance contrib to check performance? (I am not so familiar with the performance code; I will check it out.)

          Uwe

          Ning Li added a comment -

          Good stuff!

          Is it worthwhile to also have an option that specifies the number of precisions with which a value is indexed?

          With a large precision step (say 8), a value is indexed in fewer terms (8) but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller but a value is indexed in more terms (32). With precision step 2 and the number of precisions set to 24, the number of terms for a range is still quite small but a value is indexed in 24 terms instead of 32. For applications that usually query small ranges, the number of precisions can be reduced further.

          We can provide more options to make things more flexible, but we probably want a balance of flexibility vs. the complexity of user options. Does this "number of precisions" option look like a good one?

          Uwe Schindler added a comment -

          Hi Ning,
          thanks for the suggestion. I was thinking about that, too. In general, one idea would be to use 32-bit integers or floats if you do not need that much accuracy; in that case the number of terms is reduced, too.
          But it may be a good option to specify that values are indexed with the highest possible precision and additionally with lower-precision values, too, where the precision step is dynamic, like:
          a) the precision step gets bigger for lower precisions
          b) after a precision of XX bits, no further lower precisions are generated and queried. This could be implemented by e.g. an array of precision step values that describes the splitting of the whole long/int into different precisions (like 2-2-2-2-8-8-8-8-8-16, so precise values use a 2-bit precision step, e.g. from shift 0 to 2, but from shift 48 to 64 a step value of 16 is used).

          Uwe

          Ning Li added a comment -

          Hi Uwe,

          I had something similar in mind when I said we can "make things more flexible". Do you think it'll be too complex for users to specify? On the other hand, this is for experts so let experts have all the flexibility. We can open a different JIRA issue if we decide to go for it.

          Uwe Schindler added a comment - - edited

          In principle it could work the same way as the field names in the API: an array of precision step values as a parameter where now only a single precisionStep is used. A shortcut would be the current API, which internally passes a one-item array with the single precisionStep (if the array is shorter, the same Math.min logic as for the field name array applies). The simple-user API has, like the current one, only one field name and one precision step; the full-featured API has an array of field names (one per step) and an array of precision step values.
          But the problem with all of this is that the API gets more and more complex, so the simple shortcuts should also be provided and should be the recommended way.

          Ning Li added a comment -

          Agree. Do you want to open a new issue? If you want, I can take a crack at it, but probably sometime next week.

          Uwe Schindler added a comment -

          A new issue would be good; can you open one? The idea for the patch is almost finished and I can attach a patch shortly. There are some minor things to solve and think about, but it's not a big thing.

          Uwe Schindler added a comment -

          Removed the splitRange recursion and replaced it with a simple loop. Committed rev #745533.

          Uwe Schindler added a comment -

          A change for a more universal RangeBuilder API (as preparation for LUCENE-1541). For completeness, this patch also includes the last commit that removed the recursion.

          Uwe Schindler added a comment -

          Committed revision 746790.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler