Lucene - Core
LUCENE-2380

Add FieldCache.getTermBytes, to load term data as byte[]

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

      Description

      With flex, a term is now an opaque byte[] (typically a UTF-8 encoded Unicode string, but not necessarily), so we need to push this up the search stack.

      FieldCache now has getStrings and getStringIndex; we need corresponding methods to load terms as native byte[], since in general they may not be representable as String. This should also be quite a bit more RAM-efficient for US-ASCII content, since each character would then use 1 byte instead of 2.
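
      A rough sketch of what such a BytesRef-oriented accessor could look like (the getTerm and exists names follow the discussion below, but the exact signatures here are illustrative, not a committed API):

          // Hypothetical byte[]-oriented replacement for getStrings: terms are
          // exposed as BytesRef slices over shared byte[] blocks instead of
          // per-document String objects.
          public abstract class DocTerms {
            /** Fills 'ret' with the term bytes for 'docID' and returns it;
             *  an empty BytesRef means the doc had no value for this field. */
            public abstract BytesRef getTerm(int docID, BytesRef ret);

            /** True if 'docID' has a value for this field. */
            public abstract boolean exists(int docID);
          }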

      1. LUCENE-2380.patch
        92 kB
        Michael McCandless
      2. LUCENE-2380.patch
        114 kB
        Michael McCandless
      3. LUCENE-2380.patch
        129 kB
        Michael McCandless
      4. LUCENE-2380.patch
        133 kB
        Michael McCandless
      5. LUCENE-2380_enum.patch
        7 kB
        Yonik Seeley
      6. LUCENE-2380_enum.patch
        9 kB
        Yonik Seeley
      7. LUCENE-2380_direct_arr_access.patch
        5 kB
        Yonik Seeley

          Activity

          Yonik Seeley added a comment - edited

          It was really tricky performance testing this.

          If I started solr and tested one type of faceting exclusively, the performance impact of going through the new FieldCache interfaces (PackedInts for ord lookup) was relatively minimal.

          However, I had a simple script that tested the different variants (the 4 in the table above)... and using that resulted in the bigger slowdowns.

          The script would do the following:

          1) test 100 iterations of facet.method=fc on the 100,000 term field
          2) test 10 iterations of facet.method=fcs on the 100,000 term field
          3) test 100 iterations of facet.method=fc on the 100 term field
          4) test 10 iterations of facet.method=fcs on the 100 term field
          

          I would run the script a few times, making sure the numbers stabilized and were repeatable.

          Testing #1 alone resulted in trunk slowing down ~ 4%
          Testing #1 along with any single other test: same small slowdown of ~4%
          Running the complete script: slowdown of 33-38% for #1 (as well as others)
          When running the complete script, the first run of Test #1 was always the best... as if the JVM correctly specialized it, but then discarded it later, never to return.
          I saw the same effect on both an AMD Phenom II w/ Ubuntu, Java 1.6_14, and Win7 with a Core2, Java 1.6_17, both 64-bit. The drop on Win7 was only 20%, though.

          So: you can't always depend on the JVM being able to inline stuff for you, and it seems very hard to determine when it can.
          This obviously has implications for the lucene benchmarker too.

          Yonik Seeley added a comment -

          This patch adds the ability to get at the raw arrays from the Direct* classes, and using those fixes the performance regressions in the "fc" faceting I was seeing.

          To do this, it adds the following to DocTermsIndex. Anyone have a better solution?

              /** @lucene.internal */
              public abstract PackedInts.Reader getDocToOrd();
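
          A hypothetical use of this, counting facet hits by ord (the variable and size names are illustrative; only getDocToOrd comes from the patch):

              // Pull the raw doc->ord map once, outside the hot loop, so each
              // per-document lookup is a single PackedInts.Reader.get call
              // (or, via the Direct* classes, a plain array read).
              PackedInts.Reader docToOrd = termsIndex.getDocToOrd();
              int[] counts = new int[numOrds];        // one counter per term ord
              for (int doc = 0; doc < maxDoc; doc++) {
                counts[(int) docToOrd.get(doc)]++;    // tally this doc's ord
              }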
          
          Michael McCandless added a comment -

          The above commit was actually for LUCENE-2378.

          Yonik Seeley added a comment -
          terms in field   facet method   pre-bytes ms   trunk+patch ms   new/old
          100000           fc             27             36               1.33
          100000           fcs            333            325              0.98
          100              fc             20             22               1.10
          100              fcs            24             25               1.04

          OK - so the biggest problem area initially (bottlenecked by field cache merging) that was 55% slower is now 2% faster.

          Michael McCandless added a comment -

          Patch looks good, Yonik!

          Yonik Seeley added a comment -

          Here's an updated "terms enum over fieldcache" patch.
          PagedBytes now keeps track of how much space was used in each byte array and allows access to the raw blocks and end info. Slightly less elegant, but it works.

          I still need to do performance testing with this.

          Yonik Seeley added a comment -

          Here's a draft patch (it currently fails) of an enumerator over the field cache entry.

          Yonik Seeley added a comment -

          FYI, while trying to implement an iterator over the fieldcache terms, I ran into a bug where each term is written twice. This doubles the memory usage for the bytes (but causes no functionality bugs). I'll fix it shortly; anyone who has done performance tests might want to redo them (cache effects, GC differences, and bigger entry build times).

          Yonik Seeley added a comment -

          I just committed a patch that helps... when merging the fieldcaches, instead of looking up the term for each comparison, it's now stored in the segment data structure.

          Per-segment faceting is now 26% slower for the 100,000 term field, and 17% slower for the 100 term field.

          One way to regain more performance is to implement some kind of stateful iterator over the values in the field cache entry instead of looking up by ord each time.
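
          A minimal sketch of that stateful-iterator idea (the class is hypothetical; lookup mirrors the DocTermsIndex method discussed in this issue, and numOrd() is assumed):

              // Walk the ords in order, reusing one BytesRef, instead of doing a
              // random-access lookup per value.
              class TermValueIterator {
                private final DocTermsIndex index;
                private final BytesRef spare = new BytesRef();
                private int ord;                 // ord 0 is the "missing" sentinel

                TermValueIterator(DocTermsIndex index) { this.index = index; }

                /** Returns the next term, or null when the ords are exhausted. */
                BytesRef next() {
                  return ++ord < index.numOrd() ? index.lookup(ord, spare) : null;
                }
              }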

          Yonik Seeley added a comment -

          What do the numbers mean?

          Time to do the faceting (roughly). FieldCache build time is not included. Given that the degradation is much worse for a higher number of unique values, this points to the increased cost of going from ord->value.

          Michael McCandless added a comment -

          Hmmmm.

          Can you try adding ", true" to FieldCache.DEFAULT.getTermsIndex? That'll use more RAM but should be faster.
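
          That is, something like the following (the signature is assumed from this patch; the field name is illustrative):

              // The trailing 'true' is the faster-but-more-RAM flag from the patch.
              FieldCache.DocTermsIndex idx =
                  FieldCache.DEFAULT.getTermsIndex(reader, "facet_field", true);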

          Also, could the fix for executor have changed the performance?

          Uwe Schindler added a comment -

          What do the numbers mean? Time to build the cache, or time to sort something? That's unclear to me.

          Yonik Seeley added a comment -

          I did some performance testing on faceting using the field cache (single valued field with facet.method fc and fcs).

          field=100000 unique values
          fc: 5% slower
          fcs: 55% slower

          field=100 unique values
          fc: 2.5% slower
          fcs: 26% slower

          I'll look into it to see how we can regain some of that lost performance.

          Yonik Seeley added a comment -

          Whew, that's one involved patch!
          I didn't get to it before, but I'll start looking over the Solr changes now.

          Michael McCandless added a comment -

          I opened LUCENE-2483 for the future improvements.

          Michael McCandless added a comment -

          OK, I fixed up the patch. I think it's ready to commit, though it'd be
          great if someone could double-check my Solr changes:

          • Updated to trunk
          • Fixed bug in Solr's ByteUtils.java (it was not respecting the
            offset in the incoming BytesRef)
          • Added optional boolean "fasterButMoreRAM" option when loading
            field cache, defaults to true
          • For DocTermsIndex, I defined ord=0 to mean "unset", and made it
            the caller's responsibility to do something with the ord=0 case if
            an empty (length=0) BytesRef isn't acceptable (see the sketch after
            this list). Likewise, for DocTerms, I now directly return an empty
            BytesRef if the doc didn't have this field, but I also added an
            exists method to check explicitly if you need to.
          • Added a getTerm convenience method (calls getOrd then lookup, by
            default) to the terms index; renamed DocTerms.get -> getTerm for
            consistency
          • Fixed the nocommits and/or changed to TODOs
          • Small cleanups
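
          As referenced in the ord=0 bullet above, a hedged illustration of the caller-side convention (getOrd/lookup follow this thread; the handler methods are hypothetical):

              int ord = termsIndex.getOrd(doc);
              if (ord == 0) {
                // doc has no value for this field; an empty BytesRef would
                // otherwise come back from lookup
                handleMissing(doc);                    // hypothetical callback
              } else {
                BytesRef term = termsIndex.lookup(ord, spare);
                handleValue(doc, term);                // hypothetical callback
              }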

          I've also added a MIGRATE.txt that spells out more details on how an
          app can cutover to the new APIs.

          I think there are some other good things to do here, but as a future
          issue (this one's big enough!) – I'll open it:

          • For DocTermsIndex, make it optional whether the bytes data is
            loaded. EG for a single segment index (LUCENE-2335), or for
            sort-comparator apps that do not need the bytes data (eg because
            they use the terms dict to resolve ord -> term, and v/v).
          • Possibly merge DocTerms & DocTermsIndex. EG it's dangerous today
            if you load terms and then termsIndex because you're wasting tons
            of RAM; it'd be nicer if we could have a single cache entry that'd
            "upgrade" itself to be an index (have the ords).
          Michael McCandless added a comment -

          I did some rough estimates of RAM usage for StringIndex (trunk) vs
          TermIndex (patch).

          Java String is an object, so estimate an 8-byte object header in the
          JRE. From OpenJDK's sources, it has 3 int fields (offset, count,
          hashCode), plus a reference to a char[].

          The char[] has an 8-byte object header, an int length, and the actual
          array data.

          So in trunk's StringIndex:

          per-unique-term: 40 bytes (48 on 64bit jre) + 2*length-of-string-in-UTF16
          per-doc: 4 bytes (8 bytes on 64 bit)

          In the patch:

          per-unique-term: ceil(log2(totalUTF8BytesTermData)) bits + utf8 bytes + 1 or 2 bytes (vInt, for term length)
          per-doc: ceil(log2(numUniqueTerm)) bits

          So eg say you have an English title field, avg length 40 chars, and
          assume always unique. On a 5M doc index, trunk would take ~591MB and
          patch would take ~226 MB (32bit JRE) = 62% less.

          But if you have a CJK title field, avg 10 chars (may be highish), it's
          less savings because UTF8 takes 50% more RAM than UTF16 does for CJK
          (and others). Trunk would take ~305MB and patch ~178MB (32bit JRE) =
          42% less.

          Also don't forget the GC load of having 5M String & char[] objects...
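
          The ~591MB and ~226MB figures follow directly from those formulas; a small self-contained check (the inputs are the rough per-object costs above, not measurements):

              public class RamEstimate {
                public static void main(String[] args) {
                  double docs = 5_000_000;  // 5M docs, all terms unique
                  int chars = 40;           // avg English title length
                  // trunk StringIndex: ~40 bytes String/char[] overhead plus
                  // UTF-16 chars per term, plus a 4-byte ord per doc (32-bit JRE)
                  double trunk = docs * (40 + 2 * chars + 4);
                  // patch: UTF-8 bytes + ~1-byte vInt length per term, plus a
                  // packed-ints offset and ord (per doc == per term here,
                  // since every value is unique)
                  double totalUtf8 = docs * (chars + 1);
                  double offsetBits = Math.ceil(Math.log(totalUtf8) / Math.log(2)); // 28
                  double ordBits = Math.ceil(Math.log(docs) / Math.log(2));         // 23
                  double patch = docs * (chars + 1 + offsetBits / 8 + ordBits / 8);
                  System.out.printf("trunk ~%.0f MB, patch ~%.0f MB%n",
                      trunk / (1 << 20), patch / (1 << 20));  // ~591 MB vs ~226 MB
                }
              }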

          Michael McCandless added a comment -

          OK I ran some sort perf tests. I picked the worst case – trivial
          query (TermQuery) matching all docs, sorting by either a highly unique
          string field (random string) or enumerated field (country ~ a couple
          hundred values), from benchmark's SortableSingleDocSource.

          Index has 5M docs. Each run is best of 3.

          Results:

          Sort      Trunk QPS   Patch QPS   Change %
          random    7.75        5.64        -27.2%
          country   8.05        7.62        -5.3%

          So.... the packed ints lookups are more costly than trunk today (but,
          at a large reduction in RAM used).

          Then I tried another test, asking packed ints to upgrade to an array
          of the nearest native type (ie byte[], short[], int[], long[]) for the
          doc -> ord map. This is faster since lookups don't require
          shift/mask, but, wastes some space since you have unused bits:

          Sort      Trunk QPS   Patch QPS   Change %
          random    7.75        7.89        +1.8%
          country   8.05        7.64        -5.1%

          The country case didn't get any better (noise) because it happened to
          already be using 8 bits (byte[]) for doc->ord map.
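
          The "upgrade to nearest native type" step is just rounding the required bits-per-value up to a native width; a sketch of that decision, independent of the PackedInts API details:

              // Round bits-per-value up so the doc->ord lookup becomes a plain
              // array read (no shift/mask), at the cost of some unused bits.
              static int nearestNativeBits(int bitsPerValue) {
                if (bitsPerValue <= 8)  return 8;   // byte[]
                if (bitsPerValue <= 16) return 16;  // short[]
                if (bitsPerValue <= 32) return 32;  // int[]
                return 64;                          // long[]
              }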

          Remember this is a worst-case test – if your query matches fewer
          results than your entire index, or your query is more costly to
          evaluate than the simple single TermQuery, this FieldCache lookup cost
          will be relatively smaller.

          So... I think we should expose in the new FieldCache methods an
          optional param to control time/space tradeoff; I'll add this,
          defaulting to upgrading to nearest native type. I think the 5.3%
          slowdown on the country field is acceptable given the large reduction
          in RAM used...

          Michael McCandless added a comment -

          New patch – now all tests pass. Getting closer... but I still have to perf test...

          Michael McCandless added a comment -

          New iteration attached.

          I got Solr mostly cutover, at least for the immediate usage of FieldCache.getStrings/getStringIndex.

          However one Solr test (TestDistributedSearch) is still failing...

          Yonik Seeley added a comment -

          would love to return empty string (not null) if ord 0 comes in, and require caller to specifically handle ord 0 if they need to differentiate... I had started down that path but got spooked by it

          Yeah... I guess I could see how it could cause a loss of info if you go through a few layers and you only have a BytesRef w/o an ord to tell you the value was missing.

          Michael McCandless added a comment -

          Ahhh, fun stuff! I'm packing for Prague though - prob won't be able to look at this for a week.

          OK no prob... have fun!

          1 or 2? a max len of 2**15? (I know... a term bigger than 32K would be horrible, but so are limits that aren't necessary)

          Indexer already has this limit, during indexing (these large terms are skipped).

          re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): do we need to do this? We could record 0 as zero length in the FieldCache and hence avoid the special-case code. We could require the user to check for 0 if they care to know the difference between zero length and missing.

          I would love to return an empty string (not null) if ord 0 comes in, and require the caller to specifically handle ord 0 if they need to differentiate... I had started down that path but got spooked by it. I think we can revisit it, but maybe separately.

          Yonik Seeley added a comment -

          Ahhh, fun stuff! I'm packing for Prague though - prob won't be able to look at this for a week.

          1 or 2 byte vInt prefix

          1 or 2? a max len of 2**15? (I know... a term bigger than 32K would be horrible, but so are limits that aren't necessary). We could also do 1 or 4 (or 1 or 5), but as long as we make sure the single-byte case is optimized, it shouldn't matter.

          re: returning null if an ord of 0 is passed to get(int ord, BytesRef ret): do we need to do this? We could record 0 as zero length in the FieldCache and hence avoid the special-case code. We could require the user to check for 0 if they care to know the difference between zero length and missing.

          Michael McCandless added a comment -

          Very rough first cut patch attached.

          I removed getStrings and replaced it with getTerms (returns a BytesRef-oriented getter API for looking up a doc's term).

          And I removed getStringsIndex and replaced it with getTermsIndex.

          All lucene tests pass with this hard cutover, but I still need to measure perf hit: I'm using packed ints currently to hold the offsets, and 1 or 2 byte vInt prefix (Yonik's idea) to encode the term's length.
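
          For reference, one way a 1-or-2-byte length prefix can work (a sketch consistent with the 2**15 limit discussed elsewhere in this thread, not necessarily the patch's exact bit layout):

              // Term length as 1 byte (< 128) or 2 bytes with the high bit set,
              // capping terms at 2**15-1 bytes; returns the new write position.
              static int writeLength(byte[] block, int pos, int length) {
                if (length < 0x80) {
                  block[pos] = (byte) length;
                  return pos + 1;
                } else {
                  block[pos] = (byte) (0x80 | (length >>> 8));
                  block[pos + 1] = (byte) length;
                  return pos + 2;
                }
              }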

          I started to cut over Solr as well, and got some things cut over, but decided I should stop and check whether these changes make sense. So the Solr side of the patch does not yet compile.

          I specifically stopped when I got to StringIndexDocValues (an abstract base class for lots of others) – I'd like to do a hard cutover of this class to use BytesRef, but does anyone see a problem w/ that? I'm a little nervous about how far "down" I'll end up having to push the BytesRef... It is marked as "Internal class, subject to change".

          I also dropped lots of nocommits along the way on the Solr side... so if someone could have a look and make suggestions, that'd help.

          Michael McCandless added a comment -

          I agree, let's pass the reused BytesRef in.

          Yonik Seeley added a comment -

          One thing to keep in mind is that the current way of returning shared BytesRef objects often forces one to make a copy. We should perhaps consider allowing a BytesRef to be passed in.

          // returning a shared BytesRef forces a copy
          for (;;) {
            BytesRef val1 = new BytesRef(getValue(doc1)); // make a copy
            BytesRef val2 = getValue(doc2);
            int cmp = val1.compareTo(val2);
          }

          // allowing a BytesRef to be passed in means no copy
          BytesRef val1 = new BytesRef();
          BytesRef val2 = new BytesRef();
          for (;;) {
            getValue(doc1, val1);
            getValue(doc2, val2);
            int cmp = val1.compareTo(val2);
          }
          Toke Eskildsen added a comment -

          Working on LUCENE-2369, I essentially had to re-implement the FieldCache because of the hardwiring of arrays. Switching to accessor methods seems like the right direction to go.

          Uwe Schindler added a comment -

          This goes again in the direction of not having arrays in FieldCache anymore, but instead having accessor methods that take a docid and give back the data (possibly as a reference). So getBytes(docid) returns a reused BytesRef with the offset and length of the requested term. For native types we should also move away from arrays and only provide accessor methods. Java is fast and possibly inlines the method call. So for native types we could also use a FloatBuffer or ByteBuffer or whatever from java.nio.
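
          For the native-type case, a minimal sketch of such an accessor backed by java.nio (illustrative only, not a patch API):

              // Hide the storage behind an accessor; here a FloatBuffer, but it
              // could equally wrap a float[] or a packed structure.
              class FloatValues {
                private final java.nio.FloatBuffer values;  // one value per docID
                FloatValues(java.nio.FloatBuffer values) { this.values = values; }
                float get(int docID) { return values.get(docID); }  // absolute get
              }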

          Yonik Seeley added a comment -

          We could also do shared byte[] blocks (private), with a public method to retrieve the BytesRef for a given doc?

          Absolutely! Now that we are in control, it would be a crime not to share the byte[].
          Seems like one should pass in a BytesRef to be filled in... that would be most efficient for people doing simple stuff like comparing docid1 to docid2. Returning a reused BytesRef could also work (as TermsEnum does), but it's less efficient for anything needing a state of more than 1 BytesRef, since it then requires copying.

          We can further save space by putting the length as a vInt in the byte[] - most would be a single byte.
          Then we just need an int[] as an index into the byte[]... or potentially packed ints.

          We'll also need an implementation that can span multiple byte[]s for larger than 2GB support. The correct byte[] to look into is then simply a function of the docid (as is done in Solr's UnInvertedField).

          We could possibly play games with the offsets into the byte[] too - encode as a delta against the average instead of an absolute offset. So offset = average_size * ord + get_delta(ord)
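
          A sketch of the multi-block addressing (power-of-two blocks make the target byte[] a cheap function of the logical offset; the block size is illustrative):

              // With power-of-two blocks, more than 2GB of term data can be
              // addressed with a single long offset.
              static final int BLOCK_BITS = 15;                 // 32KB blocks
              static final int BLOCK_MASK = (1 << BLOCK_BITS) - 1;

              static byte byteAt(byte[][] blocks, long offset) {
                return blocks[(int) (offset >>> BLOCK_BITS)][(int) (offset & BLOCK_MASK)];
              }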

          Michael McCandless added a comment -

          We could also do shared byte[] blocks (private), with a public method to retrieve the BytesRef for a given doc? Standard codec's terms index does this – we could share it I think.

          A new byte[] per doc adds a lot of RAM overhead and GC load. (Of course, so does the String solution we use today, so it'd at least be no worse...)

          Uwe Schindler added a comment -

          The structure should look like String and StringIndex, but I am not sure if we need real BytesRefs. In my opinion, it should be an array of byte[], where each byte[] is allocated with the term size from the enum's BytesRef and copied over. This is no problem, as the terms need to be copied either way, since the BytesRef from the enum is reused. The only problem is that byte[] is missing the cool BytesRef methods like utf8ToString() that may be needed by consumers.

          getStrings and getStringIndex should be deprecated. We cannot emulate them using BytesRef.utf8ToString, as the String[] arrays are raw and allow no wrapping. If FieldCache used accessor methods and not raw arrays, we would not have that problem...


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 0