Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Solr's UnInvertedField lets you quickly look up all term ords for a
      given doc/field.

      Like FieldCache, it inverts the index to produce this and creates a
      RAM-resident data structure holding the bits; but, unlike FieldCache,
      it can handle multiple values per doc, and it does not hold the term
      bytes in RAM. Rather, it holds only term ords, and then uses
      TermsEnum to resolve ord -> term.

      This is great, e.g. for faceting, where you want to use int ords for
      all of your counting, and then only at the end resolve the "top N"
      ords to their text.

      I think this is a useful core functionality, and we should move most
      of it into Lucene's core. It's a good complement to FieldCache. For
      this first baby step, I just move it into core and refactor Solr's
      usage of it.
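      A minimal, self-contained sketch of the ord-based counting described
      above (the class and method names here are illustrative, not the
      patch's API — a real implementation would iterate ords via
      DocTermOrds and resolve text via TermsEnum):

```java
import java.util.*;

// Illustrative sketch: count facet hits purely on int ords,
// then resolve only the top-N ords back to term text at the end.
public class OrdFacetSketch {
    public static List<String> topTerms(int[][] ordsPerDoc, String[] ordToTerm, int topN) {
        int[] counts = new int[ordToTerm.length];
        for (int[] docOrds : ordsPerDoc) {    // one int[] of term ords per doc
            for (int ord : docOrds) {
                counts[ord]++;                // all counting is done on ints
            }
        }
        // Resolve only the top-N ords to text (stands in for a TermsEnum lookup)
        Integer[] ords = new Integer[counts.length];
        for (int i = 0; i < ords.length; i++) ords[i] = i;
        Arrays.sort(ords, (a, b) -> counts[b] - counts[a]);
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, ords.length); i++) {
            top.add(ordToTerm[ords[i]]);
        }
        return top;
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 2}, {2}, {1, 2}};  // multi-valued: several ords per doc
        String[] terms = {"apple", "banana", "cherry"};
        System.out.println(topTerms(docs, terms, 2)); // cherry is most frequent
    }
}
```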

      After this, as separate issues, I think there are some things we could
      explore/improve:

      • The first-pass that allocates lots of tiny byte[] looks like it
        could be inefficient. Maybe we could use the byte slices from the
        indexer for this...
      • We can improve the RAM efficiency of the TermIndex: if the codec
        supports ords, and we are operating on one segment, we should just
        use it. If not, we can use a more RAM-efficient data structure,
        eg an FST mapping to the ord.
      • We may be able to improve on the main byte[] representation by
        using packed ints instead of delta-vInt?
      • Eventually we should fold this ability into docvalues, ie we'd
        write the byte[] image at indexing time, and then loading would be
        fast, instead of uninverting
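      For reference, a sketch of the delta-vInt representation that the
      packed-ints bullet would replace (my reading of the scheme, not the
      actual patch code): each doc's sorted ords are stored as
      variable-length deltas, 7 bits per byte with a continuation bit.

```java
import java.io.ByteArrayOutputStream;

// Sketch of delta-vInt encoding: ords for a doc are sorted, and each
// is written as a variable-length int of its delta from the previous ord.
public class DeltaVInt {
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int ord : sortedOrds) {
            int delta = ord - last;
            last = ord;
            while ((delta & ~0x7F) != 0) {   // 7 bits per byte, high bit = "more"
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Three small deltas (3, 7, 190) fit in one byte each
        System.out.println(encode(new int[]{3, 10, 200}).length); // 3
    }
}
```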
      Attachments

      1. byte_size_32-bit-openjdk6.txt
        3 kB
        Mark Miller
      2. LUCENE-3003.patch
        88 kB
        Michael McCandless
      3. LUCENE-3003.patch
        77 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          Attached patch. I moved most of UnInvertedField into Lucene, as
          oal.index.DocTermOrds, but left the two faceting methods (getCounts,
          getStats) in UnInvertedField. UnInvertedField subclasses
          DocTermOrds.

          I added a simple "OrdIterator" API, for stepping through the ords for
          a doc + field (but, Solr's UnInvertedField still just directly uses
          the packed structures), and a Lucene test case that verifies this is
          working right (though I still need a few more test cases).

          All tests pass, but I have various small nocommits to work through.

          Jason Rutherglen added a comment -

          Eventually we should fold this ability into docvalues, ie we'd
          write the byte[] image at indexing time, and then loading would be
          fast, instead of uninverting

          I'd guess that pulsing should be 'good enough' most of the time? It seems like there'll be some overlap in terms of the gains from pulsing vis-à-vis DocValues?

          Michael McCandless added a comment -

          I'd guess that pulsing should be 'good enough' most of the time? It seems like there'll be some overlap in terms of the gains from pulsing vis-à-vis DocValues?

          I think Pulsing codec probably doesn't help much here?

          Ie Pulsing is good for terms that have only 1 or 2 docs.

          But for this case (faceting), usually, you have relatively few terms
          and many docs per term?

          Jason Rutherglen added a comment -

          Ie Pulsing is good for terms that have only 1 or 2 docs

          I thought the default is 16 docs? If there are more then seek'ing to the postings should be negligible (in comparison to a larger aggregate index size when using CSF/DocValues, which'll consume more of the system IO cache)?

          Chris Male added a comment -

          +1 to committing this change then tackling the improvements separately.

          Michael McCandless added a comment -

          New patch, fixing all the nocommits and adding a test case for a non-null prefix passed to DTO. I think it's ready to commit.

          Yonik Seeley added a comment - - edited

          I'd guess that pulsing should be 'good enough' most of the time?

          This already really pulses I think? If the bytes can fit in an int, they are "inlined" right in the pointer that would normally point out to the byte array.

          But for this case (faceting), usually, you have relatively few terms and many docs per term?

          We see everything. But this structure was more optimized for a high number of unique terms, but relatively few per document. This will perform well on a multi-valued author field, but relatively poorly on a large full-text field.

          Yonik Seeley added a comment -

          The first-pass that allocates lots of tiny byte[] looks like it could be inefficient. Maybe we could use the byte slices from the indexer for this...

          It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting).

          Another small & easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable. Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes.

          I'd also love to hear others thoughts on this memory optimization for many small byte arrays:

                        // We avoid a doubling strategy to lower memory usage.
                        // this faceting method isn't for docs with many terms.
                        // In hotspot, objects have 2 words of overhead, then fields, rounded up to a 64-bit boundary.
                        // TODO: figure out what array lengths we can round up to w/o actually using more memory
                        // (How much space does a byte[] take up? Is data preceded by a 32-bit length only?)
                        // It should be safe to round up to the nearest 32 bits in any case.
                        int newLen = (newend + 3) & 0xfffffffc;  // 4 byte alignment
          
          Dawid Weiss added a comment -

          For what it's worth, the instrumentation interface allows one to get exact allocation sizes of objects. I put together a small spike at https://github.com/dweiss/poligon/tree/master/instrumenter that measures the actual allocation size of byte[]. On my hotspot, 64-bit, this yields:

          byte[0] takes 24 bytes.
          byte[1] takes 32 bytes.
          byte[2] takes 32 bytes.
          byte[3] takes 32 bytes.
          byte[4] takes 32 bytes.
          byte[5] takes 32 bytes.
          byte[6] takes 32 bytes.
          byte[7] takes 32 bytes.
          byte[8] takes 32 bytes.
          byte[9] takes 40 bytes.
          byte[10] takes 40 bytes.
          byte[11] takes 40 bytes.
          ...
          

          IBM's VM yields the same (64-bit), but the version of jrockit that I have (which may be an old one, but is 64-bit!) yields:

          byte[0] takes 16 bytes.
          byte[1] takes 24 bytes.
          byte[2] takes 24 bytes.
          byte[3] takes 24 bytes.
          byte[4] takes 24 bytes.
          byte[5] takes 24 bytes.
          byte[6] takes 24 bytes.
          byte[7] takes 24 bytes.
          byte[8] takes 24 bytes.
          byte[9] takes 32 bytes.
          byte[10] takes 32 bytes.
          byte[11] takes 32 bytes.
          byte[12] takes 32 bytes.
          byte[13] takes 32 bytes.
          byte[14] takes 32 bytes.
          byte[15] takes 32 bytes.
          byte[16] takes 32 bytes.
          byte[17] takes 40 bytes.
          

          Don't have access to a 32-bit system right now, but if you're keen on checking, check out that GitHub repo and run:

          cd instrumenter
          mvn package
          java -javaagent:target/instrumenter-0.1.0-SNAPSHOT.jar -version
          
          Yonik Seeley added a comment -

          Thanks Dawid, this suggests that we could round up to the 8 byte boundary for free.
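          A minimal sketch of that rounding (my sketch, not committed code):
          widening the mask of the existing 4-byte-alignment trick to an
          8-byte boundary.

```java
// Sketch: round a requested array length up to an 8-byte boundary,
// which Dawid's 64-bit numbers suggest costs no extra heap on hotspot.
public class RoundUp {
    static int roundUpTo8(int newend) {
        return (newend + 7) & ~7;  // same trick as the 4-byte version, wider mask
    }

    public static void main(String[] args) {
        System.out.println(roundUpTo8(9));   // 16
        System.out.println(roundUpTo8(16));  // 16
    }
}
```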

          Mark Miller added a comment -

          Attached: 32-bit results

          Yonik Seeley added a comment -

          Attached: 32-bit results

          Ah, bummer. It's every 8 bytes, but with a 4 byte offset!
          I guess we could make it based on if we detect 32 vs 64 bit jvm... but maybe first see if anyone has any ideas about how to use something like pagedbytes instead.

          Michael McCandless added a comment -

          It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting).

          Lucene's indexer (TermsHashPerField) has precisely this same problem
          – every unique term must point to two (well, one if omitTFAP)
          growable byte arrays. We use "slices" into a single big (paged)
          byte[], where first slice is tiny and can only hold like 5 bytes, but
          then points to the next slice which is a bit bigger, etc.

          We could look @ refactoring that for this use too...

          Though this is "just" the one-time startup cost.
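          A rough sketch of the slice scheme described above; the level
          sizes mirror Lucene's ByteBlockPool, but the class here is
          illustrative only (it ignores the bytes each slice spends on the
          forwarding pointer to the next slice).

```java
// Rough sketch of the indexer's slice scheme: the first slice is tiny,
// and each upgrade moves to a larger level, so many growable lists can
// share one big paged byte[] without per-list arrays.
public class SliceSizes {
    static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};
    static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};

    // Total capacity after allocating n slices, starting at level 0
    static int capacityAfter(int slices) {
        int level = 0, total = LEVEL_SIZE[0];
        for (int i = 1; i < slices; i++) {
            level = NEXT_LEVEL[level];
            total += LEVEL_SIZE[level];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(capacityAfter(1)); // 5 bytes for a brand-new term
        System.out.println(capacityAfter(3)); // grows roughly geometrically
    }
}
```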

          Another small & easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable.

          I did make it configurable in the Lucene class (you can pass it to the
          ctor), but for Solr I left it using every 128th term.

          Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes.

          Both FieldCache and docvalues (branch) store an array-of-terms like
          this (the array of offsets is packed ints).

          We should also look at using an FST, which'd be the most compact but
          the ord -> term lookup cost goes up.

          Anyway I think we can pursue these cool ideas on new [future]
          issues...

          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.6 pruning: can we push this out to 4.0 (mark resolved?)


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              4
