Lucene - Core

LUCENE-2329: Use parallel arrays instead of PostingList objects

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.

      In order to avoid having very many long-living PostingList objects in TermsHashPerField we want to switch to parallel arrays. The termsHash will simply be an int[] that maps each term to a dense termID.

      All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This avoids the need for object pooling and removes the overhead of object initialization and garbage collection. Garbage collection in particular should benefit significantly when the JVM runs low on memory, because in that situation GC mark times can get very long if a large number of long-living objects is in memory.

      Another benefit could be to build more efficient TermVectors. We could avoid having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. That improvement, though, can be made in a separate JIRA issue.
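
      For illustration, here is a minimal sketch of the data-layout change (the class and
      field names below are invented for this sketch and are not the actual Lucene classes):

        // Before: one long-living heap object per unique term.
        class PostingData {
          int freq;
          int lastDocID;
          int lastPosition;
        }

        // After: one set of parallel arrays per field, indexed by the dense termID
        // that the terms hash assigns. No per-term object allocation or pooling.
        class ParallelPostingData {
          int[] freqs;
          int[] lastDocIDs;
          int[] lastPositions;

          ParallelPostingData(int size) {
            freqs = new int[size];
            lastDocIDs = new int[size];
            lastPositions = new int[size];
          }

          void addOccurrence(int termID, int docID, int position) {
            // all per-term state lives at index termID
            freqs[termID]++;
            lastDocIDs[termID] = docID;
            lastPositions[termID] = position;
          }
        }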

      Attachments

      1. lucene-2329-2.patch (11 kB) – Michael Busch
      2. LUCENE-2329.patch (11 kB) – Michael McCandless
      3. LUCENE-2329.patch (21 kB) – Michael McCandless
      4. LUCENE-2329.patch (9 kB) – Michael McCandless
      5. LUCENE-2329.patch (9 kB) – Michael McCandless
      6. lucene-2329.patch (48 kB) – Michael Busch
      7. lucene-2329.patch (49 kB) – Michael Busch
      8. lucene-2329.patch (48 kB) – Michael Busch

        Activity

        Michael McCandless added a comment -

        This would be great!

        But, note that term vectors today do not store the term char[] again – they piggyback on the term char[] already stored for the postings. Though, I believe they store "int textStart" (increments by term length per unique term), which is less compact than the termID would be (increments by +1 per unique term), so if, e.g., we someday use packed ints we'd be more RAM-efficient by storing termIDs...

        Andrzej Bialecki added a comment -

        Slightly off-topic ... Having a facility to obtain termIDs per segment (or better yet per index) would greatly benefit Solr's UnInverted field creation, which currently needs to assign term IDs by linear scanning.

        Michael McCandless added a comment -

        This issue is just about how IndexWriter's RAM buffer stores its terms...

        But, the flex API adds long ord() and seek(long ord) to the TermsEnum API.

        Michael Busch added a comment -

        This issue is just about how IndexWriter's RAM buffer stores its terms...

        Actually, when I talked about the TermVectors I meant we should explore storing the termIDs on disk, rather than the strings. It would help things like similarity search and facet counting.

        But, note that term vectors today do not store the term char[] again - they piggyback on the term char[] already stored for the postings.

        Yeah I think I'm familiar with that part (secondary entry point in TermsHashPerField, hashes based on termStart). Haven't looked much into how the "rest" of the TermVector in-memory data structures are working.

        Though, I believe they store "int textStart" (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term)

        Actually we wouldn't need a second hashtable for the secondary TermsHash anymore, right? It would, like the primary TermsHash, just have a parallel array with the fields that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID, of course.

        This would be a nice simplification, because no hash collisions, no hash table resizing based on load factor, etc. would be necessary for non-primary TermsHashes?

        so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs...

        How does the read performance of packed ints compare to "normal" int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance? Oh man, will we need flexible indexing for the in-memory index too?

        Michael McCandless added a comment -

        Actually, when I talked about the TermVectors I meant we should explore to store the termIDs on disk, rather than the strings. It would help things like similarity search and facet counting.

        Ahhhh that would be great!

        Actually we wouldn't need a second hashtable for the secondary TermsHash anymore, right? It would just have like the primary TermsHash a parallel array with the things that the TermVectorsTermsWriter.Postinglist class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID of course.

        Hmm the challenge is that the tracking done for term vectors is just within a single doc. Ie the hash used for term vectors only holds the terms for that one doc (so it's much smaller), vs the primary hash that holds terms for all docs in the current RAM buffer. So we'd be burning up much more RAM if we also key into the term vector's parallel arrays using the primary term id?

        But I do think we should cut over to parallel arrays for TVTW, too.

        How does the read performance of packed ints compare to "normal" int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance?

        It's definitely slower to read/write to/from packed ints, and I agree, indexing and searching speed trumps RAM efficiency.

        Oh man, will we need flexible indexing for the in-memory index too?

        EG custom attrs appearing in the TokenStream? Yes we will need to... but hopefully once we get serialization working cleanly for the attrs this'll be easy? With ByteSliceWriter/Reader you just .writeBytes and .readBytes...

        I don't think we should allow Codecs to be used in the RAM buffer anytime soon though...

        Michael Busch added a comment -

        Hmm the challenge is that the tracking done for term vectors is just within a single doc.

        Duh! Of course you're right.

        Michael Busch added a comment -

        Changes the indexer to use parallel arrays instead of PostingList objects (for both FreqProx* and TermVectors*).

        All core & contrib & bw tests pass. I haven't done performance tests yet.

        I'm wondering how to manage the size of the parallel array. I started with an initial size equal to the size of the postingsHash array. When it's full, I allocate a new one at 1.5x the size. When shrinkHash() is called I also shrink the parallel array to the same size as postingsHash. How does that sound?
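
        Purely as an illustration of that sizing policy, a rough sketch (invented names, not
        the actual TermsHashPerField code):

            // start the parallel array at the postingsHash size; grow by 1.5x when full;
            // shrink back to the postingsHash size when shrinkHash() runs
            class SizingSketch {
              int[] termFreqs;   // stands in for one of the parallel arrays

              SizingSketch(int postingsHashSize) {
                termFreqs = new int[postingsHashSize];
              }

              void ensureCapacity(int termID) {
                if (termID >= termFreqs.length) {
                  int[] grown = new int[(int) (termFreqs.length * 1.5)];
                  System.arraycopy(termFreqs, 0, grown, 0, termFreqs.length);
                  termFreqs = grown;
                }
              }

              void shrink(int postingsHashSize) {
                if (termFreqs.length > postingsHashSize) {
                  termFreqs = new int[postingsHashSize];
                }
              }
            }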

        Michael Busch added a comment -

        One change I should make to the patch is how memory consumption is tracked. When the parallel array is allocated or grown, bytesAllocated() should be called? And bytesUsed() should only be called when a new termID is added?
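
        A small sketch of that accounting, assuming bytesAllocated()/bytesUsed() are the
        DocumentsWriter hooks being referred to (names and wiring here are assumptions):

            class AccountingSketch {
              static final int BYTES_PER_POSTING = 3 * 4;  // three parallel int arrays

              void onGrow(int oldCapacity, int newCapacity) {
                // report newly reserved memory when the arrays are allocated or grown
                bytesAllocated((long) (newCapacity - oldCapacity) * BYTES_PER_POSTING);
              }

              void onNewTerm(int termID) {
                // report a slot as used only when a new termID first occupies it
                bytesUsed(BYTES_PER_POSTING);
              }

              void bytesAllocated(long numBytes) { /* forward to the RAM tracker */ }
              void bytesUsed(long numBytes)      { /* forward to the RAM tracker */ }
            }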

        Michael Busch added a comment -

        Made the memory tracking changes as described in my previous comment.

        All tests still pass.

        Michael McCandless added a comment -

        Looks great Michael!

        I think *ParallelPostingsArray.reset do not need to zero-fill the arrays – these are overwritten when that termID is first used, right?

        Michael Busch added a comment -

        I think *ParallelPostingsArray.reset do not need to zero-fill the arrays - these are overwritten when that termID is first used, right?

        Good point! I'll remove the reset() methods.

        Michael Busch added a comment -

        Removed reset(). All tests still pass.

        Michael Busch added a comment - edited

        I did some performance experiments:

        I indexed 1M wikipedia documents using the cheap WhiteSpaceAnalyzer, no cfs files, disabled any merging, RAM buffer size = 200MB, single writer thread, TermVectors enabled.

        Test machine: MacBook Pro, 2.53 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, MacOS X 10.5.8.

        Results with -Xmx2000m:

                                Write performance    Gain      Number of segments
        trunk                   833 docs/sec                   41
        trunk + parallel arrays 869 docs/sec         +4.3%     32

        Results with -Xmx256m:

                                Write performance    Gain      Number of segments
        trunk                   467 docs/sec                   41
        trunk + parallel arrays 871 docs/sec         +86.5%    32

        So I think these results are interesting and roughly as expected. 4.3% is a nice small performance gain.
        But running the tests with a low heap shows how much cheaper the garbage collection becomes. Setting IW's RAM buffer to 200MB and the overall heap to 256MB forces the gc to run frequently. The mark times are much more costly if we have all long-living PostingList objects in memory compared to parallel arrays.

        So this is probably not a huge deal for "normal" indexing. But once we can search on the RAM buffer it becomes much more attractive to fill up the RAM as much as you can. And exactly in that case we save a lot with this improvement.

        Also note that the number of segments decreased by 22% (from 41 to 32). This shows that the parallel-array approach needs less RAM, thus flushes less often and will cause fewer segment merges in the long run. So a longer test with actual segment merges would show even bigger gains (with both big and small heaps).

        So overall, I'm very happy with these results!

        Michael McCandless added a comment -

        Sweet, this looks great Michael! Less RAM used and faster indexing (much less GC load) – win/win.

        It's a little surprising that the segment count dropped from 41 -> 32? Ie, how much less RAM do the parallel arrays take? They save the object header per-unique-term, and 4 bytes on 64bit JREs since the "pointer" is now an int and not a real pointer? But, other things use RAM (the docIDs in the postings themselves, norms, etc.) so it's surprising the savings was so much that you get 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting?

        Michael Busch added a comment -

        so it's surprising the savings was so much that you get 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting?

        Yeah it seems a bit suspicious. I'll investigate. But, keep in mind that TermVectors were enabled too. And the number of "unique terms" in the 2nd TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash in each round that sum should be higher than numPostings from the first TermsHash.

        Michael McCandless added a comment -

        But, keep in mind that TermVectors were enabled too.

        OK, but, RAM used by TermVectors* shouldn't participate in the accounting... ie it only holds RAM for the one doc, at a time.

        And the number of "unique terms" in the 2nd TermsHash is higher, i.e. if you summed up numPostings from the 2nd TermsHash in each round that sum should be higher than numPostings from the first TermsHash.

        1st TermsHash = current trunk and 2nd TermsHash = this patch? Ie, it has more unique terms at flush time (because it's more RAM efficient)? If so, then yes, I agree. But 22% fewer still seems too good to be true...

        Michael Busch added a comment -

        OK, but, RAM used by TermVectors* shouldn't participate in the accounting... ie it only holds RAM for the one doc, at a time.

        Man, my brain is lacking the TermVector synapses...

        Michael Busch added a comment - edited

        They save the object header per-unique-term, and 4 bytes on 64bit JREs since the "pointer" is now an int and not a real pointer?

        On 64-bit JVMs (which I used for my tests) we actually save 28 bytes per unique term:

        Trunk:

            // Why + 4*POINTER_NUM_BYTE below?
            //   +1: Posting is referenced by postingsFreeList array
            //   +3: Posting is referenced by hash, which
            //       targets 25-50% fill factor; approximate this
            //       as 3X # pointers
            bytesPerPosting = consumer.bytesPerPosting() + 4*DocumentsWriter.POINTER_NUM_BYTE;
        
        ...
        
          @Override
          int bytesPerPosting() {
            return RawPostingList.BYTES_SIZE + 4 * DocumentsWriter.INT_NUM_BYTE;
          }
        
        ...
        abstract class RawPostingList {
          final static int BYTES_SIZE = DocumentsWriter.OBJECT_HEADER_BYTES + 3*DocumentsWriter.INT_NUM_BYTE;
        
        ...
        
          // Coarse estimates used to measure RAM usage of buffered deletes
          final static int OBJECT_HEADER_BYTES = 8;
          final static int POINTER_NUM_BYTE = Constants.JRE_IS_64BIT ? 8 : 4;
        

        This needs 8 bytes + 3 * 4 bytes + 4 * 4 bytes + 4 * 8 bytes = 68 bytes.

        2329:

            //   +3: Posting is referenced by hash, which
            //       targets 25-50% fill factor; approximate this
            //       as 3X # pointers
            bytesPerPosting = consumer.bytesPerPosting() + 3*DocumentsWriter.INT_NUM_BYTE;
        
        ...
        
          @Override
          int bytesPerPosting() {
            return ParallelPostingsArray.BYTES_PER_POSTING + 4 * DocumentsWriter.INT_NUM_BYTE;
          }
        
        ...
        
        final static int BYTES_PER_POSTING = 3 * DocumentsWriter.INT_NUM_BYTE;
        

        This needs 3 * 4 bytes + 4 * 4 bytes + 3 * 4 bytes = 40 bytes.

        I checked how many bytes were allocated for postings when the first segment was flushed. Trunk flushed after 6400 docs and had 103MB allocated for PostingList objects. 2329 flushed after 8279 docs and had 94MB allocated for the parallel arrays, and 74MB out of the 94MB were actually used.

        The first docs in the wikipedia dataset seem pretty large with many unique terms.

        I think this sounds reasonable?

        Michael McCandless added a comment -

        OK, indeed it does sound reasonable! Sweet, I think you should commit it! Make sure you "svn switch" your checkout first, and pass Solr's tests!

        Michael Busch added a comment -

        Cool, will do! Thanks for the review and good questions... and the whole idea!

        Michael Busch added a comment -

        Committed revision 926791.

        Michael McCandless added a comment -

        I think we need to fix how RAM is managed for this... right now, if
        you turn on IW's infoStream you'll see a zillion prints where IW tries
        to balance RAM (it "runs hot"), but, nothing can be freed. We do this
        per-doc, after the parallel arrays resize themselves to be, net/net, over
        our allowed RAM buffer.

        A few ideas on how we can fix:

        • I think we have to change when we flush. It's now based on RAM
          used (not alloc'd), but I think we should switch it to use RAM
          alloc'd after we've freed all we can. Ie if we free things up and
          we've still alloc'd over the limit, we flush. This'll fix the
          running hot we now see...
        • TermsHash.freeRAM is now a no-op right? We have to fix this to
          actually free something when it can because you can imagine
          indexing docs that are postings heavy but then switching to docs
          that are byte[] block heavy. On that switch you have to balance
          the allocations (ie, shrink your postings). I think we should
          walk the threads/fields and use ArrayUtil.shrink to shrink down,
          but, don't shrink by much at a time (to avoid running hot) – IW
          will invoke this method again if more shrinkage is needed.
        • Also, shouldn't we use ArrayUtil.grow to increase? Instead of
          always a 50% growth? Because with such a large growth you can
          easily have horrible RAM efficiency... ie that 50% growth can
          suddenly put you over the limit and then you flush, having
          effectively used only half of the allowed RAM buffer in the worst
          case.
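
        For illustration, a sketch contrasting the fixed 50% growth with growth based on
        ArrayUtil.oversize(int minTargetSize, int bytesPerElement), which rounds the
        requested minimum size up only modestly (the helper class here is invented):

            import org.apache.lucene.util.ArrayUtil;

            class GrowthSketch {
              // fixed-ratio growth: a single step can overshoot the RAM budget badly
              static int[] growByHalf(int[] old) {
                int[] bigger = new int[(int) (old.length * 1.5)];
                System.arraycopy(old, 0, bigger, 0, old.length);
                return bigger;
              }

              // oversize-based growth: grow to just past what is actually needed,
              // so one growth step is far less likely to blow through the limit
              static int[] growTo(int[] old, int minSize) {
                if (old.length >= minSize) {
                  return old;
                }
                int[] bigger = new int[ArrayUtil.oversize(minSize, 4 /* bytes per int */)];
                System.arraycopy(old, 0, bigger, 0, old.length);
                return bigger;
              }
            }
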
        Michael McCandless added a comment -

        Reopening to fix the RAM balancing problems...

        Michael Busch added a comment -

        Good catch!

        Thanks for the thorough explanation and suggestions. I think it all makes sense. Will work on a patch.

        Michael Busch added a comment -

        This patch:

        • Changes DocumentsWriter to trigger the flush using bytesAllocated instead of bytesUsed to improve the "running hot" issue Mike's seeing
        • Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()

        In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in freeRAM(), but that involves tricky thread-safety changes, because one thread could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across all thread states, while other threads keep indexing.

        We decided to leave it the way it currently works: we discard the whole parallel array during flush and don't reuse it. This is not as optimal as it could be, but once LUCENE-2324 is done this won't be an issue anymore anyway.

        Note that this new patch is against the flex branch: I thought we'd switch it over soon anyway? I can also create a patch for trunk if that's preferred.

        Michael McCandless added a comment -

        Fixed a couple of issues on the last patch:

        • We weren't notifying DW that we freed up RAM when setting postingsArray to null
        • We can shrink postingsArray separately from the termsHash, and, instead of nulling it, we can simply downsize it

        Tests pass, and indexing perf on first 10M 1 KB sized wikipedia docs is a bit faster though probably in the noise (1296 sec on current flex head vs 1238 sec with this patch).

        Michael McCandless added a comment -

        New patch attached:

        • Does away with separate tracking of used vs alloc, in IndexWriter.
          This distinction added much complexity and only saved a small
          number of free/alloc's per flush cycle, especially now that
          postings realloc only in big chunks (parallel arrays).
        • Fixed some over-counting of bytes used.

        The indexing throughput is basically unchanged after this (on first
        10M 1KB Wikipedia docs), so I think this is a good simplification.

        Michael Busch added a comment -

        Looks great! I like the removal of bytesAlloc - nice simplification.

        Michael McCandless added a comment -

        Thanks Michael... I'll commit shortly. It's a good simplification.

        Michael Busch added a comment -

        Thanks! I think we can resolve this now?

        Michael McCandless added a comment -

        Woops, right!

        Michael McCandless added a comment -

        Reopening – this fix causes an intermittent deadlock in
        TestStressIndexing2.

        It's actually a pre-existing issue, whereby if a flush happens only
        because of deletions (ie no indexed docs), and you're using multiple
        threads, it's possible some idled threads would fail to be notified
        to wake up and continue indexing once the flush completes.

        The fix here increased the chance of hitting that bug because the RAM
        accounting has a bug whereby it overly-aggressively flushes because of
        deletions, ie, rather than free up RAM allocated but not used for
        indexing, it flushes.

        I first fixed the deadlock case (need to clear DW's flushPending when
        we only flush deletes).

        Then I fixed the shared deletes/indexing RAM by:

        • Not reusing the RAM for postings arrays – we now null this out
          for every field after flushing
        • Calling balanceRAM when deletes have filled up RAM before deciding
          to flush, because this can free RAM up, making more space for
          deletes.

        I also further simplified things – no more separate call to
        doBalanceRAM, and added a fun unit test that randomly alternates
        between pure indexing and pure deleting, asserting that the flushing
        doesn't "run hot" on any of those transitions.
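
        A rough, hedged sketch of that ordering (all names here are invented; this is not
        the actual DocumentsWriter code):

            class DeleteFlushSketch {
              long ramBufferSizeBytes = 200L * 1024 * 1024;
              long deletesBytesUsed;
              long indexingBytesAllocated;

              boolean overBudget() {
                return deletesBytesUsed + indexingBytesAllocated >= ramBufferSizeBytes;
              }

              void maybeFlushOnDeletes() {
                if (overBudget()) {
                  // first free indexing RAM that is allocated but unused (e.g. the
                  // postings arrays we no longer reuse), instead of flushing right away
                  balanceRAM();
                  if (overBudget()) {
                    flushDeletes();  // still over the limit after freeing what we can
                  }
                }
              }

              void balanceRAM()   { /* shrink or drop unused buffers; lowers indexingBytesAllocated */ }
              void flushDeletes() { /* apply buffered deletes; resets deletesBytesUsed */ }
            }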

        Michael Busch added a comment -

        We could move the if (postingsArray == null) check to start(), then we don't have to check for every new term?

        Michael McCandless added a comment -

        We could move the if (postingsArray == null) check to start(), then we don't have to check for every new term?

        Excellent, I'll do that!

        Michael McCandless added a comment -

        New patch, init'ing the postings arrays in THPF.start instead of per term.

        Michael McCandless added a comment -

        Third time's a charm?

        Uwe Schindler added a comment -

        This is broken now in the stable branch. We should fix it; Hudson's clover tests are hung in benchmark.

        Michael Busch added a comment -

        I probably missed something here. What exactly is broken and how is it related to this patch?

        Michael McCandless added a comment -

        Michael, I'll take care of it – we just need to merge all commits under this issue to stable. Only the 1st commit made it... I'm working on it now.

        Michael McCandless added a comment -

        OK I merged all commits to 3x.

        Michael Busch added a comment -

        Ah got it. Thanks for taking care of it!

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee:
            Michael Busch
            Reporter:
            Michael Busch
          • Votes:
            0
            Watchers:
            1
