Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

       This issue separates the BytesHash class out from LUCENE-2186.

      1. LUCENE-2662.patch
        12 kB
        Jason Rutherglen
      2. LUCENE-2662.patch
        28 kB
        Simon Willnauer
      3. LUCENE-2662.patch
        62 kB
        Simon Willnauer
      4. LUCENE-2662.patch
        80 kB
        Simon Willnauer
      5. LUCENE-2662.patch
        93 kB
        Simon Willnauer
      6. LUCENE-2662.patch
        93 kB
        Simon Willnauer

        Issue Links

          Activity

          Jason Rutherglen added a comment -

          We need unit tests and a base implementation as BytesHash is abstract...

          Jason Rutherglen added a comment -

          The current hash implementation needs to be separated out of TermsHashPerField.

          Robert Muir added a comment -

          Jason: I am confused... there is no hash impl in TermsHashPerField.

          the hashing, and term encoding and other things, is the responsibility of the analysis chain (TermToBytesRefAttribute):

              // Get the text & hash of this term.
              int code = termAtt.toBytesRef(utf8);
          

          this way, implementations can 'hash-as-they-go' like we do when encoding unicode char[] -> byte[],
          or they can simply return BytesRef.hashCode() if they don't have an optimized implementation.

          Jason Rutherglen added a comment -

          The THPF is hashing tokens for use in the indexing RAM buffer and the creation of postings, ie, the lookup of term byte[]s to term ids. The hash component is currently interwoven into THPF.

          Here's some of the variables being used in THPF.

          private int postingsHashSize = 4;
          private int postingsHashHalfSize = postingsHashSize/2;
          private int postingsHashMask = postingsHashSize-1;
          private int[] postingsHash;
          

          Also there's the methods rehashPostings, shrinkHash, postingEquals, and add(int textStart) has the lookup.

          We'll probably also need to separate out the quick sort implementation in THPF, I'll add that to this issue.

          Robert Muir added a comment -

          Jason: what I am saying is if i look at the method in your patch:

          public T add(BytesRef bytes)

          the first thing it does is compute the hash, but this is already computed in the analysis chain.

          why not have

          public T add(BytesRef bytes, int hashCode)
          

          and also:

          public T add(BytesRef bytes) {
            return add(bytes, bytes.hashCode());
          }
          

          then we can avoid computing this twice, and keep the optimization in UnicodeUtil

          Jason Rutherglen added a comment -

          Ah, ok - I didn't write this code, I extracted it from LUCENE-2186, and it's nice that you reviewed it and found it can be improved. I'll make changes to it shortly, hopefully.

          Simon Willnauer added a comment -

          Jason, can you please hold off with this, since I already have newer / different versions of this class with tests etc. I understand that you need that class, but creating all these issues and rushing ahead is rather counterproductive.

          @Robert: this class is standalone in this patch and doesn't know about the analysis chain. But thanks for the comments I will incorporate them.

          simon

          Jason Rutherglen added a comment -

          Simon, when do you think you'll be posting?

          Simon Willnauer added a comment -

          Simon, when do you think you'll be posting?

          Maybe within the next week - I have a busy schedule. But does this patch keep you from doing any work? You shouldn't just pull stuff out of month-old patches, especially as you don't even give me time to reply on the original discussion.

          Any rush on this?

          Jason Rutherglen added a comment -

          It'd be nice to get deletes working, ie, LUCENE-2655 and move forward in a way that's useful long term. What changes have you made?

          Simon Willnauer added a comment -

          This patch contains a slightly different version of BytesHash (renamed to BytesRefHash, but that is to be discussed - while writing this I actually think BytesHash is the better name). BytesRefHash is now final and does not create Entry objects anymore. Internally it maintains two integer arrays, one acting as the hash buckets and the other containing the bytes-start offsets in the ByteBlockPool. Each added entry is assigned an increasing ordinal, since this is how Entry is used in almost all use-cases (in CSF though). For TermsHashPerField this is also "native" since it uses the same kind of referencing system.

          These changes keep this class as efficient as possible, keeping GC costs low and allowing the JIT to do better optimizations. IMO this class is super performance-critical, and since we recently refactored indexing towards parallel arrays, adding another "object" array might not be the way to go anyway.
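
          A tiny, self-contained sketch of the layout described above (illustrative only, not the actual patch; the class and field names are made up, and rehashing of the bucket table and terms longer than 255 bytes are omitted for brevity):

          import java.util.Arrays;

          // Two parallel int arrays: one used as open-addressed hash buckets, the other
          // mapping each ordinal to the start offset of its bytes in a flat pool.
          final class TinyBytesHash {
            private byte[] pool = new byte[64];        // stand-in for ByteBlockPool
            private int poolUpto = 0;
            private final int[] buckets = new int[16]; // holds ord + 1, 0 means empty slot
            private int[] bytesStart = new int[16];    // per-ord start offset into the pool
            private int count = 0;                     // next ordinal to hand out

            int add(byte[] bytes, int code) {
              int slot = code & (buckets.length - 1);
              while (buckets[slot] != 0 && !matches(buckets[slot] - 1, bytes)) {
                slot = (slot + 1) & (buckets.length - 1); // linear probing
              }
              if (buckets[slot] != 0) {
                return buckets[slot] - 1;               // term already present: return its ord
              }
              if (count == bytesStart.length) bytesStart = Arrays.copyOf(bytesStart, count * 2);
              while (poolUpto + bytes.length + 1 > pool.length) pool = Arrays.copyOf(pool, pool.length * 2);
              bytesStart[count] = poolUpto;             // remember where this term's bytes start
              pool[poolUpto++] = (byte) bytes.length;   // length prefix, then the bytes themselves
              System.arraycopy(bytes, 0, pool, poolUpto, bytes.length);
              poolUpto += bytes.length;
              buckets[slot] = ++count;
              return count - 1;                         // the newly assigned, increasing ordinal
            }

            private boolean matches(int ord, byte[] bytes) {
              int start = bytesStart[ord];
              if ((pool[start] & 0xFF) != bytes.length) return false;
              for (int i = 0; i < bytes.length; i++) {
                if (pool[start + 1 + i] != bytes[i]) return false;
              }
              return true;
            }
          }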

          I also incorporated Robert's comments - thanks for the review anyway. I guess this is the first step towards factoring it out of TermsHashPerField; the next question is, are we gonna do that in a different issue and get this committed first?

          comments / review welcome!!

          One more thing: I did not move ByteBlockPool to o.a.l.utils, but I think it belongs there - thoughts?

          Robert Muir added a comment -

          I guess that is the first step towards factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

          I think it would be better if this class were used in the patch... I wouldn't commit it by itself, unused. It's difficult for people to review its behavior, since it's just a standalone unused thing (for instance, the hashCode thing I brought up).

          Jason Rutherglen added a comment -

          > BytesRefHash is now final and does not create Entry objects anymore

          That's good.

          > move ByteBlockPool to o.a.l.utils

          Sure why not.

          > factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

          We need to factor it out of THPF otherwise this patch isn't really useful for committing. Also, it'll get tested through the entirety of the unit tests, ie, it'll get put through the laundry.

          Simon Willnauer added a comment -

          We need to factor it out of THPF otherwise this patch isn't really useful for committing. Also, it'll get tested through the entirety of the unit tests, ie, it'll get put through the laundry.

          Yeah, let's see this as the first baby step towards it. I will move ByteBlockPool to o.a.l.utils and start cutting THPF over to it. We need to do benchmarking in any case, just to make sure the JIT doesn't play nasty tricks with us again.

          simon

          Jason Rutherglen added a comment -

          make sure JIT doesn't play nasty tricks with us again.

          What would we do if this happens?

          Michael McCandless added a comment -

          Patch looks good Simon – some ideas:

          • In the class jdocs, I think state that this is basically a
            Map<BytesRef,int>?
          • Maybe we also move ByteBlockPool --> oal.util?
          • Maybe move out the ByteBlockAllocator to its own class (in util)?
            RecyclingByteBlockAllocator?
          • Can we have DocumentsWriter share the ByteBlockAllocator? (Right
            now it's dup'd code since DW also implements this).
          • Maybe rename ords -> keys? And hash -> values? (The key isn't
            really an "ord" (I think?) because it increases by more than 1
            each time... it's more like an address since it references an
            address in the byte-pool space).
          • We should advertise the limits in the jdocs – limited to <= 2GB
            total byte storage, each key must be <= BLOCK SIZE-2 in length.
          • Can we have sortedEntries() not allocate a new iterator object?
            Ie, just return the sorted bytesStart int[]? (This is what's done
            today, and, for term vectors on small docs, this method is pretty
            hot). And the javadocs for this should be stronger – it's not
            that the behaviour is undefined after, it's that you must .clear()
            after you're done consuming the sorted entries.
          Michael McCandless added a comment -

          make sure JIT doesn't play nasty tricks with us again.

          What would we do if this happens?

          Cry?

          Or... install Harmony and see if it has the same problem and if so submit a patch to them to fix it

          Simon Willnauer added a comment -

          In the class jdocs, I think state that this is basically a Map<BytesRef,int>?

          yeah that simplifies it - will do.

          Maybe we also move ByteBlockPool --> oal.util?

          yeah I did that already - that makes total sense

          Maybe move out the ByteBlockAllocator to its own class (in util)? RecyclingByteBlockAllocator?

          +1 yeah I like that - I also think we should allow passing the block pool to the BytesHash instead of the allocator. From what I can tell now, I think this is necessary for the refactoring anyway, since we share pools with secondary TermsHash instances in the term vector case.

          Maybe rename ords -> keys? And hash -> values? (The key isn't
          really an "ord" (I think?) because it increases by more than 1
          each time... it's more like an address since it references an
          address in the byte-pool space).

          yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.

          We should advertise the limits in the jdocs - limited to <= 2GB
          total byte storage, each key must be <= BLOCK SIZE-2 in length.

          I think I have done the latter already but I will add the other too.

          Can we have sortedEntries() not allocate a new iterator object?
          Ie, just return the sorted bytesStart int[]? (This is what's done
          today, and, for term vectors on small docs, this method is pretty
          hot). And the javadocs for this should be stronger - it's not
          that the behaviour is undefined after, it's that you must .clear()
          after you're done consume the sorted entries.

          Ah I see - good point. I think what you refer to is public int[] sort(Comparator<BytesRef> comp) - the iterator one is just a more convenient one. I will change it though.

          thanks mike!

          Simon Willnauer added a comment -

          Attaching my current state for feedback and iteration.

          • factored out ByteBlockAllocator from DocumentsWriter
          • moved ByteBlockPool to o.a.l.util
          • added RecyclingByteBlockAllocator, which can be used with or without synchronization. IMO the DummyConcurrentLock will be optimized away so this might be super low cost - feedback on that would be more than welcome (a rough sketch of this idea follows the list).
          • addressed all the comments from mike - thanks again
          • added more tests
          • cut over constants from DocumentsWriter to ByteBlockPool
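
          A rough sketch of the "with or without synchronization" idea from the list above (class and method names are made up, and this is not the patch): the allocator always goes through a Lock, and single-threaded users pass a no-op lock that the JIT should be able to optimize away.

          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.locks.Condition;
          import java.util.concurrent.locks.Lock;
          import java.util.concurrent.locks.ReentrantLock;

          // No-op Lock for single-threaded use, analogous to the DummyConcurrentLock mentioned above.
          final class NoOpLock implements Lock {
            public void lock() {}
            public void unlock() {}
            public void lockInterruptibly() {}
            public boolean tryLock() { return true; }
            public boolean tryLock(long time, TimeUnit unit) { return true; }
            public Condition newCondition() { throw new UnsupportedOperationException(); }
          }

          final class LockedAllocator {
            private final Lock lock;
            private final int blockSize;

            LockedAllocator(int blockSize, boolean threadSafe) {
              this.blockSize = blockSize;
              this.lock = threadSafe ? new ReentrantLock() : new NoOpLock();
            }

            byte[] getByteBlock() {
              lock.lock();
              try {
                return new byte[blockSize]; // recycling of returned blocks omitted for brevity
              } finally {
                lock.unlock();
              }
            }
          }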

          TermsHashPerField is next.... feedback welcome.

          simon

          Jason Rutherglen added a comment -

          An API change to BBP that would be useful: instead of passing the "size in bytes" to newSlice, it'd be better if the level and/or the size were passed in. In fact, throughout the codebase a level, specifically the first, is all that is passed into the newSlice method. The utility of this change is that I'm recording the level of the last slice for the skip list in LUCENE-2312.
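
          A hypothetical illustration of the proposed change (names and level sizes below are made up, not the actual ByteBlockPool API): expose newSlice by level so callers can record the level directly, as wanted for the skip list in LUCENE-2312, instead of translating a level into a size first.

          final class SliceApiSketch {
            // example table of slice sizes per level; the real pool defines its own
            static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

            private int upto = 0;

            // today's style: the caller passes the size in bytes
            int newSlice(int sizeInBytes) {
              int start = upto;
              upto += sizeInBytes;              // buffer management omitted for brevity
              return start;
            }

            // proposed style: the caller passes the level and can keep it for later use
            int newSliceByLevel(int level) {
              return newSlice(LEVEL_SIZE[level]);
            }
          }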

          Jason Rutherglen added a comment -

          Simon, the patch looks like it's ready for the next stage, ie, TermsHashPerField deparchment.

          Simon Willnauer added a comment -

          We are almost there. I factored ByteRefHash out of TermsHashPerField, with just two "nocommit" parts left in the code that I need to find a solution for.

          • there needs to be a way to communicate the byte usage up to DocumentsWriter which I haven't explored yet
          • textStarts in ParallelPostingsArray needs to be replaced since it is already maintained in ByteRefHash. I will need to look closer into that but suggestions are welcome. One way to do it would be to attach a reference to BRH instead of the textStart - but that is a naive suggestion since I haven't looked into that in more detail.

          All tests are passing so far and TermsHashPerField looks somewhat cleaner. I will work on fixing those nocommits and run some indexing perf test against the patch.

          Michael McCandless added a comment -

          Maybe rename ords -> keys? And hash -> values? (The key isn't really an "ord" (I think?) because it increases by more than 1 each time... it's more like an address since it references an address in the byte-pool space).

          yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.

          Duh, I agree – the new names are confusing!! Sorry. I was
          confused... you are right that what's now called "keys" are in fact
          really ords! They are always incr'd by one, on adding a new one.

          How about renaming key back to ord? And then maybe rename values to
          bytesStart? And in their decls add comments saying they are indexed
          by hash code? And maybe rename addByOffset -> addByBytesStart?

          • On the nocommit in ByteBlockPool – I think that's fine? It's an
            internal class....
          • The nocommit in BytesRefHash seems wrong? (Ie, compact is used
            internally)... though maybe we make it private if it's not used
            externally?
          • On the "nocommit factor this out!" in THPF.java... I agree, the
            postingsArray.textStarts should go away right? Ie, it's a
            [wasteful] copy of what the BytesRefHash is already storing?
          • Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe
            AtomicInt – none of these classes can address > 2GB)? Then the
            pool would add in blockSize every time it binds a new block. That
            method (DW.bytesUsed) is called a lot – at least once on every
            addDoc.
          • I'm confused again – when do we use RecyclingByteBlockAllocator
            from a single thread...? Ie, why did the sync need to be
            conditional for this class, again....? It seems like we always
            need it sync'd (both the main pool & per-doc pool need this)? If
            so we can simplify and make these methods sync'd?
          Simon Willnauer added a comment -

          How about renaming key back to ord? And then maybe rename values to
          bytesStart? And in their decls add comments saying they are indexed
          by hash code? And maybe rename addByOffset -> addByBytesStart?

          I don't like addByBytesStart; I would like to keep offset since it really is an offset into the pool. addByPoolOffset?
          The names ord and bytesStart are a good compromise - let's shoot for that.

          On the nocommit in ByteBlockPool - I think that's fine? It's an
          internal class....

          You refer to this: // nocommit - public arrays are not nice! ?
          Yeah, that's more of a style thing, but if somebody changes them it's their fault for being stupid, I guess.

          The nocommit in BytesRefHash seems wrong? (Ie, compact is used
          internally)... though maybe we make it private if it's not used
          externally?

          Ah yeah, that's bogus - it's from a previous iteration which was wrong as well. I will remove it.

          On the "nocommit factor this out!" in THPF.java... I agree, the
          postingsArray.textStarts should go away right? Ie, it's a
          [wasteful] copy of what the BytesRefHash is already storing?

          Yeah, that is the reason for that nocommit. Yet I thought about this a little, and I have two options for it.

          • we could factor out a super class from ParallelPostingsArray which only has the textStart int array and the grow and copy methods, and let ParallelPostingsArray subclass it.
            BytesRefHash would accept this class - I don't have a good name for it, but let's call it TextStartArray for now - and use it internally. It would call grow() when needed inside BytesRefHash, and all the other code would be unchanged since PPA is a subclass.
          • the other way would be to bind the ByteRefHash to the postings array which seems odd to me though.

          More ideas?

          Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe
          AtomicInt - none of these classes can address > 2GB)? Then the
          pool would add in blockSize every time it binds a new block. That
          method (DW.bytesUsed) is called alot - at least once on every
          addDoc.

          I did exactly that in the not yet uploaded patch. But I figured that it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF or is that what you mean?

          I'm confused again - when do we use RecyclingByteBlockAllocator
          from a single thread...? Ie, why did the sync need to be
          conditional for this class, again....? It seems like we always
          need it sync'd (both the main pool & per-doc pool need this)? If
          so we can simplify and make these methods sync'd?

          Man, I am sorry - I thought I would use this in LUCENE-2186 in a single-threaded env, but if so I should change it there if needed. I was one step ahead though.
          I will change it and maybe have a second one if needed. Agree?

          simon

          Michael McCandless added a comment -

          I don't like addByBytesStart I would like to keep offset since it really is an offset into the pool. addByPoolOffset?
          The names ord and bytesStart are a good compromise lets shoot for that.

          OK!

          we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it.

          This seems good? So, this would be the "store" that BRH manages... and by subclassing it you can have other parallel arrays storing anything, indexed by ord.
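
          A rough sketch of that abstraction (hypothetical names and signatures, not the committed code): BytesRefHash only needs somewhere to store and grow the per-ord bytesStart offsets, and a subclass can hang additional parallel arrays off the same ordinals.

          // The minimal "store" the hash manages.
          abstract class BytesStartArray {
            abstract int[] init();   // create or hand over the bytesStart array
            abstract int[] grow();   // called by the hash when it runs out of ords
            abstract int[] clear();  // release or reset the array
          }

          // Example subclass keeping an extra per-term array, grown in lockstep and indexed by ord.
          final class FreqStartArray extends BytesStartArray {
            int[] bytesStart = new int[8];
            int[] freqs = new int[8];

            @Override int[] init() { return bytesStart; }

            @Override int[] grow() {
              bytesStart = java.util.Arrays.copyOf(bytesStart, bytesStart.length * 2);
              freqs = java.util.Arrays.copyOf(freqs, bytesStart.length);
              return bytesStart;
            }

            @Override int[] clear() {
              freqs = new int[8];
              return bytesStart = new int[8];
            }
          }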

          I did exactly that in the not yet uploaded patch. But I figured that it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF or is that what you mean?

          I think we should use it everywhere to track bytes used
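
          A minimal sketch of that idea (names are illustrative): a single AtomicLong is handed to every component that allocates blocks, each newly bound block adds its size, and DocumentsWriter can read the running total cheaply on every added document.

          import java.util.concurrent.atomic.AtomicLong;

          final class CountingBlockPool {
            private final AtomicLong bytesUsed; // shared with DocumentsWriter, THPF, the hash, ...
            private final int blockSize;
            private byte[] current;

            CountingBlockPool(AtomicLong bytesUsed, int blockSize) {
              this.bytesUsed = bytesUsed;
              this.blockSize = blockSize;
            }

            void nextBuffer() {
              current = new byte[blockSize];
              bytesUsed.addAndGet(blockSize); // account for the block as soon as it is bound
            }

            long bytesUsed() {
              return bytesUsed.get();         // cheap enough to call once per added document
            }
          }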

          man, I am sorry - I thought I will use this in LUCENE-2186 in a single threaded env but if so I should change it there if needed. I was one step ahead though.

          I will change and maybe have a second one if needed. Agree?

          Ahh that's right, I forgot the whole driver for this refactoring, heh. Yeah, I think leave it sync'd for now and we can test whether this hurts perf in LUCENE-2186? "Supposedly" uncontended locks are low-cost (but I'm not sure...).

          Jason Rutherglen added a comment -

          we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it.

          This option makes the most sense. ParallelByteStartsArray?

          Simon Willnauer added a comment -

          Next iteration - seems to be very close!

          I have applied the following changes:

          • introduced an AtomicLong to track bytesUsed in DocumentsWriter, TermsHashPerField, ByteRefHash and RecyclingByteBlockAllocator
          • Factored out a BytesStartArray class from BytesRefHash that manages the int[] holding the bytesStart offsets. TermsHashPerField subclasses and manages the ParallelPostingsArray through it.
          • remove remaining no-commits
          • made RecyclingByteBlockAllocator synced by default (we use synchronized methods for it now)

          I ran a quick Wikipedia 100k docs benchmark against trunk vs. LUCENE-2662 and the results are promising.

          version       rec/sec   elapsed sec   avgUsedMem
          LUCENE-2662   717.30    139.41        536,682,592
          trunk         682.66    146.49        546,065,344

          I will run the 10M benchmark once I get back to this.

          Jason Rutherglen added a comment -

          Simon, looks good.

          Are we using:

          public int add(BytesRef bytes, int code)
          

          yet?

          Simon Willnauer added a comment -

          Are we using:...

          yeah, look at TermsHashPerField's add() method

                 termID = bytesHash.add(termBytesRef, termAtt.toBytesRef(termBytesRef));
          

          simon

          Michael McCandless added a comment -

          I indexed 10M 1KB wikipedia docs, single threaded, and also see things a bit faster w/ the patch (10,353 docs/sec vs 10,182 docs/sec). Nice to have a refactor improve performance for a change, heh.

          The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).

          Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

          Michael McCandless added a comment -

          Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

          Sorry, scratch that – I was inadvertently flushing by doc count, not by RAM usage. I'm re-running w/ flush-by-RAM to verify we flush at exactly the same points as trunk.

          Michael McCandless added a comment -

          In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?

          Also maybe rename pos -> freeCount? (pos is a little too generic?)
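
          A hypothetical sketch of the fix being suggested (names are illustrative): keep at most maxBufferedBlocks blocks for reuse, but null out every incoming slot so the caller's array does not keep discarded blocks reachable.

          final class RecyclingSketch {
            private final byte[][] freeBlocks;
            private final int maxBufferedBlocks;
            private int freeCount; // per the rename suggested above

            RecyclingSketch(int maxBufferedBlocks) {
              this.maxBufferedBlocks = maxBufferedBlocks;
              this.freeBlocks = new byte[maxBufferedBlocks][];
            }

            void recycleByteBlocks(byte[][] blocks, int start, int end) {
              for (int i = start; i < end; i++) {
                if (freeCount < maxBufferedBlocks) {
                  freeBlocks[freeCount++] = blocks[i]; // buffer for later reuse
                }
                blocks[i] = null; // always drop the caller's reference, even when not buffered
              }
            }
          }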

          Robert Muir added a comment -

          Simon, thank you for renaming the 'utf8' variables here.

          Simon Willnauer added a comment -

          Simon, thank you for renaming the 'utf8' variables here.

          YW

          In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?

          Ahh you are right - I will fix.

          Also maybe rename pos -> freeCount? (pos is a little too generic?)

          I mean, it's internal, but I see your point.

          thanks for reviewing it closely.

          The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).

          hmm I will dig once I get back to my workstation.

          simon

          Michael McCandless added a comment -

          OK, my 2nd indexing test (10M wikipedia docs, flush @ 256 MB RAM used) finished and trunk/patch have essentially the same throughput, and all flushes happened at identical points. So I think we are good to go...

          Nice work Simon!

          Michael McCandless added a comment -

          I also ran a test w/ 5 threads – they are close (22,402 docs/sec for patch, 22,868 docs/sec for trunk), and this time avgUsedMem is closer (811 MB for trunk, 965 MB for patch).

          I don't think the avgUsedMem is that meaningful – it takes the max of Runtime.totalMemory() - Runtime.freeMemory() (which includes garbage I think), after each completed task, and then averages across all tasks. In my case I think it's averaging 1 measure per thread, so it's really sort of measuring how much garbage there happened to be at the time.

          Michael McCandless added a comment -

          I instrumented trunk & the patch to see how many times we do new byte[bufferSize] while building 5M index, and they both alloc the same number of byte[] from the BBA. So I don't think we have a memory issue...

          Simon Willnauer added a comment -

          This patch fixes nulling out the recycled but not reused byte blocks in RecyclingByteBlockAllocator.

          I think we are ready to go; I will commit to trunk soon. I don't think we need a CHANGES.TXT entry here - at least I cannot find any section this refactoring would fit into.

          simon

          Simon Willnauer added a comment -

          Committed to trunk in rev. 1003790

          @Jason: do you need that merged into Realtime-Branch or is buschmi going to do that? Otherwise I can help too

          I will keep it open until this is merged into Realtime Branch

          Jason Rutherglen added a comment -

          Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate. This'll probably be best.

          Simon Willnauer added a comment -

          Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate. This'll probably be best.

          Jason, I suggest you create a separate issue, something like "Integrate BytesRefHash in Realtime Branch", and I will take care of it. I think this issue had a clear target - to factor the hash table out of TermsHashPerField - and we should close it. Let's use a new one to track the integration.

          Thoughts?

          Simon

          Jason Rutherglen added a comment -

          Let's commit this to trunk. We need to merge all of trunk into the RT branch, or vice versa, at some point anyway. This patch could be part of that bulk merge-in, or we can simply do it now.

          Michael McCandless added a comment -

          This was already committed to trunk...

          Mathias Walter added a comment -

          Why is this issue still open, if the patch was already committed to trunk?

          Simon Willnauer added a comment -

          Why is this issue still open, if the patch was already committed to trunk?

          see my comment above:

          I will keep it open until this is merged into Realtime Branch

          Simon Willnauer added a comment -

          I will keep it open until this is merged into Realtime Branch

          I think we should really close this since RT branch is not very active right now....

          Michael Busch added a comment -

          I think we should really close this since RT branch is not very active right now....

          Sorry about that. I need to merge trunk into RT, then I'll get this change too. It's a big merge though with tons of conflicts...

          Simon Willnauer added a comment -

          Sorry about that. I need to merge trunk into RT, then I'll get this change too. It's a big merge though with tons of conflicts...

          HA! good to see you here! have fun with the merge

          Uwe Schindler added a comment -

          HA! good to see you here! have fun with the merge

          He is working hard, it's 4:45 am in California

          Simon Willnauer added a comment -

          He is working hard, it's 4:45 am in California

          True, but he is in Germany.

          Michael Busch added a comment -

          Yeah sitting in Stuttgart, going to hit the Weihnachtsmarkt soon - let's see how the merge goes after several glasses of Gluehwein


            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Jason Rutherglen
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved:

                Development