Solr
  1. Solr
  2. SOLR-2382 DIH Cache Improvements
  3. SOLR-2947

DIH caching bug - EntityRunner destroys child entity processor

    Details

      Description

      My intention is fix multithread import with SQL cache. Here is the 2nd stage. If I enable DocBuilder.EntityRunner flow even for single thread, it breaks the pretty basic functionality: parent-child join.
      the reason is line 473 entityProcessor.destroy(); breaks children entityProcessor.

      see attachement comments for more details.

      1. dih-cache-threads-enabling-bug.patch
        0.6 kB
        Mikhail Khludnev
      2. dih-cache-destroy-on-threads-fix.patch
        5 kB
        Mikhail Khludnev
      3. SOLR-2947.patch
        5 kB
        Mikhail Khludnev
      4. SOLR-2947.patch
        7 kB
        James Dyer
      5. SOLR-2947.patch
        32 kB
        Mikhail Khludnev
      6. SOLR-2947.patch
        13 kB
        Mikhail Khludnev
      7. SOLR-2947.patch
        15 kB
        Mikhail Khludnev
      8. SOLR-2947.patch
        12 kB
        James Dyer

        Activity

        Hide
        Robert Muir added a comment -

        Sorry James, my mistake, I didn't realize that!

        Show
        Robert Muir added a comment - Sorry James, my mistake, I didn't realize that!
        Hide
        James Dyer added a comment -

        This bug was caused by SOLR-2382 which is trunk-only. Do we need a CHANGES.txt entry for that?

        Show
        James Dyer added a comment - This bug was caused by SOLR-2382 which is trunk-only. Do we need a CHANGES.txt entry for that?
        Hide
        Robert Muir added a comment -

        Hi James: I think we should add a CHANGES.txt entry for this fix?

        Show
        Robert Muir added a comment - Hi James: I think we should add a CHANGES.txt entry for this fix?
        Hide
        James Dyer added a comment -

        Trunk Only: r1245014 & r1245018. Thank you Mikhail (now to the next one ).

        Show
        James Dyer added a comment - Trunk Only: r1245014 & r1245018. Thank you Mikhail (now to the next one ).
        Hide
        James Dyer added a comment -

        I will commit this shortly.

        Show
        James Dyer added a comment - I will commit this shortly.
        Hide
        James Dyer added a comment -

        I not set up yet and not sure how long it'll be. But I do want to get the DIH bugs taken care of asap.

        Show
        James Dyer added a comment - I not set up yet and not sure how long it'll be. But I do want to get the DIH bugs taken care of asap.
        Hide
        Mikhail Khludnev added a comment -

        James,
        My congratulations for becoming a committer!

        Can't you push forward this one and small SOLR-2933 too to unlock brutal SOLR-3011?

        Regards

        Show
        Mikhail Khludnev added a comment - James, My congratulations for becoming a committer! Can't you push forward this one and small SOLR-2933 too to unlock brutal SOLR-3011 ? Regards
        Hide
        Mikhail Khludnev added a comment -

        Small point here but I prefer the TestEphemeralCache changes I made in the Dec 11, 2011 patch version.

        James,

        • my test reads base config from xml file and patches number of thread for every run, as well as your test does. So, there is no much difference here.
        • I agree that the random number of threads can be much reasonable than the constant one. Anyway it should be @Ignore ed until you know.
        • The really significant advantage of my test is DestroyCountCache which assert once-and-only-once destroy() call for caches.

        Regards

        Show
        Mikhail Khludnev added a comment - Small point here but I prefer the TestEphemeralCache changes I made in the Dec 11, 2011 patch version. James, my test reads base config from xml file and patches number of thread for every run, as well as your test does. So, there is no much difference here. I agree that the random number of threads can be much reasonable than the constant one. Anyway it should be @Ignore ed until you know. The really significant advantage of my test is DestroyCountCache which assert once-and-only-once destroy() call for caches. Regards
        Hide
        James Dyer added a comment -

        I think this patch does the right thing here, calling "destroy()" down the hierarchy of EntityProcessors, but waiting until after doc-building is complete. While I had it this way for the single-threaded code, I punted on the multi-threaded case simply hoping that because the unit tests were passing then everything would be alright . I appreciate the effort to improve the DIH multithreaded code. We really need to get rid of bugs like this and long-term it would pay if we could try and make the code more maintainable, get better test coverage, etc.

        An example is the new "children()" method...using just the first ThreadedEntityProcessorWrapper from the list I think is valid because the "children" will be same on all the threads. But then again, looking at how this all gets populated in the ThreadedEntityProcessorWrapper constructor, the answer (to me) isn't obvious. Best I can say is this is probably correct and certainly a vast improvement than what is currently in Trunk.

        Small point here but I prefer the TestEphemeralCache changes I made in the Dec 11, 2011 patch version. I switched to building the config file on-the-fly and testMultiThreaded() uses a random number of threads instead of always using 10. Of course, if we go with this then we'd need to add "@Ignore" for testMultiThreaded() until SOLR-3011 can be commited.

        Show
        James Dyer added a comment - I think this patch does the right thing here, calling "destroy()" down the hierarchy of EntityProcessors, but waiting until after doc-building is complete. While I had it this way for the single-threaded code, I punted on the multi-threaded case simply hoping that because the unit tests were passing then everything would be alright . I appreciate the effort to improve the DIH multithreaded code. We really need to get rid of bugs like this and long-term it would pay if we could try and make the code more maintainable, get better test coverage, etc. An example is the new "children()" method...using just the first ThreadedEntityProcessorWrapper from the list I think is valid because the "children" will be same on all the threads. But then again, looking at how this all gets populated in the ThreadedEntityProcessorWrapper constructor, the answer (to me) isn't obvious. Best I can say is this is probably correct and certainly a vast improvement than what is currently in Trunk. Small point here but I prefer the TestEphemeralCache changes I made in the Dec 11, 2011 patch version. I switched to building the config file on-the-fly and testMultiThreaded() uses a random number of threads instead of always using 10. Of course, if we go with this then we'd need to add "@Ignore" for testMultiThreaded() until SOLR-3011 can be commited.
        Hide
        Mikhail Khludnev added a comment -

        add comments, assert, and root entity only test. though not really significant update.

        Show
        Mikhail Khludnev added a comment - add comments, assert, and root entity only test. though not really significant update.
        Hide
        Mikhail Khludnev added a comment -

        ...why is it sufficient to get the children from only the first element of that iterator?

        • yes it is. just because DIH code works only with threads=1. assert entityProcessorWrapper.size()==1. but if we set threads to 2 or more, every TEPW will have own collection of children' entity runners (and every ER will have own EP). I address it in SOLR-3011 see my explanation above "DocBuilder.java/object instantiating".

        are there any situations where the iterator may not return any elements?

        • EntityRunner.<init> declares size of this list 1 by default. if user configured it to 0 EntityRunner.run() method will fail due to entityProcessorWrapper.get(0).

        at first i thought this was because all of the ThreadedEntityProcessorWrapper instances were initialized with identical EntityProcessors

        done in SOLR-3011 patch.

        for (int i = 0; i < threads; i++) {
                thepw = new ThreadedEntityProcessorWrapper(       ...
                        childrenRunners, i);           <-- identical EntityProcessors
                entityProcessorWrapper.add(thepw);
        }
        

        Is this something that's just a hack that only works with 1 thread until/unless SOLR-3011 is resolved?

        in according to subj it's a fix for a bug introduced by SOLR-2384. threads=2 didn't work before, but now threads=1 doesn't work too. I removed wrong EP destroying, and implement destroying which works now for threads=1, and will work for threads=10 after SOLR-3011.

        will this break things for people using multithreaded DIH without nested entities?

        test passed, you know. it even will destroy root EP properly because the collecting of ERs in doFullDump() starts from placing root ER into result.

        Either way: a comment clarifying that monstrosity would be a good idea.

        I can't distinguish which monstrosity you refer to.

        Show
        Mikhail Khludnev added a comment - ...why is it sufficient to get the children from only the first element of that iterator? yes it is. just because DIH code works only with threads=1. assert entityProcessorWrapper.size()==1. but if we set threads to 2 or more, every TEPW will have own collection of children' entity runners (and every ER will have own EP). I address it in SOLR-3011 see my explanation above "DocBuilder.java/object instantiating". are there any situations where the iterator may not return any elements? EntityRunner.<init> declares size of this list 1 by default. if user configured it to 0 EntityRunner.run() method will fail due to entityProcessorWrapper.get(0). at first i thought this was because all of the ThreadedEntityProcessorWrapper instances were initialized with identical EntityProcessors done in SOLR-3011 patch. for ( int i = 0; i < threads; i++) { thepw = new ThreadedEntityProcessorWrapper( ... childrenRunners, i); <-- identical EntityProcessors entityProcessorWrapper.add(thepw); } Is this something that's just a hack that only works with 1 thread until/unless SOLR-3011 is resolved? in according to subj it's a fix for a bug introduced by SOLR-2384 . threads=2 didn't work before, but now threads=1 doesn't work too. I removed wrong EP destroying, and implement destroying which works now for threads=1, and will work for threads=10 after SOLR-3011 . will this break things for people using multithreaded DIH without nested entities? test passed, you know. it even will destroy root EP properly because the collecting of ERs in doFullDump() starts from placing root ER into result. Either way: a comment clarifying that monstrosity would be a good idea. I can't distinguish which monstrosity you refer to.
        Hide
        Hoss Man added a comment -

        I decided that my 32K patch from 28/Dec is to huge. I'm attaching the shorter version.

        Mikhail: as someone who wants to help get DIH improvements/bug fixes commited, but probably understands less about DIH internals then you and James, i thank you for splitting out these issues and explaining what's going on, and pressing forward with patches that include tests.

        The latest "SOLR-2947.patch" looks fairly good to me, and all tests pass (with the caveat that the @Ignored test depends on SOLR-3011 - although that should really be noted in the @Ignore msg) the one thing that really confuses me is the implementation of children(), it smells extremely fishy...

        +    Collection<EntityRunner> children(){
        +        return entityProcessorWrapper.iterator().next().children.values();
        +    } 
        

        ...why is it sufficient to get the children from only the first element of that iterator? are there any situations where the iterator may not return any elements?

        I'm guessing the answer to the later is "no" because that would mean 0 threads, correct? (in which case, let's toss a java assert in there for good measure).

        As for the first question: at first i thought this was because all of the ThreadedEntityProcessorWrapper instances were initialized with identical EntityProcessors so any one would due (in which why not just use the entityProcessor handing directly off the parent EntityRunning) – but poking around ThreadedEntityProcessorWrapper i see that they seem to always create new EntityRunner instance, so i'm at a loss to understand how that "children()" method makes sense.

        Is this something that's just a hack that only works with 1 thread until/unless SOLR-3011 is resolved? will this break things for people using multithreaded DIH without nested entities? (i don't think so – because that would contradict the premise of there being any child entities that need their entityProcessor to be desctoryed – but i wanted to sanity check the question). Either way: a comment clarifying that monstrosity would be a good idea.

        James: are you comfortable with the latest patch?

        Show
        Hoss Man added a comment - I decided that my 32K patch from 28/Dec is to huge. I'm attaching the shorter version. Mikhail: as someone who wants to help get DIH improvements/bug fixes commited, but probably understands less about DIH internals then you and James, i thank you for splitting out these issues and explaining what's going on, and pressing forward with patches that include tests. The latest " SOLR-2947 .patch" looks fairly good to me, and all tests pass (with the caveat that the @Ignored test depends on SOLR-3011 - although that should really be noted in the @Ignore msg) the one thing that really confuses me is the implementation of children(), it smells extremely fishy... + Collection<EntityRunner> children(){ + return entityProcessorWrapper.iterator().next().children.values(); + } ...why is it sufficient to get the children from only the first element of that iterator? are there any situations where the iterator may not return any elements? I'm guessing the answer to the later is "no" because that would mean 0 threads, correct? (in which case, let's toss a java assert in there for good measure). As for the first question: at first i thought this was because all of the ThreadedEntityProcessorWrapper instances were initialized with identical EntityProcessors so any one would due (in which why not just use the entityProcessor handing directly off the parent EntityRunning) – but poking around ThreadedEntityProcessorWrapper i see that they seem to always create new EntityRunner instance, so i'm at a loss to understand how that "children()" method makes sense. Is this something that's just a hack that only works with 1 thread until/unless SOLR-3011 is resolved? will this break things for people using multithreaded DIH without nested entities? (i don't think so – because that would contradict the premise of there being any child entities that need their entityProcessor to be desctoryed – but i wanted to sanity check the question). Either way: a comment clarifying that monstrosity would be a good idea. James: are you comfortable with the latest patch?
        Hide
        Mikhail Khludnev added a comment -

        Hello,

        I decided that my 32K patch from 28/Dec is to huge. I'm attaching the shorter version. It's quite close to the James's one, but I put @Ignore for threads="10" to keep build green after apply; and I also assert that EntityProcessors are properly destroyed.
        I'm going to create sibling issue for DIH multi-threading and put most of my prev patch there.

        Show
        Mikhail Khludnev added a comment - Hello, I decided that my 32K patch from 28/Dec is to huge. I'm attaching the shorter version. It's quite close to the James's one, but I put @Ignore for threads="10" to keep build green after apply; and I also assert that EntityProcessors are properly destroyed. I'm going to create sibling issue for DIH multi-threading and put most of my prev patch there.
        Hide
        Mikhail Khludnev added a comment -

        Ok. here is the patch, which fixes issue with destroy() and problem with multiple threads and CachedSqlEntityProcessor.

        Code

        Context.java, ContextImpl.java

        • removed SCOPE_DOC constant. I can't find any usages. Old impl isn't thread safe. We can implement it thread safe if you want. Let me know if it's necessary.
        • Pay attention that ContextImpl.putVal() ignores the scope provided. It should be tracked separately let me know if you like me to raise it.

        DataImporter.java

        I added DocBuilder.destroy() to stop thread pool after all work is done. I'm bothered by testCase's warns about "thread leaks"

        DIHCacheSupport.java

        it just introduces a getter. But I generated diff against uncommitted SOLR-2961, so line numbers can be wrong, let me know I re-diff it.

        DocBuilder.java

        • EntityRunner stops create EntityProcessors and obtains it from constructor args
        • proper destroying EntityProcessors
        • EntityRunner.docWrapper is removed as not-thread-safe. it's passed explicitly by method arguments
        • EntityRunner.entityEnded was't thread-safe too. moved into ThreadedEntityProcessorWrapper
        • object instantiating was drastically amended to be threadsafe
          • single EntityRunner per Entity
          • single EntityProcessor per EntityRunner
          • N ThreadedEntityProcessorWrapper per EntityRunner uses its' EntityProcessor as delegate
          • where N is number of threads specified at root entity (threads attr is prohibited for child entities)
          • ThreadedEntityProcessorWrapper are numbered by their positions in EntityRunner's tepw list
          • parent entity's ThreadedEntityProcessorWrapper always hits children's tepw with the same number as its' own
        • parent entity's ThreadedEntityProcessorWrapper always hits children's tepw by plain Java synchronous call (w/o thread pool)

        EntityProcessor.java,EntityProcessorBase.java

        isPaged() property has been introduced

        EntityProcessorWrapper.java

        protected transformRow() has been extracted from applyTransformer(). I need to reuse transformers logic for the paged flow but applyTransformer() has side-effect on rowcache field.

        ThreadedEntityProcessorWrapper.java

        in addition to all refactorings above (instantiating and field move). it contains the core idea of multithred cached entity processor:

        • after tepw obtains access to thread-unaware delegate entityProcessor it need to pull whole page - all children records belong to the current parent,
        • whole page is transformed and put into tepw.rowcahce, where they will be pulled later by the parent entity tepw

        Tests

        TestThreaded.java

        added full space test for CachedSqlEP for no, 1, 2, 10 (keep in mind 1 thread don't equal to no-threads)

        TestEphemeralCache.java

        add double destroy() check EntityProcessors

        dataimport-cache-ephemeral.xml

        specifies 10 threads and add double destroy() EntityProcessors

        Show
        Mikhail Khludnev added a comment - Ok. here is the patch, which fixes issue with destroy() and problem with multiple threads and CachedSqlEntityProcessor. Code Context.java, ContextImpl.java removed SCOPE_DOC constant. I can't find any usages. Old impl isn't thread safe. We can implement it thread safe if you want. Let me know if it's necessary. Pay attention that ContextImpl.putVal() ignores the scope provided . It should be tracked separately let me know if you like me to raise it. DataImporter.java I added DocBuilder.destroy() to stop thread pool after all work is done. I'm bothered by testCase's warns about "thread leaks" DIHCacheSupport.java it just introduces a getter. But I generated diff against uncommitted SOLR-2961 , so line numbers can be wrong, let me know I re-diff it. DocBuilder.java EntityRunner stops create EntityProcessors and obtains it from constructor args proper destroying EntityProcessors EntityRunner.docWrapper is removed as not-thread-safe. it's passed explicitly by method arguments EntityRunner.entityEnded was't thread-safe too. moved into ThreadedEntityProcessorWrapper object instantiating was drastically amended to be threadsafe single EntityRunner per Entity single EntityProcessor per EntityRunner N ThreadedEntityProcessorWrapper per EntityRunner uses its' EntityProcessor as delegate where N is number of threads specified at root entity (threads attr is prohibited for child entities) ThreadedEntityProcessorWrapper are numbered by their positions in EntityRunner's tepw list parent entity's ThreadedEntityProcessorWrapper always hits children's tepw with the same number as its' own parent entity's ThreadedEntityProcessorWrapper always hits children's tepw by plain Java synchronous call (w/o thread pool) EntityProcessor.java,EntityProcessorBase.java isPaged() property has been introduced EntityProcessorWrapper.java protected transformRow() has been extracted from applyTransformer(). I need to reuse transformers logic for the paged flow but applyTransformer() has side-effect on rowcache field. ThreadedEntityProcessorWrapper.java in addition to all refactorings above (instantiating and field move). it contains the core idea of multithred cached entity processor: after tepw obtains access to thread-unaware delegate entityProcessor it need to pull whole page - all children records belong to the current parent, whole page is transformed and put into tepw.rowcahce, where they will be pulled later by the parent entity tepw Tests TestThreaded.java added full space test for CachedSqlEP for no, 1, 2, 10 (keep in mind 1 thread don't equal to no-threads) TestEphemeralCache.java add double destroy() check EntityProcessors dataimport-cache-ephemeral.xml specifies 10 threads and add double destroy() EntityProcessors
        Hide
        Mikhail Khludnev added a comment -

        Is it your impression that DIH caching ever worked threaded, or do you think SOLR-2382 might have broken it?

        No. It didn't. Multiple threads don't work with CachedSqlEntityProcessor at 3.4 too.

        But I would think before that can happen all of these threaded bugs need to be worked through.

        ok. I'm going to address it.

        Show
        Mikhail Khludnev added a comment - Is it your impression that DIH caching ever worked threaded, or do you think SOLR-2382 might have broken it? No. It didn't. Multiple threads don't work with CachedSqlEntityProcessor at 3.4 too. But I would think before that can happen all of these threaded bugs need to be worked through. ok. I'm going to address it.
        Hide
        James Dyer added a comment -

        Mikhail,

        I appreciate your finding this bug and providing a fix. Indeed I wasn't sure I was handling "destroy()" correctly for the multi-threaded case in SOLR-2382 but then again all tests were passing.

        I think you're absolutely correct we don't have enough threaded tests. To help with this, I expanded "TestEphemeralCache" (see updated patch) to run it 3 times: first single-threaded, second single-threaded with 'threads=1', and third as truly multi-threaded (2-6 threads randomly chosen).

        Obviously the third case still fails but you indicate you plan on fixing that one next? That would be great if you could. Let me know if there is any way I can help although I admit I'm a bit unsure of myself in the DIH multi-threaded code. Is it your impression that DIH caching ever worked threaded, or do you think SOLR-2382 might have broken it?

        I also agree that it shouldn't use the "threaded" code if "threads=1" For now, it provides us with a nice testing scenario to fix the 'destroy()' semantics without having to tackle the multi-threaded cacheing beast.

        Although I'm with you that it would be great if we could just combine the single-threaded and multi-threaded cases into the same code. But I would think before that can happen all of these threaded bugs need to be worked through.

        I apologize for taking so long to look at this. On the other hand, I'm not going to be much help to you in getting this committed. But if I can help in any other way let me know.

        Show
        James Dyer added a comment - Mikhail, I appreciate your finding this bug and providing a fix. Indeed I wasn't sure I was handling "destroy()" correctly for the multi-threaded case in SOLR-2382 but then again all tests were passing. I think you're absolutely correct we don't have enough threaded tests. To help with this, I expanded "TestEphemeralCache" (see updated patch) to run it 3 times: first single-threaded, second single-threaded with 'threads=1', and third as truly multi-threaded (2-6 threads randomly chosen). Obviously the third case still fails but you indicate you plan on fixing that one next? That would be great if you could. Let me know if there is any way I can help although I admit I'm a bit unsure of myself in the DIH multi-threaded code. Is it your impression that DIH caching ever worked threaded, or do you think SOLR-2382 might have broken it? I also agree that it shouldn't use the "threaded" code if "threads=1" For now, it provides us with a nice testing scenario to fix the 'destroy()' semantics without having to tackle the multi-threaded cacheing beast. Although I'm with you that it would be great if we could just combine the single-threaded and multi-threaded cases into the same code. But I would think before that can happen all of these threaded bugs need to be worked through. I apologize for taking so long to look at this. On the other hand, I'm not going to be much help to you in getting this committed. But if I can help in any other way let me know.
        Hide
        Mikhail Khludnev added a comment -

        Clean up comment. Rename patch to conform to HowToContribute

        Show
        Mikhail Khludnev added a comment - Clean up comment. Rename patch to conform to HowToContribute
        Hide
        Mikhail Khludnev added a comment -

        a few thoughts

        • formally DataImporter.initEntity() shouldn't consider config as multithread if the only single thread is specified
        • I'm concerning for overall test coverage for EntityRunners. You see switching to them breaks the essential feature. Are they tested enough? is it worth to check the actual coverage? You see there is not tests, where child adds anything into parent. and
        • btw, can't we remove DocBuilder.buildDocument() and switch onto EntityRunners to avoid "second system" issue?
        • in TestEphemeralCache I'd like to declare parent as cacheable entity too. Just for check how it works.

        after this, the next fight is true multi-threading.

        Show
        Mikhail Khludnev added a comment - a few thoughts formally DataImporter.initEntity() shouldn't consider config as multithread if the only single thread is specified I'm concerning for overall test coverage for EntityRunners. You see switching to them breaks the essential feature. Are they tested enough? is it worth to check the actual coverage ? You see there is not tests, where child adds anything into parent. and btw, can't we remove DocBuilder.buildDocument() and switch onto EntityRunners to avoid "second system" issue? in TestEphemeralCache I'd like to declare parent as cacheable entity too. Just for check how it works. after this, the next fight is true multi-threading.
        Hide
        Mikhail Khludnev added a comment -

        To reproduce the issue just insert single line in test config by applying dih-cache-threads-enabling-bug.patch. It curiously breaks TestEphemeralCache.

        DocBuilder.doFullDump() considers request as dataImporter.getConfig().isMultiThreaded even the single thread is specified (DataImporter.initEntity() too straightforward). But it forces DocBuilder to choose EntityRunner branch instead of buildDocument().

        EntityRunner destroys entity processor (and cleanups the cache) per every run() call, but for child entity the run() is called many times - once per parent. As result the only first child is indexed, and it fails the test.
        You can fix it by commenting entityProcessor destroying in EntityRunner.run() at the finally section. But you lose cache releasing at all.
        dih-cache-destroy-on-threads-fix.patch fixes entityProcessors destroying and cover it by slightly amended TestEphemeralCache

        Show
        Mikhail Khludnev added a comment - To reproduce the issue just insert single line in test config by applying dih-cache-threads-enabling-bug.patch. It curiously breaks TestEphemeralCache. DocBuilder.doFullDump() considers request as dataImporter.getConfig().isMultiThreaded even the single thread is specified (DataImporter.initEntity() too straightforward). But it forces DocBuilder to choose EntityRunner branch instead of buildDocument(). EntityRunner destroys entity processor (and cleanups the cache) per every run() call, but for child entity the run() is called many times - once per parent. As result the only first child is indexed, and it fails the test. You can fix it by commenting entityProcessor destroying in EntityRunner.run() at the finally section. But you lose cache releasing at all. dih-cache-destroy-on-threads-fix.patch fixes entityProcessors destroying and cover it by slightly amended TestEphemeralCache

          People

          • Assignee:
            James Dyer
            Reporter:
            Mikhail Khludnev
          • Votes:
            3 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development