Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Labels:
      None

      Description

      The current DIH design is not thread safe; see the last comments at SOLR-2382 and SOLR-2947. I'm going to provide a patch that makes the DIH core thread-safe. Mostly it's the SOLR-2947 patch from 28th Dec.

      1. patch-3011-EntityProcessorBase-iterator.patch
        2 kB
        Wenca Petr
      2. patch-3011-EntityProcessorBase-iterator.patch
        0.9 kB
        Wenca Petr
      3. SOLR-3011.patch
        40 kB
        James Dyer
      4. SOLR-3011.patch
        42 kB
        Mikhail Khludnev
      5. SOLR-3011.patch
        39 kB
        Mikhail Khludnev
      6. SOLR-3011.patch
        28 kB
        Mikhail Khludnev
      7. SOLR-3011.patch
        27 kB
        Mikhail Khludnev

        Activity

        Mikhail Khludnev added a comment -

        Here is the reworked SOLR-2947 patch from 28th Dec; most of the ideas are explained there. The differences are:

        • SOLR-2947 has been separated
        • I left Context.SCOPE_DOC as-is, but I'm not sure it works in multi-thread mode; I will look into it later.

        It's a patch on a patch: it needs to be applied after SOLR-2542, SOLR-2933, SOLR-2947 (here is the pic). I suppose you can only review it (which is quite appreciated, please), but not apply it. I'm ready to refresh it after those are applied.

        Mikhail Khludnev added a comment -

        Just realized that I considered only the cached SQL scenario but didn't look into the simple N+1 select mode. I will add multi-threaded tests for it too.

        Mikhail Khludnev added a comment -

        Ok. I'm attaching a refreshed patch for the core DIH multithreading issue: SOLR-3011.patch.

        Code

        DataImporter.java

        I added DocBuilder.destroy() to stop the thread pool after all work is done. I'm bothered by the test cases' warnings about "thread leaks".

        DocBuilder.java

        • EntityRunner no longer creates EntityProcessors; it obtains them from constructor args
        • EntityProcessors are properly destroyed
        • EntityRunner.docWrapper is removed as not thread-safe; it's passed explicitly via method arguments
        • EntityRunner.entityEnded wasn't thread-safe either; it moved into ThreadedEntityProcessorWrapper
        • object instantiation was drastically amended to be thread-safe
          • a single EntityRunner per Entity
          • a single EntityProcessor per EntityRunner
          • N ThreadedEntityProcessorWrappers per EntityRunner, each using its EntityRunner's EntityProcessor as the delegate
          • where N is the number of threads specified on the root entity (the threads attr is prohibited for child entities)
          • ThreadedEntityProcessorWrappers are numbered by their positions in the EntityRunner's tepw list
          • a parent entity's ThreadedEntityProcessorWrapper always hits the child's tepw with the same number as its own
        • a parent entity's ThreadedEntityProcessorWrapper always hits the child's tepw by a plain synchronous Java call (w/o a thread pool)
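        A minimal sketch of the instantiation scheme described above. The names mirror the DIH classes, but this is illustrative only, not the actual code:

        ```java
        import java.util.ArrayList;
        import java.util.List;

        class Sketch {
            static class EntityProcessor { /* thread-unaware delegate */ }

            static class Wrapper {
                final EntityProcessor delegate; // shared delegate, guarded by callers
                final int slot;                 // position in the runner's wrapper list
                Wrapper child;                  // child wrapper with the SAME slot number
                Wrapper(EntityProcessor d, int slot) { this.delegate = d; this.slot = slot; }
            }

            static class EntityRunner {
                final EntityProcessor processor = new EntityProcessor(); // one per runner
                final List<Wrapper> wrappers = new ArrayList<>();
                EntityRunner(int threads) {
                    for (int i = 0; i < threads; i++) wrappers.add(new Wrapper(processor, i));
                }
            }

            public static void main(String[] args) {
                int threads = 4; // threads attr is given on the root entity only
                EntityRunner parent = new EntityRunner(threads);
                EntityRunner child = new EntityRunner(threads);
                // parent wrapper i is wired to child wrapper i; the parent-to-child
                // call is a plain synchronous method call, no thread pool in between
                for (int i = 0; i < threads; i++)
                    parent.wrappers.get(i).child = child.wrappers.get(i);
                if (parent.wrappers.get(2).child.slot != 2) throw new AssertionError();
                System.out.println("paired slot: " + parent.wrappers.get(2).child.slot);
            }
        }
        ```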

        EntityProcessorWrapper.java

        A protected transformRow() has been extracted from applyTransformer(). I need to reuse the transformer logic for the paged flow, but applyTransformer() has a side effect on the rowcache field.

        ThreadedEntityProcessorWrapper.java

        In addition to all the refactorings above (instantiation and field moves), it contains the core idea of the multithreaded cached entity processor:

        • after a tepw obtains access to the thread-unaware delegate entityProcessor, it needs to pull the whole page: all child rows belonging to the current parent row
        • the whole page is transformed and put into tepw.rowcache, where the rows will be pulled later by the parent entity's tepw
        • the important point is the condition that enables the paged mode; I believe every child entity should be processed in paged mode. See the retrieveWholePage var in TEPW.nextRow()
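        The paged pull can be sketched like this (hypothetical, simplified names; the real wrapper does more):

        ```java
        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.Iterator;
        import java.util.List;
        import java.util.Map;

        class PagedSketch {
            // thread-unaware delegate: yields the child rows for the current parent row
            static class Delegate {
                private final Iterator<Map<String, Object>> it;
                Delegate(List<Map<String, Object>> rows) { this.it = rows.iterator(); }
                Map<String, Object> nextRow() { return it.hasNext() ? it.next() : null; }
            }

            static class Tepw {
                private final Delegate delegate;
                private final Deque<Map<String, Object>> rowCache = new ArrayDeque<>();
                Tepw(Delegate d) { this.delegate = d; }

                Map<String, Object> nextRow() {
                    if (rowCache.isEmpty()) {
                        // pull the WHOLE page for the current parent row while holding
                        // the delegate's monitor, so no other thread can interleave
                        synchronized (delegate) {
                            Map<String, Object> row;
                            while ((row = delegate.nextRow()) != null)
                                rowCache.add(transform(row));
                        }
                    }
                    return rowCache.poll();
                }

                private Map<String, Object> transform(Map<String, Object> row) { return row; }
            }

            public static void main(String[] args) {
                List<Map<String, Object>> page = new ArrayList<>();
                page.add(Map.of("id", 1));
                page.add(Map.of("id", 2));
                Tepw tepw = new Tepw(new Delegate(page));
                int n = 0;
                while (tepw.nextRow() != null) n++;
                System.out.println("rows pulled: " + n);
            }
        }
        ```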

        Tests

        TestThreaded.java

        I've noticed that this test doesn't cover the cached entity processor (where="xid=x.id") and doesn't cover the N+1 usage ("... where y.xid=${x.id}"). There was a single child row per parent. I added both usages with all threads attribute cases.

        TBD

        • I have some suspicions about Context.SCOPE_DOC.
        • even after this patch, multithreaded DIH suffers from SOLR-2961 and SOLR-2804. I need this patch applied to unblock them.
        • it's almost impossible to apply this on 3.5; the whole of SOLR-2382 with its fixes would need to be ported first.

        Thanks

        James Dyer added a comment -

        Mikhail,

        I am very interested in getting the DIH threading bugs fixed. However, it might be a few weeks until I have time to give this issue & patch the attention it deserves. Unless someone beats me to it, I will gladly work with you on getting these fixes committed.

        Wenca Petr added a comment -

        Hi, I've just applied this patch and it solved my problem with multithreaded indexing from SQL using a Berkeley-backed cache, which was opened x times (once per thread) but closed by only one thread, so it remained open. After the patch, the cache is opened only once and properly closed, but each thread seems to index all documents. If I have 5000 documents and 4 threads then full import says: "Added/Updated: 20000 documents".

        Mikhail Khludnev added a comment -

        Petr,

        Your feedback is much appreciated.
        How much is your full indexing time reduced after multithreading is enabled?
        Please be aware that you are at risk of SOLR-2804.

        Wenca Petr added a comment -

        Hi Mikhail,
        I know about 2804; I solved it by disabling logging, as someone advised (I think).

        Without multithreading I was able to index about 15k documents per minute; with 4 threads, an average of about 45k per minute. After applying your patch it seems to have fallen to 30k per minute. But the number of processed documents is wrong. I have 50000 documents to be indexed. I start a full dump; it processes about 44k documents during the first minute, but it continues past 50k to a total of 200k processed, with a decreasing number of docs per minute and a total time of more than 7 minutes. After the commit the index contains 50k documents, which is right.

        Wenca Petr added a comment - - edited

        Well, I solved it. The problem was in the EntityProcessorBase.getNext() method, which set the rowIterator to null when it reached the end of the data. The next thread then called nextRow() on the SqlEntityProcessor instance, where the code is:

        if (rowIterator == null) {
            // null is taken to mean "query not started yet", so the query is
            // re-initialized even if another thread already exhausted it
            String q = getQuery();
            initQuery(context.replaceTokens(q));
        }
        

        So it reinitialized the query and created a new row iterator from the beginning.
        I am attaching a simple patch.
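        One way to express the fix Petr describes is to remember EOF explicitly instead of relying on a null iterator, so a later thread never re-runs the query. This is a hypothetical sketch with illustrative names, not the attached patch itself:

        ```java
        import java.util.Iterator;
        import java.util.List;

        class EofGuard {
            private Iterator<String> rowIterator;
            private boolean ended;   // set once the iterator is exhausted
            private int queriesRun;  // counts query (re)initializations

            private void initQuery() {
                queriesRun++;
                rowIterator = List.of("row1", "row2").iterator(); // stands in for the SQL query
            }

            synchronized String nextRow() {
                if (ended) return null;            // EOF already seen: never re-query
                if (rowIterator == null) initQuery();
                if (rowIterator.hasNext()) return rowIterator.next();
                ended = true;                      // exhausted: mark EOF for all threads
                rowIterator = null;
                return null;
            }

            public static void main(String[] args) {
                EofGuard p = new EofGuard();
                int rows = 0;
                // simulate several threads draining the same processor in turn
                for (int i = 0; i < 10; i++) if (p.nextRow() != null) rows++;
                System.out.println(rows + " rows, " + p.queriesRun + " queries");
            }
        }
        ```

        Without the ended flag, the second drain would re-run the query and double-count every row.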

        Mikhail Khludnev added a comment -

        Petr,

        Setting rowIterator to null makes sense, no doubt. But I want to add a unit test for this particular case before applying this one-line fix. Can you help me better understand the issue you faced? Right now I can't tell what the problem was.

        Thanks in advance.

        Wenca Petr added a comment -

        The problem was that the first thread that reached the end of the data set the iterator to null. When another thread then called SqlEntityProcessor.nextRow(), all the data to be indexed was loaded again and all the documents were analyzed again, because SqlEntityProcessor.nextRow() tests whether the iterator is null (assuming that null means the beginning of the process). With the change I made, all the threads then receive null from the iterator, so they know that there is no more data. Is this explanation ok?

        Mikhail Khludnev added a comment -

        Ok, I've got what the problem is, but it's much more complicated than I could have imagined. MockDataSource, which is used to cover all of DIH, stores iterators, not collections, as its resultsets. So you retrieve such a resultset as an iterator, exhaust it, then retrieve the same exhausted resultset again and find it's empty. That's not the case for a real SQL datasource, which works like a collection: every time you get a brand new resultset. After I amended MockDataSource to provide collections, most of the tests failed. As far as I understand, your particular fix doesn't work for the parent-child case. I'm on the long road to providing a fix.

        Thanks for spotting it. It's absolutely useful.
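        The mock-versus-real difference described above can be demonstrated in a few lines: handing out a stored iterator returns an exhausted cursor on re-query, while a collection yields a fresh iterator every time, as a real SQL datasource effectively does:

        ```java
        import java.util.Iterator;
        import java.util.List;

        class MockVsReal {
            public static void main(String[] args) {
                List<String> rows = List.of("a", "b");

                // "mock" style: the same Iterator object is returned for each query
                Iterator<String> stored = rows.iterator();
                int first = drain(stored), second = drain(stored);

                // "real" style: each query produces a new iterator over the data
                int third = drain(rows.iterator());

                System.out.println(first + " " + second + " " + third);
            }

            static int drain(Iterator<String> it) {
                int n = 0;
                while (it.hasNext()) { it.next(); n++; }
                return n;
            }
        }
        ```

        The second drain of the stored iterator sees nothing, which is why code that re-queries after EOF passes against the mock but reloads everything against a real database.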

        Wenca Petr added a comment -

        You are right. I found it too, trying to index my data with parent-child entities. I think the best way would be to provide some flag for whether the entity is done, tested when a particular row iterator reaches the end. I made another quick-and-dirty fix: I set the iterator to null in getNext() just for non-root entities. For the root entity the iterator is left as in my previous patch.

        James Dyer added a comment -

        I'm just starting to look at the DIH threaded code and the history behind it (SOLR-1352, etc). It seems like a lot of work has been put into this by Noble, Shalin, you and others. Yet, I can't help but wonder if we've gone down a bad path with this one. That is, DIH is essentially a collection of loosely-coupled components: DataSource, Entity, Transformer, etc. So it seems that for this to work, not only does the core (DocBuilder etc) need to be thread-safe, but every component in a given DIH configuration needs to be also.

        There also is quite a bit of code duplication in DocBuilder and classes like ThreadedEntityProcessorWrapper for threaded configurations. In the past, this seems to have caused threaded-only problems like SOLR-2668. Also, the DocBuilder duplication only covers full-imports. The delta-import doesn't look like it supports threading at all? Finally, users can get confused because specifying threads="1" actually does something: it puts the whole import through the threaded code branches instead of the single-thread code branches.

        Then there is the issue of tests. Mikhail, you've just noticed that MockDataSource was not designed to test a multi-threaded scenario in a valid fashion. But I'm not sure even the tests that are just for threads are all that valid. Take a look at TestDocBuilderThreaded. Most of the configurations have "threads=1". And what about this comment:

        /*
         * This test fails in TestEnviroment, but works in real Live
         */
        @Test
        public void testEvaluator() throws Exception

        I'm starting to worry we're playing whack-a-mole with this, and maybe we need a different approach. What if we tried this as a path forward:

        1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases there, as possible.
        2. In 4.0 (or a separate branch), remove threading from DIH.
        3. Refactor code and improve tests.
        4. Make DocBuilder, etc threadsafe.
        5. Create a marker interface or annotation that can be put on DataSources, Entities, Transformers, SolrWriters, etc signifying they can be used safely with threads.
        6. Re-implement the "threads" parameter. Maybe be a bit more thoughtful about how it will work & document it. Do we have it be a thread-per-document (like we have now) or a thread-per-entity (run child threads in parallel, but the root entity is single-threaded)? Can we design it so that both can be supported?
        7. One-by-one, make the DataSources, Entities, etc threadsafe and implement the marker interface, the new annotation, etc.

        I realize that #1-2 present a problem with what has been done here already. The SOLR-3011 patches work on 4.x, and it would be a lot of work to backport them to 3.x. Yet I am proposing removing the current threading entirely from 4.x and "fixing" only 3.x. But I can probably help with porting (some of?) this patch back to 3.x.

        We recently had this comment come from one of our PMC members: "If we would have a better alternative without locale, threading and unicode bugs, I would svn rm." But what I'm seeing is that while Lucene and Solr-core have had a lot of hands in refactoring and improving the code, DIH has only had features piled up onto it. It was mostly written by 1 or 2 people, with limited community support from a code-maintenance perspective. In short, it hasn't gotten the TLC it needs to thrive long-term. Maybe now's the time.

        Comments? Does this seem like a viable way forward?

        Mikhail Khludnev added a comment - - edited

        James,

        So it seems that for this to work, not only does the core (DocBuilder etc) need to be thread-safe, but every component in a given DIH configuration needs to be also.

        To me that's a doubtful statement. I believe it's possible to have a bunch of thread-unsafe classes synchronized by some smart orchestrator.

        There also is quite a bit of code duplication in DocBuilder and classes

        Yep, agreed. ThreadedEPWrapper is a full-import-only dupe of DocBuilder code.

        Mikhail, you've just noticed that MockDataSource was not designed to test a multi-threaded scenario in a valid fashion.

        Not really; they're just odd mocks. With a real DS you get the full resultset from the beginning every time, but after you reach EOF in the MockDS's resultset, re-querying gets you the same EOF.

        Take a look at TestDocBuilderThreaded.

        I've never seen it actually.

        1. Keep 3.x as-is, and make any quick fixes to threads for common use-cases there, as possible.

        No quick fixes are possible for any "common" use-cases, I'm sure.

        2. In 4.0 (or a separate branch), remove threading from DIH.

        I suggest the opposite way:

        • be honest with users and remove "threads" from 3.6. Zero impact here: nobody uses it, it just doesn't work.
        • also, I've already spent enormous effort fixing it in 4.0, and I hope to complete the fix anyway (it will live at github at least). Btw, the reason I'm fixing 4.0 is SOLR-2382; I actually waited some time for it to be committed.

        4. Make DocBuilder, etc threadsafe. 5. Create a marker interface or annotation

        I don't see how that's possible or how it would really help.

        The SOLR-3011 patches work on 4.x .. But I can probably help with porting (some of?) this patch back to 3.x.

        Petr found a case where the patch doesn't work. After (if) I've fixed it, all commits around SOLR-2382 can be cherry-picked to 3.x. Porting the fix w/o DIHCacheSupport will take more time.

        In parallel with my proposals above, I think we really need to start designing a new Ultimate DIH. I propose:

        1. pick up use-cases (you are experienced in extreme caching, I did throughput maximization via an async producer-consumer, Petr will give us his cases, etc.)
        2. sketch a design in PlantUML and check that it's bulletproof
        3. cut it into pieces and scrum it out to the crowd

        Btw, isn't there already something like DIH? Maybe we can just reuse some other OSS tool or library instead of writing it ourselves. Some time ago I heard about something called Kettle; I don't really know what it is.

        Mikhail Khludnev added a comment -

        OK, next attempt.

        The core point is covering the case raised by Petr: re-querying a datasource. I amended MockDataSource to use collections instead of exhaustible iterators. It revealed two things: an incorrect entityEnded flag, which I killed, and wrong two-state logic (non-init/init) in the cache and entityProcessor. I implemented a more correct three-state FSM: non-init, iterating, eof. I also had to push code from ThreadedEPW into plain EPW; AFAIK the overhead of the redundant synchronization is negligible.
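        The three-state lifecycle can be sketched like this (hypothetical names; the real classes differ): non-init moves to iterating on the first query, iterating moves to eof when drained, and eof is terminal, so nothing ever re-initializes the query.

        ```java
        import java.util.Iterator;
        import java.util.List;

        class ThreeStateFsm {
            enum State { NOT_INITIALIZED, ITERATING, EOF }

            private State state = State.NOT_INITIALIZED;
            private Iterator<String> rows;

            synchronized String nextRow() {
                switch (state) {
                    case NOT_INITIALIZED:
                        rows = List.of("x", "y").iterator(); // run the query exactly once
                        state = State.ITERATING;
                        // fall through
                    case ITERATING:
                        if (rows.hasNext()) return rows.next();
                        state = State.EOF; // terminal: later callers just get null
                        return null;
                    default:
                        return null;
                }
            }

            public static void main(String[] args) {
                ThreeStateFsm fsm = new ThreeStateFsm();
                int n = 0;
                for (int i = 0; i < 5; i++) if (fsm.nextRow() != null) n++;
                System.out.println("rows: " + n + ", final state: " + fsm.state);
            }
        }
        ```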

        Delta between the patches for fast review: https://github.com/m-khl/solr-patches/commit/eb49a0024fc6e48d5d19fd784161fc7d331a3150

        Petr, could you check that it works? You can take the already-applied patch from https://github.com/m-khl/solr-patches/tree/solr3011

        Mikhail Khludnev added a comment -

        Hold on. I verified my patch against my favourite test, TestThreaded, but I've just run the tests across the whole dataimport package and there are massive sporadic failures. Nightmare. I'm close to giving up.

        Mikhail Khludnev added a comment -

        Ok, the massive sporadic test failures have been fixed. The core problem I faced was correct EntityProcessor initialization: it should be done under the delegate EntityProcessor's monitor. The proposed solution breaks TestDocBuilderThreaded.testProcessor2EntitiesNoThreads(), which I ignored. I don't think that's a problem.

        But now I agree with James much more. It's hard to maintain code ever.

        James,
        btw, what is the importance of the delta-import scenario? I see it can be done via full-import instead: http://wiki.apache.org/solr/DataImportHandlerFaq#My_delta-import_goes_out_of_memory_._Any_workaround_.3F Does it make sense to maintain a separate code path for it?

        James Dyer added a comment -

        what is the importance of delta import scenario? I see it can be done via full import instead

        This just provides two ways you can set up your deltas. Personally I like to use "full-import w/ clean=false" to do deltas; it's much more flexible. With that said, people use the other way and it works. The bad thing, I think, is that it is not documented that "threads" doesn't work with it. Also, surely the code can be cleaned up so it is not an entirely different branch from "full-import".

        It's hard to maintain this code.

        This is why I think removing "threads" in trunk is still the way to go. This will give us a good starting point for improving the code. We can add a better-designed version of "threads" later.

        With that said, I think you've probably fixed "threads" for a lot of use-cases. Why can't we back-port this to 3.x and document for the users what works and doesn't work with "threads", with warnings to test thoroughly before going to production? If it means back-porting the cache refactorings first, so be it.

        I know you were thinking we really should start over with "Ultimate DIH". That's fine if people want to do that. But I'm using the existing DIH for some pretty complex things and it works great. My issue with DIH is not that it isn't a good product. It's just that it needs some work internally if we're going to be able to continue to improve it from here.

        Mikhail Khludnev added a comment -

        If it means back-porting the cache refactorings first, so be it.

        Ah, I've got your point. Sure. In this case it will be vanilla cherry-picking.

        James Dyer added a comment -

        Changing the fix version to 3.6. For trunk, I would like to remove the "threads" feature.

        James Dyer added a comment -

        Here is a cleaned-up version of the last patch.

        • Simplified "TestThreaded".
        • Added a logged deprecation warning that "threads" will be removed in a future release.
        • Ran the DIH tests a few times; everything passed.

        I will commit this shortly to the 3.x branch.

        Mikhail Khludnev added a comment -

        James,

        I'm glad to hear it. Let me know if you would like me to refresh the patches at SOLR-2961 and SOLR-2804. They are also blockers for using "threads".

        James Dyer added a comment -

        That would be great if you can. Lucene/Solr 3.6 is going to be the last 3.x release and it is closing for new functionality soon. SOLR-2804 for sure looks like something that should be there. Is SOLR-2961 just for Tika?

        Mikhail Khludnev added a comment -

        Is SOLR-2961 just for Tika?

        Yep, it seems so. Why do you ask? Do we not need to support it further?

        James Dyer added a comment -

        Committed branch_3x (only): r1303949

        Thank you Mikhail! I realize this took a lot of patience and thankless work on your part.

        Bernd Fehling added a comment - edited

        Just tried multi-threaded import. It creates the required number of threads (seen in the debugger) but only runs once.
        My configuration is:

        <dataConfig>
                <dataSource name="filetraverser" type="FileDataSource" encoding="UTF-8" />
                <document>
                <entity name="basedata" processor="FileListEntityProcessor"
                        threads="4"
                        rootEntity="false"
                        fileName="\.xml$"
                        recursive="true"
                        dataSource="null"
                        baseDir="/srv/www/solr/DATA/OAI"
                        >
                        <entity name="records" processor="XPathEntityProcessor"
                                threads="4"
                                rootEntity="true"
                                dataSource="filetraverser"
                                stream="true"
                                forEach="/documents/document"
                                url="${basedata.fileAbsolutePath}"
                                >
                                <field column="id"      xpath="/documents/document/@id" />
                                <field column="dctitle" xpath="/documents/document/element[@name='dctitle']/value" />
                        </entity>
                </entity>
                </document>
        </dataConfig>
        

        It should read all files below baseDir and build documents from the records inside the files.
        This works fine in non-multi-threaded mode but only reads the first file in multi-threaded mode.
        Any idea?

        And another thing to mention, in TestThreaded.java there are the lines:

          @Test
          public void testCachedThreadless_FullImport() throws Exception {
            runFullImport(getCachedConfig(random.nextBoolean(), random.nextBoolean(), 0));
          }
          
          @Test
          public void testCachedSingleThread_FullImport() throws Exception {
            runFullImport(getCachedConfig(random.nextBoolean(), random.nextBoolean(), 1));
          }
          
          @Test
          public void testCachedThread_FullImport() throws Exception {
            int numThreads = random.nextInt(9) + 1; // between 1 and 9
            String config = getCachedConfig(random.nextBoolean(), random.nextBoolean(), numThreads);
            runFullImport(config);
          }
        

        This tests 0, 1, and a random value between 1 and 9. But 1 is already covered.
        So wouldn't it be better to use "random.nextInt(8) + 2" for the range 2 to 9?
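        The arithmetic behind the suggestion checks out: Random.nextInt(n) returns a value in [0, n), so nextInt(8) + 2 yields exactly the range 2..9. A quick self-contained check (not code from the test suite):

```java
import java.util.Random;

// Empirical check that nextInt(8) + 2 stays within [2, 9], i.e. the
// multi-thread test would no longer overlap the dedicated 0-thread and
// 1-thread tests.
public class ThreadCountRangeCheck {
  public static void main(String[] args) {
    Random random = new Random();
    for (int i = 0; i < 100_000; i++) {
      int numThreads = random.nextInt(8) + 2; // between 2 and 9 inclusive
      if (numThreads < 2 || numThreads > 9) {
        throw new AssertionError("out of range: " + numThreads);
      }
    }
    System.out.println("all samples in [2, 9]");
  }
}
```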

        James Dyer added a comment -

        Bernd,

        Do you think this is a new bug since SOLR-3011 was applied to the 3.6 branch, or does this also fail this way for you with 3.5? Also, is there any way you can provide a failing unit test? If so, open a new issue and attach your unit test in a patch. As you might be aware, "threads" is removed from 4.0, mostly because of bugs like this one. I'd be interested in getting this fixed in the 3.x branch especially if this is newly-caused by SOLR-3011.

        You're right that the random test is redundant for the 1-thread case. No harm in what's there, but it would be better if the random test did 2-10, not 1-10.

        Bernd Fehling added a comment -

        It works with 3.5 but fails after applying the SOLR-3011 patch.
        I know that "threads" will be removed with 4.0, but nevertheless it is helpful when loading tons of data on a multi-CPU machine.
        I'll try to provide a unit test as a new issue with a reference to this one.

        Stupid question: if "threads" is removed with 4.0, does this mean the redesign of dataimport will be multi-threaded by default so that no "threads" parameter is needed, or will it be scaled down to single-threaded?

        James Dyer added a comment -

        If the changes in 3.6 break FileListEntityProcessor then we should try and fix it. A failing unit test would help a lot. As a workaround, you should always be able to use the 3.5 jar with 3.6.

        4.0 is only going to support single-threaded DIH configurations. I understand that some users have gotten performance gains using "threads" and haven't had problems. I suspect these were mostly cases like yours, where you're processing text documents and have a somewhat simple configuration. But looking at the code, I don't think we can guarantee that DIH using the "threads" parameter will never encounter race conditions, etc., and some configurations (especially those using SQL, caching, etc.) were not working at all (which SOLR-3011 at least mostly fixes). It was also getting hard to add new features because all bets were pretty much off as to whether any changes would work with "threads".

        Long term, I would like to see some type of multi-threading added back in. But we do need to refactor the code first. I am now looking at consolidating some of the objects that DIH passes around, reducing member visibility, making things immutable, etc. Some of the classes need to be made simpler (DocBuilder comes to mind). Hopefully we can end up with a code base that can more easily be made threadsafe in the future.


          People

          • Assignee: James Dyer
          • Reporter: Mikhail Khludnev
          • Votes: 1
          • Watchers: 4