Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Labels:
      None

      Description

      Making DIH multithreaded has been a long-pending request. Now that we have implemented most of the features, the next best thing we can aim for is performance. DIH should be able to take advantage of multiple cores in a box. I expect the configuration to be as follows:

      <entity name="foo" threads="4">
      <!--more stuff goes here-->
      </entity>
      

      DIH should fork into multiple threads at the entity where the threads attribute is specified. If threads < 2, it executes without forking. In debug mode it automatically becomes single-threaded.
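
      The rule described above can be sketched roughly as follows. This is only an illustration and not code from the attached patch: the class ThreadCountRule and the method effectiveThreads are hypothetical names, and the fallback to one thread for a missing or unparsable value is an assumption.

```java
// Hypothetical helper, not taken from SOLR-1352.patch. It only illustrates
// the rule stated in the description.
final class ThreadCountRule {
    private ThreadCountRule() {}

    /**
     * Returns the number of worker threads to use for an entity.
     *
     * @param threadsAttr raw value of the entity's "threads" attribute, or null if absent
     * @param debugMode   true when the import runs in debug mode
     */
    static int effectiveThreads(String threadsAttr, boolean debugMode) {
        if (debugMode || threadsAttr == null) {
            // Debug mode is always single-threaded; a missing attribute
            // is assumed to mean "no forking".
            return 1;
        }
        int requested;
        try {
            requested = Integer.parseInt(threadsAttr.trim());
        } catch (NumberFormatException e) {
            // Assumption: an unparsable value falls back to one thread.
            return 1;
        }
        // threads < 2 means "execute without forking".
        return requested < 2 ? 1 : requested;
    }
}
```

      With this sketch, threads="4" yields four workers outside debug mode, while threads="1", threads="0", or any value in debug mode yields a single thread.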

      1. SOLR-1352.patch
        44 kB
        Noble Paul
      2. SOLR-1352.patch
        26 kB
        Noble Paul
      3. SOLR-1352.patch
        37 kB
        Noble Paul

        Issue Links

          Activity

          cmd added a comment -

          Hi Russell Teabeault, I hit the same problem, and in my case it is serious:

          <dataConfig>
          <dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@192.168.0.5:1521:orcl" user="TESTSOLR" password="TESTSOLR" />
          <document name="doc">
          <entity name="A" transformer="ClobTransformer" threads="8" query="select * from A " pk="id">
          <field column="COL1" clob="true" name="COL1" />
          <field column="COL1" clob="true" name="COL2" />

          <entity name="B" query="select * from B WHERE AID='${A.ID}' " pk="id" processor="CachedSqlEntityProcessor">
          <field column="COL3" clob="true" name="COL2" />
          <field column="COL4" clob="true" name="COL4" />
          </entity>
          </entity>
          </document>
          </dataConfig>

          Table A: 10 million rows
          Table B: 20 million rows

          The DIH processor is very unstable. It always throws the exception
          "Closed Connection: next".
          Please provide some information. Thanks.

          Noble Paul added a comment -

          You may open another issue, because this one is already closed.

          Russell Teabeault added a comment -

          Ok. The package of SolrQueryResponse changed from org.apache.solr.request to org.apache.solr.response. Because org.apache.solr.response.SolrQueryResponse does not exist in 1.4 I changed the necessary code in the trunk to use org.apache.solr.request.SolrQueryResponse so that I could get everything to compile. I then replaced the dataimporthandler jar file in the 1.4 version of the solr war.

          So with one thread everything works fine. I then set the root entity to use 4 threads. I often get the following exception almost immediately after starting:

          org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLRecoverableException: Closed Connection: next
          at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
          at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:337)
          at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$600(JdbcDataSource.java:226)
          at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:260)
          at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:75)
          at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
          ...

          However after 5 to 10 attempts it runs without problem. This is being run against an Oracle 10g database using the 11g driver. Also we have 4 sub-entities. Not sure if I should open a new defect for this or if this has been seen by other people? Thoughts?

          Russell Teabeault added a comment -

          I believe some incompatible changes have been made, so it is not as easy as just dropping in the DIH jar. I took the 1.4 war and replaced the DIH jar with a current build of the 1.5 DIH jar. When accessing the dataimport.jsp file I got the following error:

          org.apache.solr.handler.RequestHandlerBase.handleRequestBody(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/request/SolrQueryResponse;)V

          java.lang.AbstractMethodError: org.apache.solr.handler.RequestHandlerBase.handleRequestBody(Lorg/apache/solr/request/SolrQueryRequest;Lorg/apache/solr/request/SolrQueryResponse;)V
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          ....

          Any ideas? Thanks.

          Noble Paul added a comment -

          Why do you need to apply a patch? DIH is a separate .jar file. Just take the jar from the latest build and use it to replace the one in your 1.4 release.

          Russell Teabeault added a comment -

          I was trying to apply the patch against 1.4 and was unsuccessful. Has anyone created a patch for this against 1.4, or could someone give me some advice about creating one? Thanks.

          Alexey Serba added a comment -

          I've tested this feature in trunk on quite a large data set (the Freebase WEX Wikipedia extract loaded into an Oracle database). It seems to be working OK.

          The only thing I noticed is that the multithreaded implementation produces overly verbose logging. Could you please change the log level to DEBUG for the following events:

          2010-07-16 20:35:50,199 INFO dataimport.ThreadedEntityProcessorWrapper - arow : {id=12345, body=blahblah, title=Title}

          2010-07-16 20:35:50,201 INFO dataimport.DocBuilder - a row on docrootSolrInputDocument[{id=id(1.0)={12345}, body=body(1.0)={blahblah}, title=title(1.0)={Title}}]

          Noble Paul added a comment -

          The read and the write are done by the same thread, but if you have 4 threads, 4 documents will be processed in parallel.
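
          A rough sketch of that model: each worker pulls a row, builds the document, and writes it itself, so N workers handle N documents at a time. This is not the code from the patch. The class ParallelRowsSketch, the shared synchronized iterator, and the "doc:" strings standing in for written documents are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: every worker both reads a row and writes the
// resulting document, so `threads` documents are in flight at once.
final class ParallelRowsSketch {
    private ParallelRowsSketch() {}

    static List<String> importAll(Iterator<Map<String, Object>> rows, int threads) {
        // Stand-in for the index: a thread-safe list of "written" documents.
        List<String> written = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                while (true) {
                    Map<String, Object> row;
                    // Assumption: the source is shared, so reads are serialized.
                    synchronized (rows) {
                        if (!rows.hasNext()) {
                            return; // no more rows, this worker is done
                        }
                        row = rows.next();
                    }
                    // The same thread that read the row also "writes" the document.
                    written.add("doc:" + row.get("id"));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("import did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return written;
    }
}
```

          Calling importAll(rows.iterator(), 4) processes every row exactly once, in no guaranteed order.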

          David Smiley added a comment -

          Without having to read the patch, can someone describe in more detail the nature of DIH multi-threading? I can figure what it would mean to have two threads, one to read from the data provider and one to write to Solr, with a queue in between. But it's not clear what's going on here, since threads can be > 2.
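
          For reference, the two-thread arrangement described in this question (one reader, one writer, a queue in between) might look like the sketch below. It illustrates only that hypothetical model, not what the patch does. The class ReaderWriterQueueSketch, the queue capacity of 16, and the "indexed:" strings standing in for writes to Solr are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration of a reader thread and a writer thread
// connected by a bounded queue.
final class ReaderWriterQueueSketch {
    private ReaderWriterQueueSketch() {}

    static List<String> run(List<String> sourceRows) {
        // An empty Optional marks the end of the input.
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(16);
        List<String> written = new ArrayList<>(); // touched only by the writer thread

        Thread reader = new Thread(() -> {
            try {
                for (String row : sourceRows) {
                    queue.put(Optional.of(row)); // blocks when the queue is full
                }
                queue.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                Optional<String> item;
                while ((item = queue.take()).isPresent()) {
                    written.add("indexed:" + item.get()); // stand-in for a write to Solr
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        try {
            reader.join();
            writer.join(); // join makes the writer's additions visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return written;
    }
}
```

          With a single reader and a single writer the queue preserves input order, which is one difference from a model in which several workers process documents in parallel.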

          Noble Paul added a comment -

          Committed in revision 898209.

          Noble Paul added a comment -

          More or less final. I plan to commit this soon

          Noble Paul added a comment -

          updated to trunk

          Noble Paul added a comment -

          'numThreads' becomes 'threads'

          Noble Paul added a comment -

          First cut: an ugly patch. A lot of work is left before putting it in.

          Avlesh Singh added a comment -

          Thanks once again for creating the ticket, Noble.
          Here's the last discussion on the topic, "Support for batch processing of commands using parallel threads in DIH" - http://www.lucidimagination.com/search/document/a9b26ade46466ee/queries_regarding_a_paralleldataimporthandler


            People

            • Assignee:
              Noble Paul
              Reporter:
              Noble Paul
            • Votes:
              2
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved:

                Development