SOLR-2186

DataImportHandler multi-threaded option throws exception

    Details

      Description

      The multi-threaded option for the DataImportHandler throws an exception and the entire operation fails. This happens even if only one thread is configured via threads='1'.

      1. SOLR-2186.patch
        5 kB
        Shalin Shekhar Mangar
      2. TestTikaEntityProcessor.patch
        4 kB
        Frank Wesemann
      3. SOLR-2186.patch
        2 kB
        Frank Wesemann
      4. TestDocBuilderThreaded.java
        7 kB
        Frank Wesemann
      5. SOLR-2186.patch
        2 kB
        Frank Wesemann
      6. Solr-2186.patch
        1 kB
        Frank Wesemann
      7. TikaResolver.patch
        4 kB
        Lance Norskog

          Activity

          Robert Muir added a comment -

          bulk close for 3.4

          Shalin Shekhar Mangar added a comment -

          Frank, I've opened SOLR-2655 for related issues. I may not have time to look into these soon, so I'd advise people not to use multi-threaded mode for the time being.

          Shalin Shekhar Mangar added a comment -

          Committed revision 1147023 on trunk and 1147033 on branch_3x.

          Thanks!

          Frank Wesemann added a comment -

          Thanks for taking this issue, Shalin.
          You might close SOLR-2544 along with this one.

          Shalin Shekhar Mangar added a comment -

          A patch which fixes the NPE with multi-threaded mode.

          The problem was that the resolver is supposed to be looked up just in time for the current entity, but ThreadedContext delegated the getResolvedEntityAttribute method to its superclass, which did not have a resolver. The fix was to override getResolvedEntityAttribute correctly.

          I included Frank's TestTikaEntityProcessor patch in this patch.

          This does not solve the problem with the evaluators - I'll add a patch in SOLR-2463 to fix it.
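
For readers following the thread, the failure mode and the shape of the fix described above can be sketched in isolation. This is a minimal illustration, not the committed patch: `SimpleResolver`, `BaseContext` and `JustInTimeContext` are hypothetical stand-ins for the DIH classes, and the `Supplier` used to fetch the current resolver is an assumption made only for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a variable resolver: replaces ${name} tokens. */
class SimpleResolver {
  private final Map<String, String> vars;

  SimpleResolver(Map<String, String> vars) {
    this.vars = vars;
  }

  String replaceTokens(String template) {
    String out = template;
    for (Map.Entry<String, String> e : vars.entrySet()) {
      out = out.replace("${" + e.getKey() + "}", e.getValue());
    }
    return out;
  }
}

/** Hypothetical base context: uses a resolver fixed at construction time. */
class BaseContext {
  protected final Map<String, String> entityAttrs;
  private final SimpleResolver resolver; // null in the threaded case

  BaseContext(Map<String, String> entityAttrs, SimpleResolver resolver) {
    this.entityAttrs = entityAttrs;
    this.resolver = resolver;
  }

  String getResolvedEntityAttribute(String name) {
    // Throws NullPointerException when no resolver was supplied.
    return resolver.replaceTokens(entityAttrs.get(name));
  }
}

/** Hypothetical threaded context: looks the resolver up at call time. */
class JustInTimeContext extends BaseContext {
  private final Supplier<SimpleResolver> currentResolver;

  JustInTimeContext(Map<String, String> entityAttrs,
                    Supplier<SimpleResolver> currentResolver) {
    super(entityAttrs, null); // the superclass never receives a resolver
    this.currentResolver = currentResolver;
  }

  @Override
  String getResolvedEntityAttribute(String name) {
    // Do not delegate to super; fetch the resolver for the current entity.
    return currentResolver.get().replaceTokens(entityAttrs.get(name));
  }
}
```

With a null resolver, the base class method fails with a NullPointerException, while the overriding method resolves the attribute through whatever resolver is current when it is called.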

          Frank Wesemann added a comment -

          Adds a test for <entity threads="1" ...>

          Frank Wesemann added a comment -

          The previous attachment had the wrong file extension.

          Frank Wesemann added a comment -

          A test case and an improved patch.
          The patch also changes VariableResolverImpl so that defaults can no longer be null.

          Robert Muir added a comment -

          Hi Lance/Frank,

          Thanks for working on this issue.

          Any ideas on how we could write a JUnit test that shows the problem?
          This would make it easier to evaluate the patch and would make it possible to prevent regressions.

          Frank Wesemann added a comment - - edited

          This improved patch addresses the issue that Evaluators (classes inheriting from o.a.s.dih.Evaluator) used on an arbitrary entity attribute may hit an empty Context in "threaded" mode.

          This is done by calling Context.CURRENT_CONTEXT.set(context); before initEntity() is called.

          This patch may also address SOLR-2463 (or at least give a hint that CURRENT_CONTEXT is set too late somewhere else).
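
The ordering issue can be illustrated with a small sketch. The `.set(context)` call suggests that CURRENT_CONTEXT is a thread-local holder; assuming so, any code that reads it during initialization sees nothing unless the worker thread publishes the context first. All class and method names below (`CurrentContextHolder`, `NameEvaluator`, `EntityWorker`) are hypothetical and only mimic the idea; this is not the DIH code.

```java
/** Hypothetical per-thread holder mirroring the idea of a "current context". */
class CurrentContextHolder {
  static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
}

/** Hypothetical evaluator that reads the per-thread context when invoked. */
class NameEvaluator {
  String evaluate() {
    String ctx = CurrentContextHolder.CURRENT.get();
    if (ctx == null) {
      throw new IllegalStateException("no current context on this thread");
    }
    return "evaluated in " + ctx;
  }
}

/** Hypothetical worker: publishes the context before initializing the entity. */
class EntityWorker {
  static String runEntity(String context, NameEvaluator evaluator) {
    CurrentContextHolder.CURRENT.set(context); // set first ...
    try {
      return initEntity(evaluator);            // ... then initialize
    } finally {
      CurrentContextHolder.CURRENT.remove();   // do not leak into pooled threads
    }
  }

  private static String initEntity(NameEvaluator evaluator) {
    // Initialization may already trigger evaluators.
    return evaluator.evaluate();
  }
}
```

If the `set` call came after `initEntity`, the evaluator would find no context on the worker thread, which matches the empty Context described above.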

          Frank Wesemann added a comment -

          See my previous comment.

          Frank Wesemann added a comment - - edited

          The added patch addresses the problem that EntityProcessors do not have a usable VariableResolver in their init() method.
          This is done in the EntityRunner's runAThread() method by first initializing the EntityProcessorWrapper and only then initializing the EntityProcessor.
          With this order, the corresponding namespaces are created on the VariableResolver before the EntityProcessor can use it.

          Additionally, I changed the log level for the "adding a row" messages to "debug".

          This patch does not solve the problem described in SOLR-2544.
          As a workaround, EntityProcessors may call context.getVariableResolver().replaceTokens().
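
Why the initialization order matters can be shown with a toy model. The sketch assumes, as the comment above describes, that initializing the wrapper registers namespaces on the resolver; `NamespaceResolver`, `WrapperStub` and `ProcessorStub` are hypothetical names, and the values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver that looks up "namespace.key" style names. */
class NamespaceResolver {
  private final Map<String, Map<String, String>> namespaces = new HashMap<>();

  void addNamespace(String name, Map<String, String> values) {
    namespaces.put(name, values);
  }

  /** Returns null when the name is malformed or the namespace is missing. */
  String resolve(String dotted) {
    int dot = dotted.indexOf('.');
    if (dot < 0) {
      return null;
    }
    Map<String, String> ns = namespaces.get(dotted.substring(0, dot));
    return ns == null ? null : ns.get(dotted.substring(dot + 1));
  }
}

/** Hypothetical wrapper whose init() registers the namespace. */
class WrapperStub {
  void init(NamespaceResolver resolver) {
    Map<String, String> values = new HashMap<>();
    values.put("fileAbsolutePath", "/tmp/a.pdf");
    resolver.addNamespace("jc", values);
  }
}

/** Hypothetical processor whose init() already needs a resolved value. */
class ProcessorStub {
  String url;

  void init(NamespaceResolver resolver) {
    url = resolver.resolve("jc.fileAbsolutePath");
  }
}
```

Calling `ProcessorStub.init` before `WrapperStub.init` leaves `url` unresolved; calling the wrapper first makes the value available, which is the reordering the patch describes.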

          Lance Norskog added a comment -

          "Lance, can you update this patch and add a unit test?"

          Sorry Grant, this wasn't on my watch list. This patch is not a patch to fix it, it is a patch to demonstrate the problem. I don't know the right way to solve this.

          Fuad Efendi added a comment -

          I resolved this issue for SQL in SOLR-2233; it was caused by thread A closing a connection that thread B still needed.

          Grant Ingersoll added a comment -

          Lance, can you update this patch and add a unit test?

          Lance Norskog added a comment -

          Two answers:
          1) Try it and see. You'll find the usage soon enough.
          2) TikaEntityProcessor, branch 3.x, line 96:

            public Map<String, Object> nextRow() {
              if(done) return null;
              Map<String, Object> row = new HashMap<String, Object>();
              DataSource<InputStream> dataSource = context.getDataSource();
              InputStream is = dataSource.getData(context.getResolvedEntityAttribute(URL)); // <-- resolver is used here
              ContentHandler contentHandler = null;
              Metadata metadata = new Metadata();
          
          Fuad Efendi added a comment -

          I can't find any usage of resolver in TikaEP.nextRow(); am I missing something?
          Thanks

          Lance Norskog added a comment - - edited

          This patch changes the DataImportHandler so that the TikaEntityProcessor works under threads.

          The technique is to pass in a resolver when creating a ThreadedContext (wrapper). This allows TikaEP.firstInit() to work. However, TikaEP.nextRow() is called with a context that has no functioning resolver, so TikaEP caches the resolver given in firstInit() and uses it during nextRow() instead of the one it should use. Even so, the parsed text is written to the logger in addition to being indexed.

          This is not intended as a fix; it merely demonstrates the problem.

          The patch was made with 'git diff', which I still haven't mastered; some 'patch' programs may not like it.

          Lance Norskog added a comment -

          I've tracked it down. The ThreadedContext object is built without a resolver. A comment in the code says that the resolver will be set dynamically, but it is not.

          The ThreadedContext resolver is called in the "firstInit" methods of TikaEntityProcessor, LineEntityProcessor, and XPathEntityProcessor. TikaEntityProcessor also calls it in nextRow.

            public class ThreadedContext extends ContextImpl {
              private DocBuilder.EntityRunner entityRunner;
              private boolean limitedContext = false;

              public ThreadedContext(DocBuilder.EntityRunner entityRunner, DocBuilder docBuilder) {
                super(entityRunner.entity,
                      null, // to be fetched realtime
                      null,
                      null,
                      docBuilder.session,
                      null,
                      docBuilder);
                this.entityRunner = entityRunner;
              }

          I hacked DocBuilder.java to pass in a resolver, and that allowed the TikaEP to function during firstInit. Then the entity attribute resolver failed in the nextRow method.

          TikaEP is the only class that calls the entity attribute resolver outside of the firstInit() call. Is it possible to change TikaEP to use the resolver only in firstInit?

          Lance Norskog added a comment - - edited

          This is the dataConfig.xml. It is very simple: it walks a directory and indexes every PDF file it finds.
          If you change threads='4' to threads='1', it will still fail. If you remove the threads directive, it runs.

          <dataConfig>
             <dataSource type="BinFileDataSource"/>
             <document>
               <entity name="jc" dataSource="null"
                       pk="id"
                       processor="FileListEntityProcessor"
                       fileName="^.*\.pdf$" recursive="false"
                       baseDir="/lucid/private_pdfs/10.pdfs"
                       transformer="TemplateTransformer"
                       threads='4'
                       >
          
                  <field column="id" template="${jc.fileAbsolutePath}"/>
          
                  <entity name="tika-test" processor="TikaEntityProcessor"
                          url="${jc.fileAbsolutePath}"
                          parser="org.apache.tika.parser.pdf.PDFParser"
                          onError="skip"
                          >
                          <field column="Author" name="author" meta="true"/>
                          <field column="title" name="title" meta="true"/>
                          <field column="text" name="text"/>
                  </entity>
                </entity>
              </document>
          </dataConfig>
          
          Lance Norskog added a comment -

          This is the stack trace. The operation configures 4 threads and then does a full-import:

          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder doFullDump
          INFO: running multithreaded full-import
          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
          INFO: arow :

          {fileSize=18837, fileLastModified=Wed Nov 21 08:15:23 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.1.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.1.pdf}

          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
          INFO: arow :

          {fileSize=289898, fileLastModified=Wed Nov 21 08:15:25 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.10.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.10.pdf}

          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
          INFO: arow :

          {fileSize=121847, fileLastModified=Wed Nov 21 08:15:43 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.100.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.100.pdf}

          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
          INFO: arow :

          {fileSize=59844, fileLastModified=Wed Nov 21 08:18:49 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.1000.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.1000.pdf}

          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder doFullDump
          SEVERE: error in import
          java.lang.NullPointerException
          at org.apache.solr.handler.dataimport.ContextImpl.getResolvedEntityAttribute(ContextImpl.java:79)
          at org.apache.solr.handler.dataimport.ThreadedContext.getResolvedEntityAttribute(ThreadedContext.java:78)
          at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:67)
          at org.apache.solr.handler.dataimport.EntityProcessorBase.init(EntityProcessorBase.java:56)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.initEntity(DocBuilder.java:507)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:425)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:386)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:453)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:340)
          at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:393)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
          Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder finish
          INFO: Import completed successfully
          Oct 21, 2010 10:21:16 PM org.apache.solr.update.DirectUpdateHandler2 commit
          INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)


            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Lance Norskog
            • Votes:
              2
              Watchers:
              3
