      Description

      Apache TIKA 1.6 came out yesterday; we should upgrade to it.

      The dependencies of the bundled Apache POI changed: xmlbeans was upgraded (already done) and dom4j is now obsolete. We have to verify the dependency tree carefully!

      1. SOLR-6488.patch
        49 kB
        Uwe Schindler
      2. SOLR-6488.patch
        46 kB
        Uwe Schindler
      3. SOLR-6488.patch
        23 kB
        Uwe Schindler

          Activity

          Uwe Schindler added a comment -

          Initial patch with updated lib versions.
          Some dependencies for crazy parsers are still missing; I will review them.

          The current test suite fails because some of the parsers seem to add a new metadata field:

             [junit4] Started J0 PID(7180@VEGA).
             [junit4] Suite: org.apache.solr.handler.extraction.ExtractingRequestHandlerTest
             [junit4]   2> Creating dataDir: C:\Users\Uwe Schindler\Projects\lucene\trunk-lusolr1\solr\build\contrib\solr-cell\test\J0\.\temp\solr.handler.extraction.ExtractingRequestHandlerTest-3D229694F89D0471-001\init-core-data-001
             [junit4]   2> log4j:WARN No appenders could be found for logger (org.apache.solr.SolrTestCaseJ4).
             [junit4]   2> log4j:WARN Please initialize the log4j system properly.
             [junit4]   2> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testLiterals -Dtests.seed=3D229694F89D0471 -Dtests.locale=sq -Dtests.timezone=SystemV/AST4ADT -Dtests.file.encoding=US-ASCII
             [junit4] ERROR   0.14s | ExtractingRequestHandlerTest.testLiterals <<<
             [junit4]    > Throwable #1: org.apache.solr.common.SolrException: ERROR: [doc=three] unknown field 'X-Parsed-By'
             [junit4]    >        at __randomizedtesting.SeedInfo.seed([3D229694F89D0471:D30A12C89093A342]:0)
             [junit4]    >        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:183)
             [junit4]    >        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:79)
             [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:238)
             [junit4]    >        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
             [junit4]    >        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
             [junit4]    >        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
             [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:895)
             [junit4]    >        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:692)
             [junit4]    >        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
             [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
             [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
             [junit4]    >        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
             [junit4]    >        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
             [junit4]    >        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
             [junit4]    >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1985)
             [junit4]    >        at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:317)
             [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:619)
             [junit4]    >        at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testLiterals(ExtractingRequestHandlerTest.java:275)
             [junit4]    >        at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testPlainTextSpecifyingResourceName -Dtests.seed=3D229694F89D0471 -Dtests.locale=sq -Dtests.timezone=SystemV/AST4ADT -Dtests.file.encoding=US-ASCII
          

          I have not yet verified what this field contains. Maybe TIKA adds it with a static value; in that case we should ignore it, because we don't need to index the same field with the same content for every document.

          In addition, dom4j was removed from TIKA, but something in solr-core still needs dom4j.jar. This is a really outdated and no longer usable lib. Can we nuke it? Solr itself is not using it, so I think maybe Hadoop? If Mark Miller has an idea who depends on this, I would be happy. Also, the dependency validator complains about a circular dependency:

          [libversions]   circular dependency found: dom4j#dom4j;1.6.1->jaxen#jaxen;1.1-beta-6->dom4j#dom4j;1.5.2
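
          For readers unfamiliar with this kind of message, here is a minimal sketch of how such a cycle can be detected. It is not the actual libversions validator; the class `DepCycleSketch` and its methods are hypothetical, and the only input assumed is the pair of edges implied by the message above. The key point is that versions are ignored when comparing modules, which is why dom4j 1.6.1 and dom4j 1.5.2 close a cycle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not the real libversions check. */
final class DepCycleSketch {

  /** Strips the version part: "dom4j#dom4j;1.6.1" becomes "dom4j#dom4j". */
  static String module(String coordinate) {
    int semi = coordinate.indexOf(';');
    return semi < 0 ? coordinate : coordinate.substring(0, semi);
  }

  /** Returns the first cycle found as a list of coordinates, or an empty list. */
  static List<String> findCycle(Map<String, List<String>> deps) {
    // Index edges by module so different versions of one module are treated alike.
    Map<String, List<String>> byModule = new HashMap<>();
    for (Map.Entry<String, List<String>> e : deps.entrySet()) {
      byModule.computeIfAbsent(module(e.getKey()), k -> new ArrayList<>())
          .addAll(e.getValue());
    }
    Set<String> done = new HashSet<>(); // modules fully explored without a cycle
    for (String start : deps.keySet()) {
      List<String> path = new ArrayList<>();
      path.add(start);
      List<String> cycle = dfs(start, byModule, path, done);
      if (!cycle.isEmpty()) {
        return cycle;
      }
    }
    return Collections.emptyList();
  }

  private static List<String> dfs(String coordinate, Map<String, List<String>> byModule,
                                  List<String> path, Set<String> done) {
    String mod = module(coordinate);
    if (done.contains(mod)) {
      return Collections.emptyList();
    }
    List<String> next = byModule.getOrDefault(mod, Collections.<String>emptyList());
    for (String dep : next) {
      // A dependency whose module is already on the current path closes a cycle.
      for (int i = 0; i < path.size(); i++) {
        if (module(path.get(i)).equals(module(dep))) {
          List<String> cycle = new ArrayList<>(path.subList(i, path.size()));
          cycle.add(dep);
          return cycle;
        }
      }
      path.add(dep);
      List<String> cycle = dfs(dep, byModule, path, done);
      if (!cycle.isEmpty()) {
        return cycle;
      }
      path.remove(path.size() - 1);
    }
    done.add(mod);
    return Collections.emptyList();
  }
}
```

          Feeding it the two edges from the message (dom4j 1.6.1 depends on jaxen 1.1-beta-6, which depends on dom4j 1.5.2) and joining the result with "->" reproduces the reported path.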
          

          In addition, commons-compress was updated to 1.8.1; it is used in other places, too. I hope this does not conflict with any Solr-internal code.

          Uwe Schindler added a comment - - edited

          The following libs are missing currently:

          • java-libpst-0.8.1
          • jcip-annotations-1.0 (not needed, only for compile)
          • jmatio-1.0
          • unidataCommon-4.2.20 (used by netcdf, removed because LGPL)

          I will look up their licenses and add them.

          Uwe Schindler added a comment - - edited

          This is the full dependency list of tika-parsers:

          [INFO]    org.apache.james:apache-mime4j-core:jar:0.7.2:compile
          [INFO]    org.apache.james:apache-mime4j-dom:jar:0.7.2:compile
          [INFO]    org.aspectj:aspectjrt:jar:1.8.0:compile
          [INFO]    org.apache.pdfbox:fontbox:jar:1.8.6:compile
          [INFO]    net.jcip:jcip-annotations:jar:1.0:compile
          [INFO]    org.apache.pdfbox:jempbox:jar:1.8.6:compile
          [INFO]    com.drewnoakes:metadata-extractor:jar:2.6.2:compile
          [INFO]    commons-logging:commons-logging:jar:1.1.1:compile
          [INFO]    com.uwyn:jhighlight:jar:1.0:compile
          [INFO]    org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
          [INFO]    org.bouncycastle:bcprov-jdk15:jar:1.45:compile
          [INFO]    org.gagravarr:vorbis-java-core:jar:0.6:compile
          [INFO]    com.googlecode.mp4parser:isoparser:jar:1.0.2:compile
          [INFO]    edu.ucar:unidataCommon:jar:4.2.20:compile
          [INFO]    org.apache.poi:poi:jar:3.11-beta2:compile
          [INFO]    org.apache.poi:poi-ooxml-schemas:jar:3.11-beta2:compile
          [INFO]    net.sourceforge.jmatio:jmatio:jar:1.0:compile
          [INFO]    commons-httpclient:commons-httpclient:jar:3.1:compile
          [INFO]    org.apache.pdfbox:pdfbox:jar:1.8.6:compile
          [INFO]    com.pff:java-libpst:jar:0.8.1:compile
          [INFO]    com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
          [INFO]    org.apache.poi:poi-ooxml:jar:3.11-beta2:compile
          [INFO]    edu.ucar:netcdf:jar:4.2.20:compile
          [INFO]    de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
          [INFO]    org.slf4j:slf4j-api:jar:1.6.1:compile
          [INFO]    commons-codec:commons-codec:jar:1.9:compile
          [INFO]    rome:rome:jar:1.0:compile
          [INFO]    org.gagravarr:vorbis-java-tika:jar:0.6:compile
          [INFO]    org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
          [INFO]    org.ow2.asm:asm-debug-all:jar:4.1:compile
          [INFO]    com.adobe.xmp:xmpcore:jar:5.1.2:compile
          [INFO]    org.apache.commons:commons-compress:jar:1.8.1:compile
          [INFO]    org.bouncycastle:bcmail-jdk15:jar:1.45:compile
          [INFO]    org.apache.tika:tika-core:jar:1.7-SNAPSHOT:compile
          [INFO]    jdom:jdom:jar:1.0:compile
          [INFO]    org.apache.poi:poi-scratchpad:jar:3.11-beta2:compile
          [INFO]    xml-apis:xml-apis:jar:1.3.03:compile
          [INFO]    xerces:xercesImpl:jar:2.8.1:compile
          [INFO]    org.tukaani:xz:jar:1.5:compile
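
          A list like this can be checked against the bundled jars mechanically. The following is a rough sketch, assuming each line has the form `groupId:artifactId:type:version:scope` (as the lines above appear to) and that bundled jars are named `artifactId-version`, as in the missing-libs list earlier. The class `MissingJarsSketch` and the `bundled` set are hypothetical; this is not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the Solr build. */
final class MissingJarsSketch {

  /**
   * Turns "[INFO]    com.pff:java-libpst:jar:0.8.1:compile" into "java-libpst-0.8.1".
   * Returns null for lines that do not have exactly five colon-separated fields.
   */
  static String jarName(String line) {
    String s = line.trim();
    if (s.startsWith("[INFO]")) {
      s = s.substring("[INFO]".length()).trim();
    }
    String[] parts = s.split(":");
    if (parts.length != 5) {
      return null; // e.g. blank lines, or coordinates with a classifier
    }
    return parts[1] + "-" + parts[3]; // artifactId-version
  }

  /** Lists the jars named in the dependency lines that are absent from the bundled set. */
  static List<String> missing(List<String> dependencyLines, Set<String> bundled) {
    List<String> result = new ArrayList<>();
    for (String line : dependencyLines) {
      String jar = jarName(line);
      if (jar != null && !bundled.contains(jar)) {
        result.add(jar);
      }
    }
    return result;
  }
}
```

          Lines that carry a classifier have six fields and are skipped by this sketch; a real check would need to handle them, and would still need a manual license review for each missing jar.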
          
          Uwe Schindler added a comment - - edited

          tika-xmp is completely unused; including it was a bug, so it can be removed. It has nothing to do with the adobe-xmp stuff.

          Somehow morphlines needs it.

          Uwe Schindler added a comment -

          New patch with missing dependencies and their licenses/notices.

          I removed support for netcdf (which was already incomplete before), because the netcdf.jar file contains LGPL licensed code (see TIKA-763 and TIKA-766).

          The only remaining thing is the crazy new metadata field (X-Parsed-By), which makes the tests fail. I will investigate tomorrow.

          Uwe Schindler added a comment -

          X-Parsed-By is a new metadata field in TIKA that contains the TIKA parser class used. I will add the missing schema fields: TIKA-674

          Uwe Schindler added a comment -

          Final patch. All tests pass.

          ASF subversion and git services added a comment -

          Commit 1623225 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1623225 ]

          SOLR-6488: Upgrade Solr Cell to TIKA 1.6

          ASF subversion and git services added a comment -

          Commit 1623227 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1623227 ]

          Merged revision(s) 1623225 from lucene/dev/trunk:
          SOLR-6488: Upgrade Solr Cell to TIKA 1.6

          ASF subversion and git services added a comment -

          Commit 1623308 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1623308 ]

          SOLR-6489: Disable Morphlines-Cell tests, because Update to Tika 1.6 (SOLR-6488) broke them

          ASF subversion and git services added a comment -

          Commit 1623309 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1623309 ]

          Merged revision(s) 1623308 from lucene/dev/trunk:
          SOLR-6489: Disable Morphlines-Cell tests, because Update to Tika 1.6 (SOLR-6488) broke them

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler