Details

      Description

      Apache TIKA 1.7 was released: https://dist.apache.org/repos/dist/release/tika/CHANGES-1.7.txt

      This is more or less a dependency update, so just JAR replacements. I'm not sure whether we should do this for 5.0. 5.0 currently has the previous Tika version, which has not yet been released with Solr. If we bring 1.7 into 5.0 now, we avoid shipping a new Tika version in two consecutive releases. I can make the change this evening and let it bake in 5.x, and then maybe we backport it.

      1. SOLR-6991.patch
        18 kB
        Uwe Schindler
      2. SOLR-6991.patch
        11 kB
        Uwe Schindler
      3. SOLR-6991-forkfix.patch
        2 kB
        Steve Rowe
      4. SOLR-6991-forkfix.patch
        1 kB
        Uwe Schindler

          Activity

          Anshum Gupta added a comment -

          Uwe, did you recommend upgrading Tika for 5.0? If someone can do it and no one minds, I'm actually OK with it, especially if it's just a drop-in.

          Uwe Schindler added a comment -

          I will check it out tomorrow morning. If it is not just a plug-in replacement, I'll leave it for now. So don't stop the release process if you are working on it.

          Uwe Schindler added a comment - - edited

          Here is the patch. It just updates dependencies. I am currently running tests, but it looks fine. I will check whether the SOLR-6489 tests are still not working, but that does not really affect this update.
          This also adds the tika-java7 SPI library to support filetype detection using bundled Java 7 tools.

          Uwe Schindler added a comment - - edited

          Patch that updates CHANGES.txt and versions.

          Anshum Gupta: It is up to you: should I commit to the 5.0 branch, too? One important thing: it uses a non-beta version of Apache POI, so I have a better feeling about it.

          I will check the morphlines again and then commit to trunk and 5.x.

          ASF subversion and git services added a comment -

          Commit 1652742 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1652742 ]

          SOLR-6991: Update to Apache TIKA 1.7

          ASF subversion and git services added a comment -

          Commit 1652743 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1652743 ]

          SOLR-6991: Apply correct sorting

          ASF subversion and git services added a comment -

          Commit 1652745 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1652745 ]

          SOLR-6991: Update to Apache TIKA 1.7

          Uwe Schindler added a comment -

          I committed to trunk and branch_5x. I leave the issue open to wait for Anshum Gupta.

          ASF subversion and git services added a comment -

          Commit 1652783 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1652783 ]

          SOLR-6991: Add missing license and notice. Remove outdated stuff from notice files.

          ASF subversion and git services added a comment -

          Commit 1652784 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1652784 ]

          Merged revision(s) 1652783 from lucene/dev/trunk:
          SOLR-6991: Add missing license and notice. Remove outdated stuff from notice files.

          Anshum Gupta added a comment - - edited

          Thanks for doing that Uwe Schindler.
          Considering that Tika 1.7 uses a non-beta version of Apache POI and also that ODF parsing in Tika 1.6 is actually broken, i.e. throws exceptions for any ODF doc while Tika 1.7 fixes that, I think we should go with 1.7 on 5.0.

          Uwe Schindler added a comment -

          OK, I'll backport!

          ASF subversion and git services added a comment -

          Commit 1652831 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1652831 ]

          SOLR-6991: Move changes

          ASF subversion and git services added a comment -

          Commit 1652832 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1652832 ]

          Merged revision(s) 1652831 from lucene/dev/trunk:
          SOLR-6991: Move changes

          ASF subversion and git services added a comment -

          Commit 1652834 from Uwe Schindler in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1652834 ]

          Merged revision(s) 1652742-1652743, 1652783, 1652831 from lucene/dev/trunk:
          SOLR-6991: Update to Apache TIKA 1.7
          SOLR-6991: Apply correct sorting
          SOLR-6991: Add missing license and notice. Remove outdated stuff from notice files.
          SOLR-6991: Move changes

          Uwe Schindler added a comment -

          OK. I backported. I'll trigger a smoker build to be sure all is fine.

          Uwe Schindler added a comment -

          also that ODF parsing in Tika 1.6 is actually broken, i.e. throws exceptions for any ODF doc while Tika 1.7 fixes that

          I assume you mean this one: TIKA-1412. Good catch. We actually have no test documents in ODF format in contrib/extraction. We should add one; would you open an issue? I can check that out.

          Uwe Schindler added a comment -

          Smoke tester of 5.0 was happy.

          Anshum Gupta added a comment -

          Thanks for taking that up Uwe. I've created SOLR-6996 for adding an ODF doc test for contrib/extraction.

          Steve Rowe added a comment -

          Reopening to address this Mac OS X failure in solr-cell:

          Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-MacOSX/1943/
          Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC
          
          1 tests failed.
          FAILED:  org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testXPath
          
          ...
          
            [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=ExtractingRequestHandlerTest -Dtests.method=testXPath -Dtests.seed=58A6FBEB77E81527 -Dtests.slow=true -Dtests.locale=tr_TR -Dtests.timezone=Etc/GMT+3 -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
            [junit4] ERROR   2.57s | ExtractingRequestHandlerTest.testXPath <<<
            [junit4]    > Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
            [junit4]    > 	at __randomizedtesting.SeedInfo.seed([58A6FBEB77E81527:26F735786F5F7761]:0)
            [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
            [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
            [junit4]    > 	at java.security.AccessController.doPrivileged(Native Method)
            [junit4]    > 	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
            [junit4]    > 	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
            [junit4]    > 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
            [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:620)
            [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:485)
            [junit4]    > 	at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
            [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
            [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
            [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
            [junit4]    > 	at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
            [junit4]    > 	at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
            [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
            [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
            [junit4]    > 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
            [junit4]    > 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
            [junit4]    > 	at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
            [junit4]    > 	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
            [junit4]    > 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
            [junit4]    > 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
            [junit4]    > 	at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:353)
            [junit4]    > 	at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocalFromHandler(ExtractingRequestHandlerTest.java:703)
            [junit4]    > 	at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:710)
            [junit4]    > 	at org.apache.solr.handler.extraction.ExtractingRequestHandlerTest.testXPath(ExtractingRequestHandlerTest.java:474)
            [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          
          Steve Rowe added a comment -

          I can reproduce this on OS X 10.10 using Oracle JDK 1.8.0_20.

          When I revert back to r1652741 (just before the first commit under this issue), all solr-cell tests pass using the following (same thing that fails 100% for me with current trunk):

          ant clean
          cd solr/contrib/extraction
          ant test -Dtests.slow=true -Dtests.locale=tr_TR
          
          Hoss Man added a comment -

          TIKA-93 introduced the TesseractOCRParser, and TIKA-1476 enabled it as a default parser.

          that combination means that the first time Tika is used in Solr, the TesseractOCRParser will be checked to see if the system "hasTesseract" installed to know if that parser should be consulted – and when that happens, ExternalParser.check is used which calls Runtime.exec and blows up in turkish locale.


          possible resolutions i can think of:

          • change how we init Tika to prevent this parser from ever being used (override the list of autodetected parsers?)
          • change how we include tika jars/defaults to prevent this parser from ever being used (override the default tesseract properties file in the jar somehow maybe?)
          • rollback to tika 1.6
          • punt and advise turkish users to run their jvm in en_US ?
          Uwe Schindler added a comment -

          This is in fact the problem with spawning external processes. This is not new; TIKA 1.6 also had parsers that spawned processes. I think we never hit this before because this one is different: the parser spawns a process while initializing (to "inspect the system"). The other spawning parsers are only executed as needed. ExternalParser has existed in TIKA for a long time.

          I would not roll back to TIKA 1.5 because the new TIKA is much better than this one (regarding bugs). In fact we should maybe disable these tests with the well-known assume (trunk, 5.x, 5.0). I would also suggest adding a note to the ref guide, so people know what this means. This is unfortunately a bug in the JVM, so this is not really our or TIKA's fault.

          In fact, as written in my blog post about Locale issues: most Turkish system administrators don't run servers with the Turkish locale. It's just too broken with lots of software.
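
To illustrate the kind of JVM behavior behind this error, here is a minimal sketch. It assumes the failure comes from a locale-sensitive upper-casing of the launch mechanism name before an enum lookup. That would match the "posix_spawn is not a supported process launch mechanism" message in the stack trace, but it is not confirmed anywhere in this thread. The `LaunchMechanism` enum and the `resolves` helper are hypothetical stand-ins, not the JDK's actual code.

```java
import java.util.Locale;

final class TurkishUpperCaseDemo {

    // Hypothetical stand-in for an internal enum of process launch mechanisms.
    enum LaunchMechanism { FORK, POSIX_SPAWN, VFORK }

    // Upper-cases the configured name with the given locale and tries to
    // resolve it against the enum; returns false when the lookup fails.
    static boolean resolves(String name, Locale locale) {
        try {
            LaunchMechanism.valueOf(name.toUpperCase(locale));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Locale-neutral upper-casing yields "POSIX_SPAWN".
        System.out.println(resolves("posix_spawn", Locale.ROOT));
        // In Turkish, 'i' upper-cases to the dotted capital I (U+0130),
        // so the result no longer matches the enum constant.
        System.out.println(resolves("posix_spawn", Locale.forLanguageTag("tr-TR")));
    }
}
```

Under this assumption the Tika parser is only the trigger: any code path that spawns a process would hit the same lookup when the JVM's default locale is Turkish.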

          Uwe Schindler added a comment -

          In fact you can select parsers using a config file / Set<String>. But this makes updating horrible, because we have to revisit the list on each TIKA update...

          Uwe Schindler added a comment -

          This disables the test... Just copypasted from map-reduce/morphlines/....

          In fact this is not TIKA's issue and not new, a lot of stuff around Hadoop in Solr fails with Turkish!
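
A minimal sketch of the kind of check such an assume can be built on. The exact condition used in the patch is not shown in this thread, so the class name and the `isTurkish` helper are hypothetical. In a JUnit 4 test the result would typically be passed to `Assume.assumeFalse(String, boolean)`, so that the test is reported as skipped rather than failed.

```java
import java.util.Locale;

final class TurkishLocaleGuard {

    // True when the given locale uses the Turkish language,
    // regardless of country or variant.
    static boolean isTurkish(Locale locale) {
        return "tr".equals(locale.getLanguage());
    }

    public static void main(String[] args) {
        // A test setup method would check the JVM default locale,
        // which the test runner sets from -Dtests.locale.
        if (isTurkish(Locale.getDefault())) {
            System.out.println("skip: process spawning is unreliable with a Turkish default locale");
        } else {
            System.out.println("run");
        }
    }
}
```

Comparing only the language code keeps the guard independent of the country part, so both `tr` and `tr_TR` are covered.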

          Uwe Schindler added a comment -

          FYI: SolrCellMorphlineTest is already disabled by the same assume, so this is the only broken one.

          Hoss Man added a comment -

          In fact this is not TIKA's issue and not new, a lot of stuff around Hadoop in Solr fails with Turkish!

          ...my point is: it's new to Solr.

          in all other cases where POSIX_SPAWN impacts Solr, we either:

          • deal with it in the solr code, so we give a meaningful error to the user explaining the problem (ie: SystemInfoHandler)
          • it's in an optional feature that NEVER worked with turkish – ie: the hadoop / morphlines contribs, from the first version it was available in Solr, would not work with turkish locale

          ...in this case, we're talking about an existing solr feature, that has previously worked fine if you run older Solr with turkish, and now when upgrading to 5.0 you're going to get a weird error message.

          if there's nothing better we can do to keep the ExtractingRequestHandler working for users who upgrade (even if they run with turkish) then i'm fine with assumes in the tests and notes in the docs ... i was just hoping you'd have a better idea.

          in particular: I'm still wondering if we can leverage the classpath in a way to override the "default" TesseractOCRConfig.properties file in the tika-parsers jar with our own version that prevents tesseract from being used. (i agree it's not worth switching to explicitly whitelisting the parsers in Solr code, but is there an easy way to blacklist this parser and/or other parsers we know are problematic?)

          Uwe Schindler added a comment -

          Hi,
          I checked the code. The problem is: you cannot disable it by config (because it always tries to execute the command that's part of the default config file). If the config file is not there, then it runs TESSERACT without any path.

          The only way to work around is:

          • Disable the whole parser (f*ck, because then we need to maintain our own parser list internally). There is no way to tell TIKA to exclude some parsers (something like AutoDetectParser#disableParser(name/class/whatever)).
          • Use a hack with reflection to make TesseractOCRParser#TESSERACT_PRESENT return false for any path... Just replace the static map with one that returns false for any key (LOL) and ignores any put().
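
A minimal sketch of what the reflection hack in the second bullet could look like. It does not touch Tika itself: `FakeOcrParser`, its `TOOL_PRESENT` field and the `probe` method are hypothetical stand-ins for a parser that caches "is the tool installed?" per path in a private static map, and the sketch assumes that field is not `final`.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

final class AlwaysFalseMapDemo {

    /** Hypothetical stand-in for a parser that caches tool detection per path. */
    static final class FakeOcrParser {
        private static Map<String, Boolean> TOOL_PRESENT = new HashMap<>();

        static boolean hasTool(String path) {
            if (TOOL_PRESENT.containsKey(path)) {
                return TOOL_PRESENT.get(path);
            }
            boolean present = probe(path);
            TOOL_PRESENT.put(path, present);
            return present;
        }

        // Simulates the process spawn that blows up under a Turkish default locale.
        private static boolean probe(String path) {
            throw new Error("spawning " + path + " is not supported here");
        }
    }

    /** Claims to contain every key, always answers false, and drops writes. */
    static final class AlwaysFalseMap extends HashMap<String, Boolean> {
        @Override public boolean containsKey(Object key) { return true; }
        @Override public Boolean get(Object key) { return Boolean.FALSE; }
        @Override public Boolean put(String key, Boolean value) { return null; }
    }

    // Swaps the private static cache for the always-false map via reflection.
    static void install() {
        try {
            Field field = FakeOcrParser.class.getDeclaredField("TOOL_PRESENT");
            field.setAccessible(true);
            field.set(null, new AlwaysFalseMap());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        install();
        // The cache now short-circuits the lookup, so probe() is never reached.
        System.out.println(FakeOcrParser.hasTool("tesseract")); // prints "false"
    }
}
```

The hack depends on a private field name and on the lookup consulting `containsKey` before probing, so it would have to be re-checked on every library update.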
          Uwe Schindler added a comment -

          One trick could work:
          TIKA always prefers "external" parsers loaded by SPI. The trick here would be to add a /META-INF/services/... file that lists a subclass of the Tesseract parser that just always returns "no supported media types". TIKA would use our subclass in preference to the one shipped. That way we could disable the parser. I have not checked this, but it would be another hack (that I don't like either).

          Uwe Schindler added a comment -

          The last comment was just an idea, but doesn't work. The problem here is that initialization of the parser fails, so it will always call TesseractOCRParser.getSupportedTypes()...

          Hoss Man added a comment -

          The last comment was just an idea, but doesn't work. ...

          you fought a good fight uwe, but alas...

          +1 to your SOLR-6991-forkfix.patch for 5.0 .. but don't we need similar assumes in dataimporthandler-extras tests that use TikaEntityProcessor? (i'm not sure why those wouldn't fail with turkish now as well)

          Steve Rowe added a comment -

          don't we need similar assumes in dataimporthandler-extras tests that use TikaEntityProcessor? (i'm not sure why those wouldn't fail with turkish now as well)

          I ran ant test -Dtests.slow=true -Dtests.locale=tr_TR in solr/contrib/dataimporthandler-extras/, and got the following failure:

             [junit4] Suite: org.apache.solr.handler.dataimport.TestTikaEntityProcessor
             [junit4]   2> Creating dataDir: /Users/sarowe/svn/lucene/dev/trunk2/solr/build/contrib/solr-dataimporthandler-extras/test/J0/temp/solr.handler.dataimport.TestTikaEntityProcessor 9123B7DE098A1C98-001/init-core-data-001
             [junit4]   2> log4j:WARN No appenders could be found for logger (org.apache.solr.SolrTestCaseJ4).
             [junit4]   2> log4j:WARN Please initialize the log4j system properly.
             [junit4]   2> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestTikaEntityProcessor -Dtests.method=testTikaHTMLMapperIdentity -Dtests.seed=9123B7DE098A1C98 -Dtests.slow=true -Dtests.locale=tr_TR -Dtests.timezone=America/Toronto -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
             [junit4] ERROR   0.93s J0 | TestTikaEntityProcessor.testTikaHTMLMapperIdentity <<<
             [junit4]    > Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([9123B7DE098A1C98:C15C334FC0BEE965]:0)
             [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
             [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
             [junit4]    > 	at java.security.AccessController.doPrivileged(Native Method)
             [junit4]    > 	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
             [junit4]    > 	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
             [junit4]    > 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
             [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:620)
             [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:485)
             [junit4]    > 	at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
             [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
             [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
             [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
             [junit4]    > 	at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
             [junit4]    > 	at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
             [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
             [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
             [junit4]    > 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
             [junit4]    > 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
             [junit4]    > 	at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:141)
             [junit4]    > 	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
             [junit4]    > 	at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:189)
             [junit4]    > 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
             [junit4]    > 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006)
             [junit4]    > 	at org.apache.solr.util.TestHarness.query(TestHarness.java:331)
             [junit4]    > 	at org.apache.solr.handler.dataimport.AbstractDataImportHandlerTestCase.runFullImport(AbstractDataImportHandlerTestCase.java:86)
             [junit4]    > 	at org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperIdentity(TestTikaEntityProcessor.java:99)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
          
          Steve Rowe added a comment -

          This version of the patch adds Uwe's assume to dataimporthandler-extras' TestTikaEntityProcessor.

          I'm running all Solr tests now with this patch and -Dtests.slow=true -Dtests.locale=tr_TR.

          Uwe Schindler added a comment -

          OK. I did not know that dataimporthandler-extras also calls TIKA...

          Uwe Schindler added a comment -

          Ah, you already posted a patch. Thanks for testing. I only have Windows ready to use on my laptop.

          Uwe Schindler added a comment -

          Steve Rowe: Can you commit to all 3 branches? I want to go to sleep. Thanks.

          Steve Rowe added a comment -

          Steve Rowe: Can you commit to all 3 branches? I want to go to sleep. Thanks.

          Will do.

          Steve Rowe added a comment -

          I'm running all Solr tests now with this patch and -Dtests.slow=true -Dtests.locale=tr_TR.

          All Solr tests passed with the patch.

          Committing now.

          ASF subversion and git services added a comment -

          Commit 1653704 from Use account "steve_rowe" instead in branch 'dev/trunk'
          [ https://svn.apache.org/r1653704 ]

          SOLR-6991,SOLR-6387: Under Turkish locale, don't run solr-cell and dataimporthandler-extras tests that use Tika

          ASF subversion and git services added a comment -

          Commit 1653706 from Use account "steve_rowe" instead in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1653706 ]

          SOLR-6991,SOLR-6387: Under Turkish locale, don't run solr-cell and dataimporthandler-extras tests that use Tika (merged trunk r1653704)

          ASF subversion and git services added a comment -

          Commit 1653708 from Use account "steve_rowe" instead in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1653708 ]

          SOLR-6991,SOLR-6387: Under Turkish locale, don't run solr-cell and dataimporthandler-extras tests that use Tika (merged trunk r1653704)

          Anshum Gupta added a comment -

          Thanks for fixing this everyone!

          Uwe Schindler added a comment -

          Hoss: One idea to fix the whole thing:
          In Solr's startup code, before any threads are spawned, i.e. in the main() method of the startup class (do we have one now, or do we still use Jetty's start.jar?), we could do something like what is suggested here:

          https://issues.apache.org/jira/browse/TIKA-1526?focusedCommentId=14289182&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14289182

          It is important to do this early, at a time when changing the global locale setting cannot affect anything else running in the JVM. So definitely not in the webapp itself (that's too late):

          • Check for the locale. We do it like this: new Locale("tr").getLanguage().equals(Locale.getDefault().getLanguage()) (it is important to do the check this way, because otherwise it is not guaranteed to work, especially in newer Java versions!)
          • If it is such a locale, save the original and switch to Locale.ROOT while the environment is still single-threaded (this is why it should be in the main launcher).
          • Execute a fake UNIX command, like /bin/true. You can also execute some non-existent command that just fails. The call is only there to statically initialize the broken UNIXProcess class. Once it has been initialized correctly, it keeps working.
          • Switch back to the saved locale.
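
          The four steps above could be sketched roughly as follows. This is only an illustration under assumptions, not code from any patch on this issue: the class name ProcessWarmup and its method names are hypothetical, and it assumes the launcher calls it from main() before any other thread exists.

```java
import java.io.IOException;
import java.util.Locale;

/** Hypothetical launcher helper; the class and method names are illustrative only. */
public final class ProcessWarmup {
    private ProcessWarmup() {}

    /** Locale check exactly as quoted in the comment above. */
    public static boolean isTurkish(Locale locale) {
        return new Locale("tr").getLanguage().equals(locale.getLanguage());
    }

    /**
     * Launches a throwaway process under Locale.ROOT so that the JVM's
     * process-launching classes are statically initialized before the
     * Turkish default locale is restored. Returns true if the workaround ran.
     * Must be called while the JVM is still single-threaded.
     */
    public static boolean warmUpUnderRootLocale() {
        Locale saved = Locale.getDefault();
        if (!isTurkish(saved)) {
            return false; // nothing to work around
        }
        Locale.setDefault(Locale.ROOT);
        try {
            // The command only has to be attempted; its result is irrelevant.
            Process p = Runtime.getRuntime().exec(new String[] {"/bin/true"});
            p.waitFor();
        } catch (IOException e) {
            // A missing command is acceptable: the launch attempt already happened.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            Locale.setDefault(saved); // switch back to the saved locale
        }
        return true;
    }

    public static void main(String[] args) {
        boolean applied = warmUpUnderRootLocale();
        System.out.println("workaround applied: " + applied);
        // ... the real startup would continue here ...
    }
}
```

          The finally block restores the saved locale even if the throwaway command cannot be started, so the rest of the JVM never observes Locale.ROOT.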
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0
              Watchers:
              3
