Tika / TIKA-1526

ExternalParser should trap/ignore/workaround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers



    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed


      The JDK has numerous pain points regarding the Turkish locale, "posix_spawn" lowercasing being one of them...
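      For anyone unfamiliar with the underlying problem, here is a minimal sketch of why locale-sensitive case conversion breaks a name like "posix_spawn" under a Turkish locale. The class and method names (TurkishCaseDemo, upper, lower) are made up for illustration, and the exact call inside the JDK is not shown in this report, so both directions are demonstrated:

```java
import java.util.Locale;

public class TurkishCaseDemo {
    // Turkish locale: 'i' upper-cases to dotted 'İ' (U+0130),
    // and 'I' lower-cases to dotless 'ı' (U+0131).
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Hypothetical helpers, named only for this example.
    static String upper(String s) {
        return s.toUpperCase(TURKISH);
    }

    static String lower(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // Neither conversion round-trips to the plain ASCII spelling.
        System.out.println(upper("posix_spawn").equals("POSIX_SPAWN")); // false
        System.out.println(lower("POSIX_SPAWN").equals("posix_spawn")); // false

        // A locale-independent conversion behaves as most code expects.
        System.out.println("posix_spawn".toUpperCase(Locale.ROOT).equals("POSIX_SPAWN")); // true
    }
}
```

      Any code that case-converts an identifier with the default locale and then compares it against an ASCII constant can fail this way when the default locale is Turkish.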


      As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled & configured by default in Tika, and uses ExternalParser.check to see if tesseract is available – but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...

        [junit4]    > Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
        [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
        [junit4]    > 	at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
        [junit4]    > 	at java.security.AccessController.doPrivileged(Native Method)
        [junit4]    > 	at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
        [junit4]    > 	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        [junit4]    > 	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:620)
        [junit4]    > 	at java.lang.Runtime.exec(Runtime.java:485)
        [junit4]    > 	at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
        [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
        [junit4]    > 	at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
        [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
        [junit4]    > 	at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
        [junit4]    > 	at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
        [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
        [junit4]    > 	at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
        [junit4]    > 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        [junit4]    > 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

      ...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed.

      It would be nice if Tika's ExternalParser class added a hack/workaround similar to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...

      } catch (Error err) {
        if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
          log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
          return "(error executing: " + cmd + ")";

      ...but with Tika, it might be better for all ExternalParsers to just "opt out", as if they don't recognize the file type, when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)
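      To make the "opt out" idea concrete, here is a rough sketch of an availability check that treats the locale-related Error as "tool not available" instead of letting it propagate. This is not Tika's actual ExternalParser code; the class name SafeExternalCheck and the helper isLocaleSpawnBug are hypothetical, and the message matching simply mirrors the Solr snippet above. The "UNIXProcess" match is presumably there because, once the static initializer has failed, later launch attempts tend to surface as a NoClassDefFoundError naming that class.

```java
import java.io.IOException;
import java.io.InputStream;

public class SafeExternalCheck {

    // Hypothetical helper: does this Throwable look like the JDK locale bug?
    // Same message test as the Solr workaround quoted above.
    static boolean isLocaleSpawnBug(Throwable t) {
        String msg = t.getMessage();
        return msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Returns true only if the command could be launched and did not exit
    // with one of the given error codes. Any failure to launch, including
    // the locale-related Error, is reported as "not available".
    static boolean check(String[] cmd, int... errorValues) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // discard
                }
            }
            int exit = p.waitFor();
            for (int bad : errorValues) {
                if (exit == bad) {
                    return false;
                }
            }
            return true;
        } catch (IOException e) {
            return false; // command missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                return false; // opt out rather than fail the whole parse
            }
            throw err; // unrelated Errors still propagate
        }
    }
}
```

      With something along these lines, a parser whose check fails this way would simply report no supported types, and the remaining non-external parsers would keep working.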





              Assignee: Unassigned
              Reporter: Chris M. Hostetter (hossman)