In fact this is not TIKA's issue and not new, a lot of stuff around Hadoop in Solr fails with Turkish!
...my point is: it's new to Solr.
in all other cases where POSIX_SPAWN impacts Solr, one of two things is true:
- we deal with it in the solr code, so we give a meaningful error to the user explaining the problem (ie: SystemInfoHandler)
- it's in an optional feature that NEVER worked with a turkish locale (ie: the hadoop / morphlines contribs, which have not worked with a turkish locale since the first version in which they were available in Solr)
...in this case, we're talking about an existing solr feature that previously worked fine if you ran an older Solr with a turkish locale, and now when upgrading to 5.0 you're going to get a weird error message.
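for anyone following along, the underlying trap can be sketched in a few lines of Java. this is only an illustration under assumptions: the `LaunchMechanism` enum below is a hypothetical stand-in (not the JDK's actual code) and `resolves` is a helper name invented for the example. it shows that upper-casing `posix_spawn` under a turkish locale produces a dotted capital İ, so a lookup by name no longer matches `POSIX_SPAWN`.

```java
import java.util.Locale;

public class TurkishCaseDemo {
    // Hypothetical stand-in for any enum that is looked up by upper-cased name.
    enum LaunchMechanism { FORK, POSIX_SPAWN, VFORK }

    /** Returns true if the name, upper-cased with the given locale, matches a constant. */
    public static boolean resolves(String name, Locale locale) {
        try {
            LaunchMechanism.valueOf(name.toUpperCase(locale));
            return true;
        } catch (IllegalArgumentException e) {
            // In a Turkish locale 'i' upper-cases to U+0130, so no constant matches.
            return false;
        }
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(resolves("posix_spawn", Locale.ROOT)); // prints "true"
        System.out.println(resolves("posix_spawn", turkish));     // prints "false"
    }
}
```

code that passes an explicit locale such as `Locale.ROOT` when changing case avoids the mismatch, regardless of the JVM's default locale.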
if there's nothing better we can do to keep the ExtractionRequestHandler working for users who upgrade (even if they run with a turkish locale) then i'm fine with assumes in the tests and notes in the docs ... i was just hoping you'd have a better idea.
in particular: I'm still wondering if we can use the classpath to override the "default" TesseractOCRConfig.properties file in the tika-parsers jar with our own version that prevents tesseract from being used. (i agree it's not worth switching to explicitly whitelisting the parsers in Solr code, but is there an easy way to blacklist this parser and/or other parsers we know are problematic?)
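to make the classpath idea concrete, here is a rough sketch of the mechanism only. it assumes the properties file is loaded as a classpath resource under the `org/apache/tika/parser/ocr/` package path, which i have not verified against the tika source. the `demo.source` key, the class name, and both helper methods are made up for the illustration; it says nothing about which real property (if any) would actually keep tesseract from being invoked. two temp directories stand in for "our override" and "the tika-parsers jar", and the one listed first wins.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ClasspathShadowDemo {
    // ASSUMPTION: resource path mirrors the package of Tika's TesseractOCRConfig.
    static final String RESOURCE =
        "org/apache/tika/parser/ocr/TesseractOCRConfig.properties";

    /** Creates a temp directory containing RESOURCE with the given body. */
    public static Path dirWith(String body) {
        try {
            Path root = Files.createTempDirectory("cp-demo");
            Path file = root.resolve(RESOURCE);
            Files.createDirectories(file.getParent());
            Files.write(file, body.getBytes(StandardCharsets.ISO_8859_1));
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loads RESOURCE through a classloader that searches the roots in order. */
    public static String lookup(String key, Path... roots) {
        try {
            URL[] urls = new URL[roots.length];
            for (int i = 0; i < roots.length; i++) {
                urls[i] = roots[i].toUri().toURL();
            }
            // Parent is null so only the directories above are consulted.
            try (URLClassLoader loader = new URLClassLoader(urls, null);
                 InputStream in = loader.getResourceAsStream(RESOURCE)) {
                Properties props = new Properties();
                props.load(in);
                return props.getProperty(key);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path override = dirWith("demo.source=override\n");   // stands in for our copy
        Path jarLike = dirWith("demo.source=jar-default\n"); // stands in for the jar
        System.out.println(lookup("demo.source", override, jarLike)); // prints "override"
    }
}
```

the catch is that this depends entirely on ordering: it only helps if our copy is guaranteed to come before the tika-parsers jar wherever the ExtractionRequestHandler's classes get loaded.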