Description
Tika 2 has already seen its second release (2.1.0). According to the 2.0 release notes and the migration guide:
- Tika 2 is more modular, which should allow us to build a smaller parse-tika plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users should still be able to include them if they build Nutch from source.
- the language-identifier plugin needs to be upgraded as well (in addition to Nutch core and the parse-tika plugin). This would include or overlap with NUTCH-2449.
- to keep the PDF parser from timing out we probably want to disable OCR by default or, at least, provide a configuration snippet for this purpose
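A possible starting point for the OCR configuration snippet mentioned above, as a sketch only: a custom tika-config.xml that excludes the Tesseract OCR parser from Tika's default parser set. It is an assumption that excluding this parser is enough to avoid the PDF timeouts described here, and how the file is passed to the parse-tika plugin is not covered.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a tika-config.xml that disables OCR (assumption: the
     timeouts are caused by the Tesseract OCR parser being invoked). -->
<properties>
  <parsers>
    <!-- Keep all default parsers except the Tesseract-based OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With this exclusion, images embedded in PDFs would no longer be sent to Tesseract, so text that exists only in scanned images would not be extracted.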
Issue Links
- incorporates: NUTCH-2449 Usage of Tika LanguageIdentifier in language-identifier plugin (Closed)
- supersedes: NUTCH-2860 Upgrade to Tika 1.26 (Closed)
- links to