Tika 2 has already seen its second release (2.1.0). Based on the 2.0 release notes and the migration guide:
- Tika 2 is more modular, which should allow us to build a smaller parse-tika plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users should still be able to include them if they build Nutch from source.
- The language-identifier plugin needs to be upgraded as well (in addition to Nutch core and the parse-tika plugin). This would include or overlap with NUTCH-2449.
- To keep the PDF parser from timing out, we probably want to disable OCR by default or, at the very least, provide a configuration snippet for this purpose.
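
For the OCR point, a possible starting point is a Tika configuration file that excludes the Tesseract OCR parser from the default parser set. This is only a sketch: it assumes the OCR parser class is still named `org.apache.tika.parser.ocr.TesseractOCRParser` in Tika 2, and how (or whether) Nutch would ship or load such a file with parse-tika is left open here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a tika-config.xml that turns OCR off (assumption: Tika 2.x class names). -->
<properties>
  <parsers>
    <!-- Keep all default parsers, but leave out the Tesseract-based OCR parser. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With OCR excluded this way, the PDF parser should have no OCR parser to hand page images to, which is the step most likely to run into the parse timeout. Whether this is preferable to setting the PDF parser's OCR strategy directly would need to be checked against the Tika 2 documentation.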