  1. Nutch
  2. NUTCH-2891

Upgrade to Tika 2.1

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: 1.19
    • Component/s: parser, plugin
    • Labels:
      None

      Description

      The second release of Tika 2 (2.1.0) is already available. Based on the 2.0 release notes and the migration guide:

      • Tika 2 is more modular, which should allow us to build a smaller parse-tika plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users should still be able to include those parsers if they build Nutch from source.
      • The language-identifier plugin needs to be upgraded as well (in addition to Nutch core and the parse-tika plugin). This would include or overlap with NUTCH-2449.
      • To keep the PDF parser from timing out, we probably want to disable OCR by default, or at least provide a configuration snippet for this purpose.
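
      As a starting point for the OCR item, one possible tika-config.xml fragment is sketched below. It assumes that OCR is done by Tika's TesseractOCRParser and that excluding this parser from the DefaultParser is enough to keep the PDF parser from invoking it. How parse-tika would pick up this file, and whether it becomes the default, is not decided here.

      ```xml
      <?xml version="1.0" encoding="UTF-8"?>
      <!-- Sketch only: exclude the Tesseract OCR parser so that no parser
           (including the PDF parser) hands embedded images to OCR. -->
      <properties>
        <parsers>
          <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
          </parser>
        </parsers>
      </properties>
      ```

      Users who want OCR could drop the parser-exclude line again, which would fit the goal of keeping optional functionality available to those who build Nutch themselves.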

              People

              • Assignee:
                snagel Sebastian Nagel
                Reporter:
                snagel Sebastian Nagel
              • Votes:
                0
                Watchers:
                2