  1. Nutch
  2. NUTCH-2891

Upgrade to Tika 2.1

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.18
    • Fix Version/s: 1.19
    • Component/s: parser, plugin
    • Labels: None

    Description

      Tika 2 has already seen its second release (2.1.0). Points to consider, based on the 2.0 release notes and the migration guide:

      • Tika 2 is more modular, which should allow us to build a smaller parse-tika plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users should still be able to include them if they build Nutch from source.
      • The language-identifier plugin needs to be upgraded as well (in addition to Nutch core and the parse-tika plugin). This would include or overlap with NUTCH-2449.
      • To keep the PDF parser from timing out, we probably want to disable OCR by default or, at least, provide a configuration snippet for this purpose.
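
      For the OCR point above, a minimal sketch of what such a configuration snippet could look like. It assumes the usual Tika mechanism of excluding TesseractOCRParser from the DefaultParser in a tika-config.xml; whether Nutch ships this by default, and where the file lives in the parse-tika plugin, is not settled by this issue and is an assumption here.

      ```xml
      <?xml version="1.0" encoding="UTF-8"?>
      <!-- Sketch only: a tika-config.xml that keeps all default parsers
           but excludes the Tesseract OCR parser, so that images and
           scanned PDF pages are never sent to OCR. -->
      <properties>
        <parsers>
          <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
          </parser>
        </parsers>
      </properties>
      ```

      With the OCR parser excluded, PDF text extraction relies on the embedded text layer only, which trades OCR coverage of scanned documents for predictable parse times. Users who need OCR could drop the parser-exclude line again.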

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4
