TIKA-546

Add ability to create language profiles to tika-app

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: cli, languageidentifier
    • Labels: None

      Description

      Since TIKA-490, adding new language profiles to Tika is supposed to be easy. However, the process currently involves running Nutch's NGramProfile tool and then editing its output.

      We should port Nutch's profile builder to Tika and make it part of tika-app.jar:

      1. See http://wiki.apache.org/nutch/LanguageIdentifier
      2. java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

      With --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles and thus deprecate the Nutch tool. The defaults should be --gramsizes=3 --maxlines=1000.
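
      To illustrate what the two proposed options would control, here is a minimal Python sketch of an n-gram profile builder. It is not the Nutch or Tika implementation: the function name build_profile, the lower-casing, and the '_' word-boundary padding are assumptions made for this example, and the real profile file format is not shown.

```python
from collections import Counter
from typing import Iterable, Sequence


def build_profile(lines: Iterable[str],
                  gram_sizes: Sequence[int] = (3,),
                  max_lines: int = 1000) -> Counter:
    """Count character n-grams over at most max_lines lines of text.

    gram_sizes and max_lines mirror the proposed --gramsizes and
    --maxlines options; everything else here is a hypothetical sketch.
    """
    counts: Counter = Counter()
    for line_number, line in enumerate(lines):
        if line_number >= max_lines:
            break
        # Assumption: lower-case the text and pad each word with '_'
        # so that n-grams can capture word boundaries.
        for word in line.lower().split():
            padded = f"_{word}_"
            for n in gram_sizes:
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = ["the cat", "the hat"]
    # Print the most frequent trigrams of the tiny sample corpus.
    for gram, count in build_profile(sample).most_common(3):
        print(gram, count)
```

      Passing several sizes, for example gram_sizes=(1, 2, 3), collects grams of every listed length into the same profile, which is how one tool could plausibly cover both profile styles mentioned above.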

          People

          • Assignee: Chris A. Mattmann
          • Reporter: Jan Høydahl