Tika / TIKA-546

Add ability to create language profiles to tika-app

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: cli, languageidentifier
    • Labels: None

      Description

      Since TIKA-490, adding new language profiles to Tika is supposed to be easy. However, the process currently involves running Nutch's NGramProfile tool and then editing its output.

      We should port Nutch's profile builder to Tika and make it part of tika-app.jar:

      1. See http://wiki.apache.org/nutch/LanguageIdentifier
      2. java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

      With --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles and thus deprecate the Nutch tool. The defaults should be --gramsizes=3 and --maxlines=1000.
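      To make the proposal concrete, here is a minimal sketch of what the two options would control: which n-gram lengths are counted and how many input lines are read. The class and method names (NGramProfileSketch, parseGramSizes, countNGrams) are hypothetical and are not taken from the Tika or Nutch code; lowercasing the input is also an assumption, and the on-disk profile format is deliberately left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the actual Tika or Nutch profile builder.
final class NGramProfileSketch {

    // Parses the value of a --gramsizes option, e.g. "1,2,3" -> {1, 2, 3}.
    static int[] parseGramSizes(String value) {
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    // Counts character n-grams of every requested size,
    // reading at most maxLines lines of the input.
    static Map<String, Integer> countNGrams(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new HashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            // Assumption: case is folded before counting.
            String text = lines.get(i).toLowerCase(Locale.ROOT);
            for (int n : gramSizes) {
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

      With the proposed defaults, this would amount to calling countNGrams(lines, parseGramSizes("3"), 1000) on the lines of <filename> decoded with <encoding>. A Nutch-style profile would presumably pass several gram sizes instead of one.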

            People

            • Assignee: Chris A. Mattmann (chrismattmann)
            • Reporter: Jan Høydahl (janhoy)