TIKA-546: Add ability to create language profiles to tika-app

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: cli, languageidentifier
    • Labels: None

    Description

      Since TIKA-490, adding new language profiles to Tika is supposed to be easy. However, the process currently involves running Nutch's NGramProfile tool and then editing its output.

      We should port Nutch's profile builder to Tika and make it part of tika-app.jar:

      1. See http://wiki.apache.org/nutch/LanguageIdentifier
      2. java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

      With --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles and thus deprecate the Nutch tool. The defaults should be --gramsizes=3 --maxlines=1000.
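
To make the two proposed options concrete, here is a minimal sketch of what a profile builder might do with them: count character n-grams of the requested sizes over at most the first --maxlines lines of the input. The class name NGramProfileSketch, the method countNGrams, and the lowercasing step are assumptions for illustration only. This is not the Nutch NGramProfile code or the Tika API, and it does not write any particular profile file format.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of Tika or Nutch.
public class NGramProfileSketch {

    // Defaults proposed in the issue: --gramsizes=3 --maxlines=1000
    static final int[] DEFAULT_GRAM_SIZES = {3};
    static final int DEFAULT_MAX_LINES = 1000;

    /** Counts character n-grams of the given sizes in the first maxLines lines. */
    static Map<String, Integer> countNGrams(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(lines.size(), maxLines);
        for (int i = 0; i < limit; i++) {
            // Assumption: n-grams are case-insensitive, so lowercase each line first.
            String text = lines.get(i).toLowerCase(Locale.ROOT);
            for (int n : gramSizes) {
                // Slide a window of width n over the line and count each n-gram.
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("banana", "bandana");
        Map<String, Integer> profile =
                countNGrams(sample, DEFAULT_GRAM_SIZES, DEFAULT_MAX_LINES);
        // Print one n-gram and its count per line, tab-separated.
        profile.forEach((gram, count) -> System.out.println(gram + "\t" + count));
    }
}
```

Passing several sizes, for example new int[]{1, 2, 3}, puts all of those n-gram lengths into one map, which is how a single tool could plausibly serve both profile styles mentioned above.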

      Attachments

        1. TIKA-546.tikhonov.18042011.PATCH
          411 kB
          Oleg Tkachenko

          People

            Assignee: chrismattmann Chris A. Mattmann
            Reporter: janhoy Jan Høydahl