TIKA-546

Add ability to create language profiles to tika-app

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: cli, languageidentifier
    • Labels:
      None

      Description

      Since TIKA-490, adding new language profiles to Tika is supposed to be easy. Currently, however, the process involves running Nutch's NGramProfile tool and then editing its output.

      We should port Nutch's profile builder to Tika and make it part of tika-app.jar:

      1. See http://wiki.apache.org/nutch/LanguageIdentifier
      2. java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

      With --gramsizes and --maxlines, we could support both Tika-style and Nutch-style profiles and thus deprecate the Nutch tool. The defaults should be --gramsizes=3 and --maxlines=1000.
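To make the proposal concrete, here is a minimal sketch of what a profile builder behind --gramsizes and --maxlines might do: count character n-grams of each requested size over at most a given number of input lines. This is not the attached patch or Tika's actual implementation. The class name NGramProfileSketch, the method buildProfile, the lower-casing, and the '_' boundary padding are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and normalization rules are illustrative, not Tika's API.
class NGramProfileSketch {

    /**
     * Counts character n-grams of every size in gramSizes, reading at most
     * maxLines lines from the input. Returns n-gram -> occurrence count.
     */
    static Map<String, Integer> buildProfile(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            // Assumption: lower-case the line and pad it with '_' as a boundary marker.
            String text = "_" + lines.get(i).toLowerCase(Locale.ROOT) + "_";
            for (int n : gramSizes) {
                // Slide a window of width n across the padded line.
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Mirrors the proposed defaults: --gramsizes=3 --maxlines=1000
        Map<String, Integer> profile =
                buildProfile(Arrays.asList("banana"), new int[] {3}, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

Passing several sizes, for example new int[] {1, 2, 3, 4}, would produce a mixed-length profile in the same pass. Whether that matches the Nutch output format exactly is not something this sketch attempts to guarantee.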

        Activity

        Jukka Zitting made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]
        Chris A. Mattmann made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Fix Version/s: 0.10 [ 12313535 ]
          Resolution: Fixed [ 1 ]
        Chris A. Mattmann made changes -
          Assignee: Chris A. Mattmann [ chrismattmann ]
        Oleg Tikhonov made changes -
          Attachment: TIKA-546.tikhonov.18042011.PATCH [ 12476596 ]
        Jan Høydahl created issue -

          People

          • Assignee: Chris A. Mattmann
          • Reporter: Jan Høydahl
          • Votes: 1
          • Watchers: 0
