TIKA-490 it is supposed to be easy adding new language profiles to TIKA. However, currently the process involves using Nutch's NGramProfile tool and editing the output.
We should port Nutch's profile builder to Tika and make it part of tika-app.jar:
- See http://wiki.apache.org/nutch/LanguageIdentifier
- java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>
Using --gramsizes and --maxlines, we could support both Tika-style profiles and Nutch-style profiles and thus deprecate the Nutch tool. Defaults should be --gramsizes=3 --maxlines=1000