Tika / TIKA-546

Add ability to create language profiles to tika-app

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.10
    • Component/s: cli, languageidentifier
    • Labels: None

      Description

      Since TIKA-490, adding new language profiles to Tika is supposed to be easy. However, the process currently involves running Nutch's NGramProfile tool and then editing its output.

      We should port Nutch's profile builder to Tika and make it part of tika-app.jar:

      1. See http://wiki.apache.org/nutch/LanguageIdentifier
      2. java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

      Using --gramsizes and --maxlines, we could support both Tika-style profiles and Nutch-style profiles and thus deprecate the Nutch tool. Defaults should be --gramsizes=3 --maxlines=1000
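
      To illustrate what --gramsizes and --maxlines would control, here is a minimal sketch of n-gram counting over the first lines of a text. It is not the Nutch NGramProfile code or any Tika API: the class name NGramProfileSketch, the method countNGrams, and the normalization (lower-casing and collapsing non-letters into '_') are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Nutch or Tika implementation.
public class NGramProfileSketch {

    /**
     * Counts character n-grams of the given sizes in the first maxLines lines.
     * gramSizes plays the role of --gramsizes, maxLines the role of --maxlines.
     */
    public static Map<String, Integer> countNGrams(
            List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            // Assumed normalization: pad with '_', lower-case, and collapse
            // every run of non-letter characters into a single '_'.
            String text = ("_" + lines.get(i) + "_")
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("[^\\p{L}]+", "_");
            for (int n : gramSizes) {
                // Slide a window of length n across the normalized line.
                for (int start = 0; start + n <= text.length(); start++) {
                    counts.merge(text.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Mirrors the proposed defaults: --gramsizes=3 --maxlines=1000
        List<String> sample = Arrays.asList("Hello world", "hello again");
        Map<String, Integer> profile = countNGrams(sample, new int[] {3}, 1000);
        profile.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

      Passing several sizes (for example new int[] {1, 2, 3, 4}) would produce a mixed-size profile, which is the kind of flexibility the --gramsizes option is meant to expose.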

        Activity

        Chris A. Mattmann added a comment -

        I think Oleg took care of this, so I'm marking it as resolved. If not, please open another issue and link back. Thanks!

        Jan Høydahl added a comment -

        What's the state of this issue? It says "unresolved" but something is committed?

        Oleg Tikhonov added a comment -

        Committed the Language Profiler Builder and its test in revision 1147277. TikaCLI was also changed to add the '--create-profile' option. Tests are provided as well.

        Chris A. Mattmann added a comment -

        Will likely work on this later in the week, got a little bogged down...

        Joseph Vychtrle added a comment -

        Hey Chris,

        have you managed to think this through?

        Chris A. Mattmann added a comment -

        I'll take a crack at this.

        Joseph Vychtrle added a comment -

        How come NGramProfile.java is not in Tika's trunk? Nor is it in Nutch...

        Joseph Vychtrle added a comment -

        Guys, is anybody going to commit the patch? Or what is the resolution for generating language profiles via NGramProfile, adding the ngram file to the classpath, and overriding properties? Is this all that needs to be done to add a language profile? If so, I'd apply the patch to my local copy.

        Oleg Tikhonov added a comment -

        1. Added NGramProfile
        2. Added an option to TikaCLI, --createProfile, with default values gramsize = 3 and maxlines = 1000. Currently there is no option to change them because of the LanguageProfile implementation.
        4. Added NGramProfileTest
        5. Added TikaCLI test
        Could anybody have a look at the patch?

        Sami Siren added a comment -

        Do we build the "LanguageProfilerBuilder" from Nutch code here locally and ship it as a binary package/library or as part of an mvn install task / ant task?

        I would just do what Jan suggested: get the relevant source files from Nutch, modify them as needed (e.g. remove dependencies) and commit them to the Tika svn repository.

        Oleg Tikhonov added a comment -

        I would like to discuss this here a little. What is an appropriate way to do it? Do we build the "LanguageProfilerBuilder" from Nutch code here locally and ship it as a binary package/library or as part of an mvn install task / ant task? The Nutch language package itself depends on other libraries.


          People

          • Assignee: Chris A. Mattmann
          • Reporter: Jan Høydahl
          • Votes: 1
          • Watchers: 0
