Tika / TIKA-1343

Create a Tika Translator implementation that uses JoshuaDecoder

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15
    • Component/s: translation
    • Labels: None

    Description

      The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system hosted on GitHub:

      http://joshua-decoder.org/

      Joshua takes in corpora and trains models that can then be used for language translation. Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English, and a few others.

      https://github.com/joshua-decoder/joshua/

      It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this:

      • The models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua or something) to release them, and we'll need to build the models ourselves. I just went through the process of building the Spanish->English one; it took over a day, and it still needs to be rebuilt because I did it wrong.
      • Joshua has its own configuration, so we need some way of passing that config into the Translator. I'm not sure of the best way to do this.
      • Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0

      Anyhoo, I've got a working patch right now with hard-coded values, and a manual install into my Maven repo for brave souls out there who want to try it.
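
      On the config question, one possible shape (a rough sketch only, not the attached patch) is to read the Joshua settings from a properties file and hand them to the translator at construction time. Everything named here is an assumption made for illustration: the JoshuaSettings and JoshuaTranslatorSketch classes, the translator.joshua.* property keys, and the decoder hook, which stands in for whatever call actually runs Joshua. The translate/isAvailable methods only mirror the general shape of Tika's Translator interface; the sketch does not implement it, so it stays dependency-free.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.function.UnaryOperator;

/** Hypothetical settings holder; the property keys below are made up for this sketch. */
final class JoshuaSettings {
    final String configFile;  // path to the Joshua configuration file
    final String sourceLang;  // language the trained model translates from
    final String targetLang;  // language the trained model translates to

    private JoshuaSettings(String configFile, String sourceLang, String targetLang) {
        this.configFile = configFile;
        this.sourceLang = sourceLang;
        this.targetLang = targetLang;
    }

    /** Reads settings in java.util.Properties format, e.g. from a classpath resource. */
    static JoshuaSettings load(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String config = props.getProperty("translator.joshua.config");
        if (config == null || config.isEmpty()) {
            throw new IllegalArgumentException("translator.joshua.config is required");
        }
        // Defaults are arbitrary placeholders for the Spanish->English model.
        return new JoshuaSettings(config,
                props.getProperty("translator.joshua.source", "es"),
                props.getProperty("translator.joshua.target", "en"));
    }
}

/** Translator sketch; the decoder function is a stand-in for the real Joshua call. */
final class JoshuaTranslatorSketch {
    private final JoshuaSettings settings;
    private final UnaryOperator<String> decoder;

    JoshuaTranslatorSketch(JoshuaSettings settings, UnaryOperator<String> decoder) {
        this.settings = settings;
        this.decoder = decoder;
    }

    /** True only when a decoder has been wired in. */
    boolean isAvailable() {
        return decoder != null;
    }

    /** Returns the input unchanged when the language pair or decoder is unavailable. */
    String translate(String text, String sourceLanguage, String targetLanguage) {
        boolean supported = settings.sourceLang.equals(sourceLanguage)
                && settings.targetLang.equals(targetLanguage);
        if (!isAvailable() || !supported) {
            return text;
        }
        return decoder.apply(text);
    }
}
```

      Keeping the decoder behind a plain function means the config handling can be tried out without the models or the Joshua jar on the classpath; a real implementation would replace that hook with the actual decoder invocation.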

            People

              Assignee: Lewis John McGibbney (lewismc)
              Reporter: Chris A. Mattmann (chrismattmann)
              Votes: 0
              Watchers: 5
