Details
- New Feature
- Status: Resolved
- Major
- Resolution: Fixed
- None
- None
Description
The Joshua Decoder toolkit is a BSD-licensed, Java-based statistical machine translation system hosted at GitHub:
https://github.com/joshua-decoder/joshua/
Joshua takes in corpora and trains models that can then be used to do language translation. Currently there is support for, e.g., Spanish->English, Indian dialects->English, Chinese->English, and a few others.
It would be nice to build a Tika Translator on top of Joshua. There are of course several issues with this:
- The models are huge, so we'll need a separate package or Maven module (maybe tika-translate-joshua or something) to release the models, and we'll need to build the models. I just went through the process of building the Spanish->English one; it took over a day, and it still needs to be rebuilt b/c I did it wrong.
- There is a configuration for Joshua, so we need some way of passing that config into the Translator. Not sure of the best way to do this.
- Joshua isn't in the Central repository. I've started a discussion on the Joshua lists about this: https://groups.google.com/forum/#!topic/joshua_support/9Y04miboUj0
Anyhoo, I've got a working patch right now with hard-coded stuff, and a manual install into my Maven repo, for brave souls out there who want to try it.
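For the config question above, one option would be a small properties-backed holder that the Translator reads at construction time. This is only a rough sketch and not what the patch does: the class name `JoshuaTranslatorConfig`, the property keys `joshua.model.dir` and `joshua.config.file`, and the default file name `joshua.config` are all made up here for illustration, and neither Tika nor Joshua defines them.

```java
import java.util.Properties;

/**
 * Hypothetical holder for the settings a Joshua-backed Translator would need.
 * All names and keys below are assumptions made for this sketch.
 */
final class JoshuaTranslatorConfig {
    // Hypothetical property keys.
    static final String MODEL_DIR_KEY = "joshua.model.dir";
    static final String CONFIG_FILE_KEY = "joshua.config.file";
    // Assumed default name of the Joshua config file inside the model directory.
    static final String DEFAULT_CONFIG_FILE = "joshua.config";

    private final String modelDir;   // null when no model location was configured
    private final String configFile;

    private JoshuaTranslatorConfig(String modelDir, String configFile) {
        this.modelDir = modelDir;
        this.configFile = configFile;
    }

    /** Builds the config from already-loaded properties, applying the default file name. */
    static JoshuaTranslatorConfig fromProperties(Properties props) {
        String modelDir = props.getProperty(MODEL_DIR_KEY);
        String configFile = props.getProperty(CONFIG_FILE_KEY, DEFAULT_CONFIG_FILE);
        return new JoshuaTranslatorConfig(modelDir, configFile);
    }

    /** True only when a non-blank model directory was supplied. */
    boolean isAvailable() {
        return modelDir != null && !modelDir.trim().isEmpty();
    }

    String getModelDir() {
        return modelDir;
    }

    String getConfigFile() {
        return configFile;
    }
}
```

The Translator could then load a properties file from the classpath with `Properties.load`, pass it to `fromProperties`, and report itself as unavailable when `isAvailable()` on the config is false, so that a missing model package degrades gracefully instead of failing at translate time.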
Issue Links
- is depended upon by SOLR-8714 Implement translation contrib package for LanguageTranslationUpdateProcessor's (Open)