OK, I looked at this some more. So the Java code you contributed is ASL, and Apertium's tools (and data?) are GPL v2?
The thing that puzzles me is the language pairs themselves. Why are they in pairs? Is that simply for the translation part of Apertium, and something that's ignored when you use the pair for Lucene and morphological analysis?
If I'm interested in, say, a French morphological analyzer, why do I need any other language? For French, I see:
If I'm interested in French, which of the 4 above is the right one to use? The one with the highest number of lemmata?
I had a look at the Indexer and Searcher to get an idea about the usage. Those classes are really just for demonstration, right? Still, do you mind replacing the deprecated Hits object in the Searcher class?
In the README you mention this:
2. The Spanish morphological dictionary must be preprocessed in advance to remove multiword expressions:
$ java -classpath lucene-apertium-morph-2.4-dev.jar \
--dix apertium-es-ca.es.dix > apertium-es-ca.es-nomw.dix
Could you explain why the removal of multiword expressions is needed?
Is that Spanish-specific or something one needs to do regardless of the language?
4. Each file to be indexed must be preprocessed using the Apertium tools:
$ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | apertium-tagger -g -f es-ca.prob > file.pos.txt
So these are a few command-line tools that end up marking up the input text with POS? (I seem to be missing some libraries and can't compile Apertium locally to check what this marked-up file looks like.)
But my main question here is whether there are Java equivalents of these command-line tools, so that one can easily use them from Java? Or is one forced to use Runtime.exec(...)?
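To make the question concrete, here is a minimal sketch of what the process-spawning route might look like if there are no Java equivalents. It is not part of the contributed code: the class name ExternalPipeline and its run method are hypothetical, and it assumes a JDK 9 or later (for ProcessBuilder.startPipeline and InputStream.readAllBytes) with the external tools available on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: chains external commands the
// way a shell pipe does and returns the last command's standard output.
public class ExternalPipeline {

    public static String run(String input, List<List<String>> commands)
            throws IOException, InterruptedException {
        List<ProcessBuilder> builders = new ArrayList<>();
        for (List<String> command : commands) {
            // Let each tool's stderr go to our own stderr so diagnostics are visible.
            builders.add(new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT));
        }

        // Connects stdout of each process to stdin of the next one.
        List<Process> processes = ProcessBuilder.startPipeline(builders);
        Process first = processes.get(0);
        Process last = processes.get(processes.size() - 1);

        // Feed the input on a separate thread; writing and reading on the same
        // thread can deadlock once the OS pipe buffers fill up.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = first.getOutputStream()) {
                stdin.write(input.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                // The first process exited before consuming all input;
                // its exit code is checked below.
            }
        });
        writer.start();

        String output;
        try (InputStream stdout = last.getInputStream()) {
            output = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
        }
        writer.join();

        for (Process process : processes) {
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Command exited with code " + exitCode);
            }
        }
        return output;
    }
}
```

With the commands from step 4 of the README, the call would presumably be ExternalPipeline.run(text, List.of(List.of("apertium-destxt"), List.of("lt-proc", "-a", "es-ca-nomw.automorf.bin"), List.of("apertium-tagger", "-g", "-f", "es-ca.prob"))). That means three extra processes per document, which is why a pure-Java alternative would be preferable.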