Details
Bug
Status: Closed
Minor
Resolution: Fixed
1.8.4
None
Description
Contrary to the docs (see below), LanguageDetectorConverterTool doesn't actually do anything at all; the class is empty.
The following sequence of commands shows how to convert the Leipzig Corpora collection in the folder leipzig-train/ to the default Language Detector format, by grouping sentences into documents of 5 sentences each and limiting the output to 10000 documents per language. Then, it shuffles the result and selects the first 100000 lines as the training corpus and the last 20000 as the evaluation corpus:
$ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
$ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
$ head -100000 < leipzig_shuf.txt > leipzig.train
$ tail -20000 < leipzig_shuf.txt > leipzig.eval
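For reference, the behavior the docs describe for the first command can be sketched roughly as follows. This is only an illustration in Python, not the OpenNLP implementation: the function names group_sentences and to_lines are hypothetical, the one-sample-per-line "language<TAB>text" output layout is an assumption about the default Language Detector format, and dropping an incomplete trailing group is likewise an assumption.

```python
from itertools import islice


def group_sentences(sentences_by_lang, sentences_per_sample=5,
                    samples_per_language=10000):
    """Yield (language, document) pairs.

    Each document joins `sentences_per_sample` consecutive sentences,
    and at most `samples_per_language` documents are produced per language.
    """
    for lang, sentences in sentences_by_lang.items():
        it = iter(sentences)
        produced = 0
        while produced < samples_per_language:
            chunk = list(islice(it, sentences_per_sample))
            if len(chunk) < sentences_per_sample:
                # Assumption: an incomplete trailing group is discarded.
                break
            yield lang, " ".join(chunk)
            produced += 1


def to_lines(samples):
    """Format samples one per line as 'language<TAB>text' (assumed layout)."""
    return [f"{lang}\t{text}" for lang, text in samples]


if __name__ == "__main__":
    # Toy input standing in for the sentences read from leipzig-train/.
    toy = {"eng": [f"Sentence {i}." for i in range(12)]}
    for line in to_lines(group_sentences(toy, 5, 10000)):
        print(line)
```

With the toy input above, the 12 sentences give two documents of 5 sentences each, and the remaining 2 sentences are dropped under the stated assumption.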