[OPENNLP-1182] Improve error handling in LanguageDetectorConverterTool - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.8.4
Fix Version/s: 2.1.1
Component/s: Language Detector
Labels: None

Description

Contrary to the docs (quoted below), LanguageDetectorConverterTool does nothing: the class is empty.

The following sequence of commands shows how to convert the Leipzig Corpora collection in the folder leipzig-train/ to the default Language Detector format, by grouping 5 sentences into each document and limiting the output to 10000 documents per language. Then it shuffles the result and selects the first 100000 lines as the training corpus and the last 20000 as the evaluation corpus:

$ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
$ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
$ head -100000 < leipzig_shuf.txt > leipzig.train
$ tail -20000 < leipzig_shuf.txt > leipzig.eval
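
One detail of the shuffle-and-split steps above: `head -100000` and `tail -20000` only yield disjoint sets when the shuffled file has at least 120000 lines; with fewer lines, some documents would land in both the training and the evaluation corpus. The following is a minimal sketch of a guarded split, not part of OpenNLP. The function name `split_corpus` and the `.shuf`/`.train`/`.eval` output suffixes are hypothetical, and it assumes GNU coreutils `shuf` is available in place of the perl one-liner.

```shell
# split_corpus INPUT TRAIN_N EVAL_N  (hypothetical helper, not an OpenNLP command)
# Shuffles INPUT, then writes the first TRAIN_N lines to INPUT.train and the
# last EVAL_N lines to INPUT.eval. Refuses to run if the two sets would overlap.
split_corpus() {
  input=$1
  train_n=$2
  eval_n=$3

  # Arithmetic expansion strips any padding that wc may add around the count.
  total=$(( $(wc -l < "$input") ))

  if [ "$total" -lt $(( train_n + eval_n )) ]; then
    echo "error: $input has $total lines, need at least $(( train_n + eval_n ))" >&2
    return 1
  fi

  # Assumes GNU coreutils shuf; the perl one-liner above is an alternative.
  shuf "$input" > "$input.shuf"
  head -n "$train_n" "$input.shuf" > "$input.train"
  tail -n "$eval_n" "$input.shuf" > "$input.eval"
}
```

Under these assumptions, `split_corpus leipzig.txt 100000 20000` would produce `leipzig.txt.train` and `leipzig.txt.eval`, or fail with a message if the converter emitted fewer than 120000 documents.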

People

Assignee: Atita Arora

Reporter: Steven Rowe

Votes: 0

Watchers: 3

Dates

Created: 11/Jan/18 20:58

Updated: 03/Jan/23 14:56

Resolved: 03/Jan/23 14:55