[OPENNLP-819] Leipzig corpus reader should be able to train a language identification model - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7.0
Component/s: Doccat
Labels: None

Description

In its current state, the Leipzig corpus reader can only read from a single language file. To create a model that can detect many languages, all the input files must first be converted and merged together.

It would be much easier to train a language identification model if the corpus reader could simply read many sentences files from a directory.

This issue will change the Leipzig reader to read from all sentences files in a specified directory. The language category should be extracted from the file name itself.
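The proposed behaviour could be sketched roughly as follows. This is only an illustration in plain JDK code, not the OpenNLP implementation: the class `LeipzigDirectorySketch`, its `Sample` type, and both method names are hypothetical. It also assumes that sentences files are named like `eng_news_2010_300K-sentences.txt`, with the language code before the first underscore, and that each line holds a sentence number, a tab, and the sentence text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from OpenNLP.
final class LeipzigDirectorySketch {

    /** One training sample: a language label plus one sentence. */
    static final class Sample {
        final String language;
        final String text;

        Sample(String language, String text) {
            this.language = language;
            this.text = text;
        }
    }

    /**
     * Extracts the language category from a file name.
     * Assumption: the language code is the prefix before the first underscore.
     */
    public static String languageFromFileName(String fileName) {
        int sep = fileName.indexOf('_');
        if (sep <= 0) {
            throw new IllegalArgumentException("No language prefix in: " + fileName);
        }
        return fileName.substring(0, sep);
    }

    /** Reads every "*sentences.txt" file in the directory into labelled samples. */
    public static List<Sample> readDirectory(Path dir) throws IOException {
        List<Sample> samples = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*sentences.txt")) {
            for (Path file : files) {
                String language = languageFromFileName(file.getFileName().toString());
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    // Assumption: "<number>\t<sentence>"; drop the leading number if present.
                    int tab = line.indexOf('\t');
                    String sentence = (tab >= 0 ? line.substring(tab + 1) : line).trim();
                    if (!sentence.isEmpty()) {
                        samples.add(new Sample(language, sentence));
                    }
                }
            }
        }
        return samples;
    }
}
```

With this layout, adding another language to the training data would only require dropping one more sentences file into the directory; no conversion or merging step is needed beforehand.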

People

Assignee: Jörn Kottmann

Reporter: Jörn Kottmann

Dates

Created: 17/Sep/15 12:11

Updated: 17/Sep/15 13:02

Resolved: 17/Sep/15 13:02