OpenNLP / OPENNLP-819

Leipzig corpus reader should be able to train a language identification model


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: Doccat
    • Labels: None

    Description

      In its current state, the Leipzig corpus reader can read from only one language file. To create a model that can detect many languages, all the input files must first be converted and merged together.

      It would be much easier to train a language identification model if the corpus reader could read many sentence files from a directory.

      This issue will change the Leipzig reader to read all sentence files in a specified directory. The language category should be extracted from the file name itself.
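
      The proposed behaviour could be sketched roughly as follows. This is not the OpenNLP implementation and uses no OpenNLP classes; the class and method names (LeipzigDirectorySketch, Sample, languageFromFileName, readDirectory) are hypothetical. It assumes that sentence files end in "-sentences.txt", that the language code is the part of the file name before the first underscore (for example "eng" in "eng_news_2010_10K-sentences.txt"), and that each line holds a sentence number, a tab, and the sentence text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the OpenNLP reader.
final class LeipzigDirectorySketch {

    /** One training sample: a language code plus one sentence. */
    static final class Sample {
        public final String language;
        public final String text;

        Sample(String language, String text) {
            this.language = language;
            this.text = text;
        }
    }

    // Assumption: the language code is everything before the first underscore.
    static String languageFromFileName(String fileName) {
        int sep = fileName.indexOf('_');
        if (sep <= 0) {
            throw new IllegalArgumentException("No language prefix in: " + fileName);
        }
        return fileName.substring(0, sep);
    }

    // Reads every "*-sentences.txt" file in dir and labels each sentence
    // with the language taken from the file name.
    static List<Sample> readDirectory(Path dir) throws IOException {
        List<Sample> samples = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*-sentences.txt")) {
            for (Path file : files) {
                String language = languageFromFileName(file.getFileName().toString());
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    // Assumption: "<sentence number>\t<sentence>"; other lines are skipped.
                    int tab = line.indexOf('\t');
                    if (tab < 0) {
                        continue;
                    }
                    String text = line.substring(tab + 1).trim();
                    if (!text.isEmpty()) {
                        samples.add(new Sample(language, text));
                    }
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LeipzigDirectorySketch <directory>");
            return;
        }
        for (Sample s : readDirectory(Paths.get(args[0]))) {
            System.out.println(s.language + "\t" + s.text);
        }
    }
}
```

      With this layout, adding another language to the training data only requires dropping one more sentence file into the directory; no conversion or merging step is needed.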

          People

            Assignee: Jörn Kottmann (joern)
            Reporter: Jörn Kottmann (joern)
