OpenNLP / OPENNLP-1182

Improve error handling in LanguageDetectorConverterTool

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.4
    • Fix Version/s: 2.1.1
    • Component/s: Language Detector
    • Labels: None

    Description

      Contrary to the docs (see below), LanguageDetectorConverterTool doesn't actually do anything at all; the class is empty.

      The following sequence of commands shows how to convert the Leipzig Corpora collection in the folder leipzig-train/ to the default Language Detector format, by creating groups of 5 sentences as documents and limiting to 10000 documents per language. Then, it shuffles the result and selects the first 100000 lines as the training corpus and the last 20000 as the evaluation corpus:

      					
      $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
      $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
      $ head -100000 < leipzig_shuf.txt > leipzig.train
      $ tail -20000 < leipzig_shuf.txt > leipzig.eval
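
      Because the converter can exit without writing anything, as reported here, the shuffle and split steps would silently produce empty train and eval files. The following is a minimal sketch of a guard around those steps, not part of OpenNLP: `split_corpus` is a hypothetical helper name, and it assumes GNU coreutils `shuf` in place of the perl one-liner.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenNLP): shuffle a converted corpus
# and split it into train and eval files, failing loudly on empty input.
split_corpus() {
  input=$1      # converter output, e.g. leipzig.txt
  train_n=$2    # number of lines taken from the top for training
  eval_n=$3     # number of lines taken from the bottom for evaluation
  base=${input%.txt}

  # Guard against the symptom in this issue: a converter run that wrote nothing.
  if [ ! -s "$input" ]; then
    echo "error: $input is missing or empty; did the converter produce output?" >&2
    return 1
  fi

  # shuf is assumed to be available (GNU coreutils).
  shuf "$input" > "${base}_shuf.txt" || return 1
  head -n "$train_n" "${base}_shuf.txt" > "${base}.train"
  tail -n "$eval_n" "${base}_shuf.txt" > "${base}.eval"
}
```

      Calling `split_corpus leipzig.txt 100000 20000` mirrors the commands above. If the shuffled file has fewer lines than the two counts combined, the train and eval sets overlap, so the line count is worth checking first.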
      

          People

            aarora Atita Arora
            sarowe Steven Rowe
            Votes: 0
            Watchers: 3