  1. OpenNLP
  2. OPENNLP-1115

Document Categorizer all events dropped

Details

    • Type: Question
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.7.2
    • Fix Version/s: None
    • Component/s: Doccat
    • Labels: None

    Description

      Hi all,
      I'm trying to perform my first (newbie) document categorization for the Italian language.
      I'm using the attached training file and I got this output:

      {{$ ./opennlp.bat DoccatTrainer -model it-doccat.bin -lang it -data "C:\Users\adepase\MPSProjects\MrJEditor\languages\MrJEditor\sandbox\source_gen\MrJEditor\sandbox\Train1.train" -encoding UTF-8
      Indexing events using cutoff of 5

      Computing event counts... done. 12 events
      Indexing... Dropped event Ok:[bow=ok]
      Dropped event Ok:[bow=tutto, bow=bene]
      Dropped event Ok:[bow=decisamente, bow=non, bow=male]
      Dropped event Ok:[bow=fantastica, bow=scelta]
      Dropped event Ok:[bow=non, bow=pensavo, bow=di, bow=poter, bow=essere, bow=così, bow=contento]
      Dropped event Ok:[bow=certamente, bow=un'ottimo, bow=risultato]
      Dropped event no:[bow=non, bow=va, bow=affatto, bow=bene]
      Dropped event no:[bow=per, bow=nulla]
      Dropped event no:[bow=niente, bow=affatto, bow=divertente]
      Dropped event no:[bow=va, bow=malissimo]
      Dropped event no:[bow=va, bow=decisamente, bow=male]
      Dropped event no:[bow=sono, bow=molto, bow=triste]
      done.
      Sorting and merging events...

      ERROR: Not enough training data
      The provided training data is not sufficient to create enough events to train a model.
      To resolve this error use more training data, if this doesn't help there might
      be some fundamental problem with the training data itself.}}

      I already found a couple of similar issues. They say either that there are not enough lines (but I have 6 lines for each category and a cutoff of 5), or that with fewer than 100 lines the categorization quality is not sufficient (fine, but that is only a quality matter: training should still work, even if the results are bad). Here the reason for the insufficient data is that all the lines are dropped.
      I also tried with the Java API and got the same result.
      But why? What did I miss? I cannot find any useful documentation...
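
One way to look at the log above is to count how often each `bow=` feature occurs. The sketch below does that in plain Java with no OpenNLP dependency. It rests on two assumptions that the log suggests but this report does not confirm: that the cutoff is applied to each feature's occurrence count (a feature is kept when its count is at least the cutoff) rather than to the number of lines per category, and that an event is dropped once none of its features is kept. The class and method names (`CutoffCheck`, `countFeatures`, `survivingEvents`) are made up for illustration and are not part of OpenNLP.

```java
import java.util.Map;
import java.util.TreeMap;

class CutoffCheck {
    // Features copied from the "Dropped event" lines of the log (6 "Ok" events, then 6 "no" events).
    static final String[][] EVENTS = {
        {"ok"},
        {"tutto", "bene"},
        {"decisamente", "non", "male"},
        {"fantastica", "scelta"},
        {"non", "pensavo", "di", "poter", "essere", "cos\u00ec", "contento"}, // "cos\u00ec" is "così"
        {"certamente", "un'ottimo", "risultato"},
        {"non", "va", "affatto", "bene"},
        {"per", "nulla"},
        {"niente", "affatto", "divertente"},
        {"va", "malissimo"},
        {"va", "decisamente", "male"},
        {"sono", "molto", "triste"}
    };

    // Count how many times each feature occurs across all events.
    static Map<String, Integer> countFeatures(String[][] events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Highest occurrence count of any single feature.
    static int maxCount(String[][] events) {
        int max = 0;
        for (int c : countFeatures(events).values()) {
            max = Math.max(max, c);
        }
        return max;
    }

    // ASSUMPTION: an event survives if at least one of its features has count >= cutoff.
    static int survivingEvents(String[][] events, int cutoff) {
        Map<String, Integer> counts = countFeatures(events);
        int kept = 0;
        for (String[] event : events) {
            for (String feature : event) {
                if (counts.get(feature) >= cutoff) {
                    kept++;
                    break; // one surviving feature is enough to keep the event
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("Most frequent feature count: " + maxCount(EVENTS));
        System.out.println("Events kept with cutoff 5: " + survivingEvents(EVENTS, 5));
        System.out.println("Events kept with cutoff 1: " + survivingEvents(EVENTS, 1));
    }
}
```

Under these assumptions the most frequent words ("non" and "va") occur only 3 times each, so no feature reaches a cutoff of 5 and all 12 events lose every feature, which would match the "Dropped event" lines. With a cutoff of 1 all 12 events would be kept.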

      Thank you in advance
      Kind Regards
      Alessandro

      Attachments

        1. Train1.train
          0.3 kB
          Alessandro Depase

          People

            Assignee: Unassigned
            Reporter: Alessandro Depase (alessandro.depase@libero.it)