OpenNLP / OPENNLP-33

Write documentation for the document categorizer component

    Details

      Description

      Write initial documentation for the document categorizer component.

      This issue was migrated from SourceForge:
      https://sourceforge.net/tracker/?func=detail&aid=3028436&group_id=3368&atid=103368

      Attachments

      1. doccat_commandline_documentation.rtf (14 kB, by Suresh Kumar Ramasamy)
      2. doccat_documentation.rtf (6 kB, by Dan Frank)

        Activity

        Joern Kottmann created issue -
        Joern Kottmann made changes -
        Fix Version/s: (none) -> tools-1.5.1-incubating [ 12315983 ]
        Joern Kottmann made changes -
        Fix Version/s: tools-1.5.1-incubating [ 12315983 ] -> (none)
        Dan Frank added a comment -

        RTF of documentation for doccat; adding it to the DocBook is considered a TODO.

        Dan Frank made changes -
        Attachment doccat_documentation.rtf [ 12470065 ]
        Joern Kottmann added a comment -

        There are a few questions inside the attached document.

        1. The maxent jar is still necessary since it contains the maxent classes, which are mainly used by DoccatModel for serializing the embedded maxent binary model and by DocumentCategorizerME to perform training and categorization.

        2. The training format is one document per line: the first token is the category, and all other whitespace-separated tokens are document tokens. The DocumentSample constructor also expects whitespace-tokenized input text.

        3. The parsing code you describe mostly already exists in DocumentSampleStream, which can parse the format described above (see the sketch after this comment).
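        A minimal sketch of how the pieces mentioned above fit together, assuming the 1.5.x API (PlainTextByLineStream, DocumentSampleStream, DocumentCategorizerME.train); the training file name and the example category/tokens are placeholders:

            import java.io.FileInputStream;
            import java.io.InputStream;

            import opennlp.tools.doccat.DoccatModel;
            import opennlp.tools.doccat.DocumentCategorizerME;
            import opennlp.tools.doccat.DocumentSample;
            import opennlp.tools.doccat.DocumentSampleStream;
            import opennlp.tools.util.ObjectStream;
            import opennlp.tools.util.PlainTextByLineStream;

            public class DoccatSketch {

                public static void main(String[] args) throws Exception {
                    // Training data: one document per line, the first whitespace-separated
                    // token is the category, the remaining tokens are the document, e.g.
                    //   sports The match went to a fifth set .
                    InputStream dataIn = new FileInputStream("en-doccat.train");
                    try {
                        ObjectStream<String> lineStream =
                            new PlainTextByLineStream(dataIn, "UTF-8");

                        // DocumentSampleStream parses the line format above into
                        // DocumentSample objects (category + whitespace-tokenized text).
                        ObjectStream<DocumentSample> sampleStream =
                            new DocumentSampleStream(lineStream);

                        DoccatModel model = DocumentCategorizerME.train("en", sampleStream);

                        // Categorize a new, whitespace-tokenized document.
                        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
                        double[] outcomes = categorizer.categorize(
                            new String[] { "The", "match", "was", "decided", "late", "." });
                        System.out.println(categorizer.getBestCategory(outcomes));
                    } finally {
                        dataIn.close();
                    }
                }
            }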

        Suresh Kumar Ramasamy made changes -
        Attachment doccat_commandline_documentation.rtf [ 12479888 ]
        Joern Kottmann made changes -
        Assignee Jörn Kottmann [ joern ]
        Joern Kottmann made changes -
        Fix Version/s: (none) -> tools-1.5.2-incubating [ 12316400 ]
        Joern Kottmann added a comment -

        Thanks to Daniel Frank and Suresh Kumar Ramasamy for contributing the doccat documentation.

        Joern Kottmann made changes -
        Resolution: (none) -> Fixed [ 1 ]
        Status: Open [ 1 ] -> Closed [ 6 ]

          People

          • Assignee: Joern Kottmann
          • Reporter: Joern Kottmann
          • Votes: 0
          • Watchers: 0
