Details

      Description

      Proposed patch refactors CLI tools and simplifies the code by introducing hierarchy and removing a lot of code duplication. It also introduces better error and help messages, including help for formats and listing available formats in various tools, which are now able to work with formats directly. This, in turn, eliminates the need to keep converted files on disk.

      1. opennlp-cmdline-package-class-structure.png
        51 kB
        Aliaksandr Autayeu
      2. open-nlp-cli-package.png
        60 kB
        Aliaksandr Autayeu
      3. 0017-added-direct-format-support-to-CLI-tools-and-formats.patch
        397 kB
        Aliaksandr Autayeu
      4. 0016-CLI-tools-and-formats-refactored.patch
        250 kB
        Aliaksandr Autayeu

        Activity

        Hide
        Joern Kottmann added a comment -

        I am just curious. You said that you created a (git?) branch for this issue. Is it public, e.g on github or so?

        Show
        Joern Kottmann added a comment - I am just curious. You said that you created a (git?) branch for this issue. Is it public, e.g on github or so?
        Hide
        Aliaksandr Autayeu added a comment -

        I create branch for every patch I submit. I keep them in the cloned (from Apache Git mirror) OpenNLP repo on my machine. Why should I share them? Does it matter if code is accepted only here, only through patches. May be I miss something? Can you elaborate or explain?

        Show
        Aliaksandr Autayeu added a comment - I create branch for every patch I submit. I keep them in the cloned (from Apache Git mirror) OpenNLP repo on my machine. Why should I share them? Does it matter if code is accepted only here, only through patches. May be I miss something? Can you elaborate or explain?
        Hide
        Joern Kottmann added a comment -

        No it does not matter and that is fine. I just though that having the ability to easily make a branch which is publicly available could be very interesting for some things. As a committer I can always do that with subversion, but when I got it right this is possible for everyone with git. And that is why I asked how you are doing it.

        Show
        Joern Kottmann added a comment - No it does not matter and that is fine. I just though that having the ability to easily make a branch which is publicly available could be very interesting for some things. As a committer I can always do that with subversion, but when I got it right this is possible for everyone with git. And that is why I asked how you are doing it.
        Hide
        Aliaksandr Autayeu added a comment -

        For example, for which things?

        Anyway, if branch integration would make patch integration easier (oh, pull requests... , I might do that Subversion branches are cheap to do, expensive to merge. Git merges better. Afterall, the only reason to have a branch is to integrate it later on (or discard). Anyway, I'm open to advice which will make your integration life easier.

        Show
        Aliaksandr Autayeu added a comment - For example, for which things? Anyway, if branch integration would make patch integration easier (oh, pull requests... , I might do that Subversion branches are cheap to do, expensive to merge. Git merges better. Afterall, the only reason to have a branch is to integrate it later on (or discard). Anyway, I'm open to advice which will make your integration life easier.
        Hide
        Joern Kottmann added a comment -

        I once tried 3 - 4 different strategies to add multi-threading support to the maxent package, doing this just in different branches on a platform like github would have been nice.

        Next thing where I will use it for are hash features instead of our string features. I would like to make a branch which is public to get comments from others on that.

        Show
        Joern Kottmann added a comment - I once tried 3 - 4 different strategies to add multi-threading support to the maxent package, doing this just in different branches on a platform like github would have been nice. Next thing where I will use it for are hash features instead of our string features. I would like to make a branch which is public to get comments from others on that.
        Hide
        Joern Kottmann added a comment -

        How does this new format support work?

        Lets say I am training the name finder, thats how I do it:
        bin/opennlp TokenNameFinderCrossValidator -lang x -data x.train -featuregen xyz.xml -params xyz.txt

        After applying your patch I get this error and help message:
        Error: Missing mandatory parameter: -format
        Usage: opennlp TokenNameFinderCrossValidator [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] [-params paramsFile] [-iterations num] [-cutoff num] [-misclassified true|false] [-folds num] [-detailedF true|false] -format formatName

        Arguments description:
        -resources resourcesDir
        The resources directory
        -type modelType
        The type of the token name finder model
        -featuregen featuregenFile
        The feature generator descriptor file
        -params paramsFile
        Training parameters file.
        -iterations num
        specifies the number of training iterations. It is ignored if a parameters file is passed.
        -cutoff num
        specifies the min number of times a feature must be seen. It is ignored if a parameters file is passed.
        -misclassified true|false
        if true will print false negatives and false positives
        -folds num
        The number of folds. Default is 10
        -detailedF true|false
        if true will print detailed FMeasure results
        -format formatName
        the format of the data, for example conllx, defaults to opennlp. Format might have its own parameters.

        opennlp format usage: -lang language -data sampleData [-encoding charsetName]

        Arguments description:
        -lang language
        specifies the language which is being processed.
        -data sampleData
        the data to be used
        -encoding charsetName
        specifies the encoding which should be used for reading and writing text. If not specified the system default will be used.

        bionlp2004 format usage: -types DNA,protein,cell_type,cell_line,RNA -lang language -data sampleData [-encoding charsetName]

        Arguments description:
        -types DNA,protein,cell_type,cell_line,RNA
        -lang language
        specifies the language which is being processed.
        -data sampleData
        the data to be used
        -encoding charsetName
        specifies the encoding which should be used for reading and writing text. If not specified the system default will be used.

        conll03 format usage: -lang en|de -types per,loc,org,misc -data sampleData [-encoding charsetName]

        Arguments description:
        -lang en|de
        -types per,loc,org,misc
        -data sampleData
        the data to be used
        -encoding charsetName
        specifies the encoding which should be used for reading and writing text. If not specified the system default will be used.

        conll02 format usage: -lang es|nl -types per,loc,org,misc -data sampleData [-encoding charsetName]

        Arguments description:
        -lang es|nl
        -types per,loc,org,misc
        -data sampleData
        the data to be used
        -encoding charsetName
        specifies the encoding which should be used for reading and writing text. If not specified the system default will be used.

        ad format usage: -lang language -data sampleData [-encoding charsetName]

        -----------

        Specifying -format didn't help, do I need to have certain order? And shouldn't it default to opennlp?

        I don't mind having support for a format directly, but it should not make the default case difficult to use.

        One disadvantage of this feature is, that it is easy to make mistakes which cannot be noticed,
        e.g. encoding issues, problems in the format, etc. That may sound a bit strange, but some bugs in the UIMA
        training integration came only to my attention after I added a feature to dump the training data to a file in the opennlp format.

        Anyway, we still have the converters for debugging or the case that a user just wants to see the training data.

        Show
        Joern Kottmann added a comment - How does this new format support work? Lets say I am training the name finder, thats how I do it: bin/opennlp TokenNameFinderCrossValidator -lang x -data x.train -featuregen xyz.xml -params xyz.txt After applying your patch I get this error and help message: Error: Missing mandatory parameter: -format Usage: opennlp TokenNameFinderCrossValidator [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] [-params paramsFile] [-iterations num] [-cutoff num] [-misclassified true|false] [-folds num] [-detailedF true|false] -format formatName Arguments description: -resources resourcesDir The resources directory -type modelType The type of the token name finder model -featuregen featuregenFile The feature generator descriptor file -params paramsFile Training parameters file. -iterations num specifies the number of training iterations. It is ignored if a parameters file is passed. -cutoff num specifies the min number of times a feature must be seen. It is ignored if a parameters file is passed. -misclassified true|false if true will print false negatives and false positives -folds num The number of folds. Default is 10 -detailedF true|false if true will print detailed FMeasure results -format formatName the format of the data, for example conllx, defaults to opennlp. Format might have its own parameters. opennlp format usage: -lang language -data sampleData [-encoding charsetName] Arguments description: -lang language specifies the language which is being processed. -data sampleData the data to be used -encoding charsetName specifies the encoding which should be used for reading and writing text. If not specified the system default will be used. bionlp2004 format usage: -types DNA,protein,cell_type,cell_line,RNA -lang language -data sampleData [-encoding charsetName] Arguments description: -types DNA,protein,cell_type,cell_line,RNA -lang language specifies the language which is being processed. -data sampleData the data to be used -encoding charsetName specifies the encoding which should be used for reading and writing text. If not specified the system default will be used. conll03 format usage: -lang en|de -types per,loc,org,misc -data sampleData [-encoding charsetName] Arguments description: -lang en|de -types per,loc,org,misc -data sampleData the data to be used -encoding charsetName specifies the encoding which should be used for reading and writing text. If not specified the system default will be used. conll02 format usage: -lang es|nl -types per,loc,org,misc -data sampleData [-encoding charsetName] Arguments description: -lang es|nl -types per,loc,org,misc -data sampleData the data to be used -encoding charsetName specifies the encoding which should be used for reading and writing text. If not specified the system default will be used. ad format usage: -lang language -data sampleData [-encoding charsetName] ----------- Specifying -format didn't help, do I need to have certain order? And shouldn't it default to opennlp? I don't mind having support for a format directly, but it should not make the default case difficult to use. One disadvantage of this feature is, that it is easy to make mistakes which cannot be noticed, e.g. encoding issues, problems in the format, etc. That may sound a bit strange, but some bugs in the UIMA training integration came only to my attention after I added a feature to dump the training data to a file in the opennlp format. Anyway, we still have the converters for debugging or the case that a user just wants to see the training data.
        Hide
        Joern Kottmann added a comment -

        Do you know a cli tool which has a similar problem?
        We need to adjust the possible arguments based on the format the user is choosing.

        Just some idea, but couldn't we do it like this:
        bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...

        and the default is just like it is now:
        bin/opennlp TokenNameFinderCrossValidator -data ...

        Maybe the format separation with the dot looks a bit stupid,
        I will continue to think about it.

        Show
        Joern Kottmann added a comment - Do you know a cli tool which has a similar problem? We need to adjust the possible arguments based on the format the user is choosing. Just some idea, but couldn't we do it like this: bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ... and the default is just like it is now: bin/opennlp TokenNameFinderCrossValidator -data ... Maybe the format separation with the dot looks a bit stupid, I will continue to think about it.
        Hide
        Aliaksandr Autayeu added a comment -

        I have tested it with POS tagger and Tokenizer (and with my own formats as well, to avoid conversion, but I didn't include the support of third-party formats in this patch). Here is the command for conllx, which I also used for testing:

        opennlp POSTaggerTrainer -model pt.bin -format conllx -data portuguese_bosque_train.conll -encoding UTF-8 -lang pt

        The idea is that first go the parameters which belong to the training itself, such as model, iterations, etc. Second, you choose the format of your data and specify all that relates to the data: location (might be a database...), encoding, languages, etc... Of course, this may vary from format to format and therefore I list the formats available and their help separately.

        I was thinking about opennlp as a default format, but didn't find an elegant solution, which applies well to all tools. As far as I know, the changes to the default case is
        1) -format opennlp
        2) putting format related parameters after the format (so they are passed in the factory)

        One of the reasons why I did not go for full backward compatibility (although it can be done) and making opennlp format the default one is that I saw a trend toward separating parameters anyway: those iterations and cutoffs being grouped, support for the parameter file... So there are already groups of parameters which are logically different and even validated in different places, therefore having an explicit separator between them will, for examepl, make users aware of these groups and might ease understanding of possible parameters values and error messages.

        ====
        One disadvantage of this feature is, that it is easy to make mistakes which cannot be noticed,
        e.g. encoding issues, problems in the format, etc.
        ===
        Can you elaborate exactly how this is a disadvantage of having direct format support? I agree that conversion might reveal something (see below), but that's the advantage of the conversion, not the disadvantage of having direct format support.

        ====
        That may sound a bit strange, but some bugs in the UIMA
        training integration came only to my attention after I added a feature to dump the training data to a file in the opennlp format.
        ====
        Here I fully understand and agree with you, it is aligned with my own experience. Format conversion itself often reveals some bugs in the data, or in the parser, whether on the source or the target side. When possible, I even did a round-trip conversion for the datasets I have just to discover some special cases in the data.

        ===
        Do you know a cli tool which has a similar problem?
        ===
        I'm sorry, I didn't get you here. Which problem do you refer here to?

        ===
        We need to adjust the possible arguments based on the format the user is choosing.
        ===
        Yes, that's true and the factory for the format checks the parameters, printing an error message. I incorporated here your propagation of validate....Loudly. Did I misunderstand something?

        ===
        Just some idea, but couldn't we do it like this:
        bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...
        ===
        In general, implicit format choice removes the need of 1 extra argument. And although I like it and thought of it, the problem is that sometimes it is not possible to rename the files - they are on the CD-ROM, or read-only network share, or in the database which has a JDBC URL.

        Oops, now I get you on the last point. I left the paragraph above, it might illustrate some of my thoughts.
        ===
        bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...
        ===
        Yes, adding the format to the tool name is an option. It'll change the parser a bit and will introduce two different types of parameters. But if the goal is to maximize the backwards compatibility (which was broken anyway with the renaming of -model-type to -type) then it might be considered.

        Let me know if the patch needs elaboration.

        Also, if you have ideas on how to ease the integration of the formats (checking, validation), to make it easier to work with them - I'm interested. But yes, in general there are converters for debugging or data screening.

        Show
        Aliaksandr Autayeu added a comment - I have tested it with POS tagger and Tokenizer (and with my own formats as well, to avoid conversion, but I didn't include the support of third-party formats in this patch). Here is the command for conllx, which I also used for testing: opennlp POSTaggerTrainer -model pt.bin -format conllx -data portuguese_bosque_train.conll -encoding UTF-8 -lang pt The idea is that first go the parameters which belong to the training itself, such as model, iterations, etc. Second, you choose the format of your data and specify all that relates to the data: location (might be a database...), encoding, languages, etc... Of course, this may vary from format to format and therefore I list the formats available and their help separately. I was thinking about opennlp as a default format, but didn't find an elegant solution, which applies well to all tools. As far as I know, the changes to the default case is 1) -format opennlp 2) putting format related parameters after the format (so they are passed in the factory) One of the reasons why I did not go for full backward compatibility (although it can be done) and making opennlp format the default one is that I saw a trend toward separating parameters anyway: those iterations and cutoffs being grouped, support for the parameter file... So there are already groups of parameters which are logically different and even validated in different places, therefore having an explicit separator between them will, for examepl, make users aware of these groups and might ease understanding of possible parameters values and error messages. ==== One disadvantage of this feature is, that it is easy to make mistakes which cannot be noticed, e.g. encoding issues, problems in the format, etc. === Can you elaborate exactly how this is a disadvantage of having direct format support? I agree that conversion might reveal something (see below), but that's the advantage of the conversion, not the disadvantage of having direct format support. ==== That may sound a bit strange, but some bugs in the UIMA training integration came only to my attention after I added a feature to dump the training data to a file in the opennlp format. ==== Here I fully understand and agree with you, it is aligned with my own experience. Format conversion itself often reveals some bugs in the data, or in the parser, whether on the source or the target side. When possible, I even did a round-trip conversion for the datasets I have just to discover some special cases in the data. === Do you know a cli tool which has a similar problem? === I'm sorry, I didn't get you here. Which problem do you refer here to? === We need to adjust the possible arguments based on the format the user is choosing. === Yes, that's true and the factory for the format checks the parameters, printing an error message. I incorporated here your propagation of validate....Loudly. Did I misunderstand something? === Just some idea, but couldn't we do it like this: bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ... === In general, implicit format choice removes the need of 1 extra argument. And although I like it and thought of it, the problem is that sometimes it is not possible to rename the files - they are on the CD-ROM, or read-only network share, or in the database which has a JDBC URL. Oops, now I get you on the last point. I left the paragraph above, it might illustrate some of my thoughts. === bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ... === Yes, adding the format to the tool name is an option. It'll change the parser a bit and will introduce two different types of parameters. But if the goal is to maximize the backwards compatibility (which was broken anyway with the renaming of -model-type to -type) then it might be considered. Let me know if the patch needs elaboration. Also, if you have ideas on how to ease the integration of the formats (checking, validation), to make it easier to work with them - I'm interested. But yes, in general there are converters for debugging or data screening.
        Hide
        Joern Kottmann added a comment -

        The iteration and cutoff parameter will be removed. They are only there for backward compatibility.

        The notion of -param value does most often not rely on a certain parameter oder, and I believe having this is highly confusing to a user.

        Should we implement the solution where to format is added to the tool name?
        I think that has a couple of advantages
        1. I only get help for parameters I need, and I can still see the help message while typing my command. That is not possible when the help message is a few screens long.
        2. Parameter order does not matter
        3. It works as it did before, default will be opennlp format

        What do you think?

        ===
        In general, implicit format choice removes the need of 1 extra argument. And although I like it and thought of it, the problem is that sometimes it is not possible to rename the files - they are on the CD-ROM, or read-only network share, or in the database which has a JDBC URL.
        ===

        I don't get your point here. Why do we need to rename the files? I just need to provide a path to a file (or a directory).

        Show
        Joern Kottmann added a comment - The iteration and cutoff parameter will be removed. They are only there for backward compatibility. The notion of -param value does most often not rely on a certain parameter oder, and I believe having this is highly confusing to a user. Should we implement the solution where to format is added to the tool name? I think that has a couple of advantages 1. I only get help for parameters I need, and I can still see the help message while typing my command. That is not possible when the help message is a few screens long. 2. Parameter order does not matter 3. It works as it did before, default will be opennlp format What do you think? === In general, implicit format choice removes the need of 1 extra argument. And although I like it and thought of it, the problem is that sometimes it is not possible to rename the files - they are on the CD-ROM, or read-only network share, or in the database which has a JDBC URL. === I don't get your point here. Why do we need to rename the files? I just need to provide a path to a file (or a directory).
        Hide
        Joern Kottmann added a comment -

        Here are my code review comments:

        • Remove author tags, code is owned by the community and not individuals, we list contributors and committers on the team page.
        • Don't rename classes which are part of our public API, breaks backward compatibility (NameSampleDataStream, WordTagSampleStream)
        • Should AbstractCLITool.getHelp return this: "Usage: " + CLI.CMD + " " + getName() ?
        • Why was InvalidFormatException removed from some ModelLoader sub-classes?
        • Documentation should be updated as well
        Show
        Joern Kottmann added a comment - Here are my code review comments: Remove author tags, code is owned by the community and not individuals, we list contributors and committers on the team page. Don't rename classes which are part of our public API, breaks backward compatibility (NameSampleDataStream, WordTagSampleStream) Should AbstractCLITool.getHelp return this: "Usage: " + CLI.CMD + " " + getName() ? Why was InvalidFormatException removed from some ModelLoader sub-classes? Documentation should be updated as well
        Hide
        Aliaksandr Autayeu added a comment -

        ===
        I don't get your point here. Why do we need to rename the files? I just need to provide a path to a file (or a directory).
        ===
        I thought the format was to be specified as file extension. Overlooked that it pertains to the tool name.

        ... format is added to the tool name?
        These are good reasons. Thank you for advice. Done.

        • Remove author tags, code is owned by the community and not individuals, we list contributors and committers on the team page.
          Done.
        • Don't rename classes which are part of our public API, breaks backward compatibility (NameSampleDataStream, WordTagSampleStream)
          Done. They did not follow the pattern, that's why I did that. Having class names follow the pattern might be worth breaking the compatibility.
        • Should AbstractCLITool.getHelp return this: "Usage: " + CLI.CMD + " " + getName() ?
          Done with basicHelp
        • Why was InvalidFormatException removed from some ModelLoader sub-classes?
          IOException is a more general one, already in throws: public class InvalidFormatException extends IOException {
        • Documentation should be updated as well
          Done + JIRAs 215 and 218.

        Some extra notes:

        I have updated documentation a bit to resolve JIRAs 215 and 218 and have updated examples which used deprecated methods to use new ones. I probably should have done this in separate patches, but I've got carried away. Some parameters descriptions have been updated to make the help uniform and coherent. Some inappropriately used <programlisting> have been changed to <screen>.

        In some places tool return codes have been updated to return 0 (no mistakes, OK) when help is printed, and -1 on mistakes (parameters or else)

        In some places I have removed Javadocs where they duplicated those already written on interfaces.

        To a TerminateToolException was added a convention on error codes and an attempt was made to convert some direct print() calls to centralize error processing.

        Show
        Aliaksandr Autayeu added a comment - === I don't get your point here. Why do we need to rename the files? I just need to provide a path to a file (or a directory). === I thought the format was to be specified as file extension. Overlooked that it pertains to the tool name. ... format is added to the tool name? These are good reasons. Thank you for advice. Done. Remove author tags, code is owned by the community and not individuals, we list contributors and committers on the team page. Done. Don't rename classes which are part of our public API, breaks backward compatibility (NameSampleDataStream, WordTagSampleStream) Done. They did not follow the pattern, that's why I did that. Having class names follow the pattern might be worth breaking the compatibility. Should AbstractCLITool.getHelp return this: "Usage: " + CLI.CMD + " " + getName() ? Done with basicHelp Why was InvalidFormatException removed from some ModelLoader sub-classes? IOException is a more general one, already in throws: public class InvalidFormatException extends IOException { Documentation should be updated as well Done + JIRAs 215 and 218. Some extra notes: I have updated documentation a bit to resolve JIRAs 215 and 218 and have updated examples which used deprecated methods to use new ones. I probably should have done this in separate patches, but I've got carried away. Some parameters descriptions have been updated to make the help uniform and coherent. Some inappropriately used <programlisting> have been changed to <screen>. In some places tool return codes have been updated to return 0 (no mistakes, OK) when help is printed, and -1 on mistakes (parameters or else) In some places I have removed Javadocs where they duplicated those already written on interfaces. To a TerminateToolException was added a convention on error codes and an attempt was made to convert some direct print() calls to centralize error processing.
        Hide
        Aliaksandr Autayeu added a comment -

        I have attached an updated patch. Please, review and integrate.

        Show
        Aliaksandr Autayeu added a comment - I have attached an updated patch. Please, review and integrate.
        Hide
        Joern Kottmann added a comment -

        Sorry, I am late with my review. Looks great. I got a bit confused with all these AbstractXXX classes and interfaces. We eventually should simplify that a bit.

        I will do a bit more testing before I commit it.

        Show
        Joern Kottmann added a comment - Sorry, I am late with my review. Looks great. I got a bit confused with all these AbstractXXX classes and interfaces. We eventually should simplify that a bit. I will do a bit more testing before I commit it.
        Hide
        Joern Kottmann added a comment -

        I am proposing the following change to simplify these Abstract classes and interfaces.
        We make one abstract CmdLineTool class which can get one parameter. The parameter could be used for anything, and must not always be a format.

        abstract class CmdLineTool {
        void getName()

        {..}

        //default implementation which uses the class name to get the name
        String getShortDescription()

        {return "";}

        String[] supportedFormats(); // returns an empty array when only default format is supported
        String getHelp(String format)

        {...}

        // returns default string with tool name, but zero parameters
        abstract void run(String format, String args[]); // format is null when not specified
        }

        Util methods from AbstractCLITool could be moved into a util class and imported via static import.

        What do you think?

        I committed a few things, in case you make changes, please update your branch.

        Show
        Joern Kottmann added a comment - I am proposing the following change to simplify these Abstract classes and interfaces. We make one abstract CmdLineTool class which can get one parameter. The parameter could be used for anything, and must not always be a format. abstract class CmdLineTool { void getName() {..} //default implementation which uses the class name to get the name String getShortDescription() {return "";} String[] supportedFormats(); // returns an empty array when only default format is supported String getHelp(String format) {...} // returns default string with tool name, but zero parameters abstract void run(String format, String args[]); // format is null when not specified } Util methods from AbstractCLITool could be moved into a util class and imported via static import. What do you think? I committed a few things, in case you make changes, please update your branch.
        Hide
        Joern Kottmann added a comment -

        It never good to mix jiras, it is much easier for everyone to participate if changes are done per jira only. Changes and refactoring which don't fit into the current jira should be done under a new jira (e.g. extending the intro section of the documentation, doing typo fixing and other changes).

        I did split up the documentation changes and committed them under the actual jiras as good as possible.

        Show
        Joern Kottmann added a comment - It never good to mix jiras, it is much easier for everyone to participate if changes are done per jira only. Changes and refactoring which don't fit into the current jira should be done under a new jira (e.g. extending the intro section of the documentation, doing typo fixing and other changes). I did split up the documentation changes and committed them under the actual jiras as good as possible.
        Hide
        Aliaksandr Autayeu added a comment -

        I agree about not mixing JIRAs, in fact, I wrote about it in the first place. Why didn't I follow what I preach? Because separating the changes into commits is can be difficult if you're in the middle of a refactoring. And separating them into 4-5 patches is tougher still, because of the submit-wait-update-integrate-format-new-patch loop.

        Show
        Aliaksandr Autayeu added a comment - I agree about not mixing JIRAs, in fact, I wrote about it in the first place. Why didn't I follow what I preach? Because separating the changes into commits is can be difficult if you're in the middle of a refactoring. And separating them into 4-5 patches is tougher still, because of the submit-wait-update-integrate-format-new-patch loop.
        Hide
        Aliaksandr Autayeu added a comment -

        About AbstractXXX classes. I have reviewed the hierarchy now, once again and see some space for improvement. I agree it should be refactored once more.

        Current hierarchy of AbstractXXX classes is made to avoid code duplication. It is not very consistent in naming because I can't rename existing classes (I've just been asked to revert one such proposal). Now I have reviewed it again and see some possible improvements, but simplifying it to one class and interface might make it easier to understand, but will make other things messy.

        For example, I like the idea making format support more explicit in the interface (supportedFormats, getHelp, etc above), but I don't like the idea of mixing everything into one place. Things like " The parameter could be used for anything, and must not always be a format. " is problematic, because it hides the semantics of the parameter. If format parameter is not a format anymore, it should be called differently, or made anonymous (e.g. args[]) otherwise it is misleading.

        Current class hierarchy (attached diagram) separates the tools which does not need anything (CmdLineTool), the tools which work with samples of some kind and recognize formats (AbstactTypedTool) and tools which have parameters formalized (AbstractCLITool<P>). Then there are more specific tools: trainers, cross validators, evaluators, etc.

        I thought a bit and spaces for improvement lie in adding some methods from your comment to interfaces, in improving this place: BaseCLITool extends AbstractCLITool<Class>, and hierarchy just a bit (AbstractCLITool<P> and CmdLineTool is kind of parallel).

        Show
        Aliaksandr Autayeu added a comment - About AbstractXXX classes. I have reviewed the hierarchy now, once again and see some space for improvement. I agree it should be refactored once more. Current hierarchy of AbstractXXX classes is made to avoid code duplication. It is not very consistent in naming because I can't rename existing classes (I've just been asked to revert one such proposal). Now I have reviewed it again and see some possible improvements, but simplifying it to one class and interface might make it easier to understand, but will make other things messy. For example, I like the idea making format support more explicit in the interface (supportedFormats, getHelp, etc above), but I don't like the idea of mixing everything into one place. Things like " The parameter could be used for anything, and must not always be a format. " is problematic, because it hides the semantics of the parameter. If format parameter is not a format anymore, it should be called differently, or made anonymous (e.g. args[]) otherwise it is misleading. Current class hierarchy (attached diagram) separates the tools which does not need anything (CmdLineTool), the tools which work with samples of some kind and recognize formats (AbstactTypedTool) and tools which have parameters formalized (AbstractCLITool<P>). Then there are more specific tools: trainers, cross validators, evaluators, etc. I thought a bit and spaces for improvement lie in adding some methods from your comment to interfaces, in improving this place: BaseCLITool extends AbstractCLITool<Class>, and hierarchy just a bit (AbstractCLITool<P> and CmdLineTool is kind of parallel).
        Hide
        Joern Kottmann added a comment -

        To clarify the class renaming statement, we cannot rename classes which are part of our public API. All classes which are located inside the cmdline package are not public API.
        So we can replaces interfaces with abstract classes, rename things, etc.

        What do you think about my proposed CmdLineTool as the top class? AbstractCLITool could be a util class and everything could be imported via static imports.

        I suggest that we commit now what is there and we then do additional changes to the AbstractXXX classes in a new patch.

        Show
        Joern Kottmann added a comment - To clarify the class renaming statement, we cannot rename classes which are part of our public API. All classes which are located inside the cmdline package are not public API. So we can replaces interfaces with abstract classes, rename things, etc. What do you think about my proposed CmdLineTool as the top class? AbstractCLITool could be a util class and everything could be imported via static imports. I suggest that we commit now what is there and we then do additional changes to the AbstractXXX classes in a new patch.
        Hide
        Joern Kottmann added a comment -

        Patch is now applied completely.

        Show
        Joern Kottmann added a comment - Patch is now applied completely.
        Hide
        Joern Kottmann added a comment -

        Anyway, can you please have a look at the other jiras you fixed and close them if they are done? I assigned them all to you.

        Show
        Joern Kottmann added a comment - Anyway, can you please have a look at the other jiras you fixed and close them if they are done? I assigned them all to you.
        Hide
        Aliaksandr Autayeu added a comment -

        OK, I'll make another proposal for the class hierarchy in CLI package, taking into account current discussion. I'll see other tickets as well.

        Show
        Aliaksandr Autayeu added a comment - OK, I'll make another proposal for the class hierarchy in CLI package, taking into account current discussion. I'll see other tickets as well.
        Hide
        Aliaksandr Autayeu added a comment -

        I have improved class hierarchy and naming here. Now, the name follow the convention, interface and class hierarchy are parallel (aligned) and there is more logic in the package class structure. As a confirmation that this is the right way, some <Class> tricks were eliminated, which is a bonus

        So, now:

        CmdLineTool is the root interface, with two children: BasicCmdLineTool (for simple tools) and TypedCmdLineTool (for tools with formats and extra params like evaluators and trainers). There is a bunch of "basement" implementations, they all start with Abstract. AbstractCmdLineTool is the root one, with two children, AbstractBasicCmdLineTool (base class for simple tools) and AbstractTypedTool (for those with formats). AbstractTypedTool further specializes into AbstractTypedParamTool, which provides extra parameter support for its children, trainers and evaluators: AbstractTrainerTool and AbstractEvaluatorTool. And then there is the rest of the forest.

        See attached opennlp-cmdline-package-class-structure.png for more details.

        I believe that after feedback on the last commit and testing, the issue now can be finally closed Guys, please, go ahead and give me some feedback. Joern, please, give it a look, you reviewed original refactoring, so might still remember the code. Thank you!

        Show
        Aliaksandr Autayeu added a comment - I have improved class hierarchy and naming here. Now, the name follow the convention, interface and class hierarchy are parallel (aligned) and there is more logic in the package class structure. As a confirmation that this is the right way, some <Class> tricks were eliminated, which is a bonus So, now: CmdLineTool is the root interface, with two children: BasicCmdLineTool (for simple tools) and TypedCmdLineTool (for tools with formats and extra params like evaluators and trainers). There is a bunch of "basement" implementations, they all start with Abstract. AbstractCmdLineTool is the root one, with two children, AbstractBasicCmdLineTool (base class for simple tools) and AbstractTypedTool (for those with formats). AbstractTypedTool further specializes into AbstractTypedParamTool, which provides extra parameter support for its children, trainers and evaluators: AbstractTrainerTool and AbstractEvaluatorTool. And then there is the rest of the forest. See attached opennlp-cmdline-package-class-structure.png for more details. I believe that after feedback on the last commit and testing, the issue now can be finally closed Guys, please, go ahead and give me some feedback. Joern, please, give it a look, you reviewed original refactoring, so might still remember the code. Thank you!
        Hide
        Joern Kottmann added a comment -

        Can this issue be closed?

        Show
        Joern Kottmann added a comment - Can this issue be closed?
        Hide
        Aliaksandr Autayeu added a comment -

        I was actually waiting for you to give it a final approving look

        Show
        Aliaksandr Autayeu added a comment - I was actually waiting for you to give it a final approving look
        Hide
        Joern Kottmann added a comment -

        Sorry it took me a while. I think it is still kind of complicated, because there are so many abstract super classes and interfaces.

        What do you think about replacing the CmdLineTool with AbstractCmdLineTool. We don't really need
        the CmdLineTool as interface, it could be an abstract class.

        Then we do the same with BasicCmdLineTool and TypedCmdLineTool.
        And it would be good if we can make it more clear what the difference between the two is based on the name.
        The javadoc comment for TypedCmdLineTool is clear, but the class name doesn't say anything about that.

        What do you think?

        Show
        Joern Kottmann added a comment - Sorry it took me a while. I think it is still kind of complicated, because there are so many abstract super classes and interfaces. What do you think about replacing the CmdLineTool with AbstractCmdLineTool. We don't really need the CmdLineTool as interface, it could be an abstract class. Then we do the same with BasicCmdLineTool and TypedCmdLineTool. And it would be good if we can make it more clear what the difference between the two is based on the name. The javadoc comment for TypedCmdLineTool is clear, but the class name doesn't say anything about that. What do you think?
        Hide
        Joern Kottmann added a comment -

        We should have a look at this issue again before we release 1.5.3.

        My two main concerns are these:

        • Complicated structure with all these abstract super classes and interfaces (could easily be changed after 1.5.3)
        • The language parameter is mandatory in places where it was not used before, and does not have any meaning, e.g. while doing evaluation with sample date, there the language from the model is used.
        Show
        Joern Kottmann added a comment - We should have a look at this issue again before we release 1.5.3. My two main concerns are these: Complicated structure with all these abstract super classes and interfaces (could easily be changed after 1.5.3) The language parameter is mandatory in places where it was not used before, and does not have any meaning, e.g. while doing evaluation with sample date, there the language from the model is used.

          People

          • Assignee:
            Joern Kottmann
            Reporter:
            Aliaksandr Autayeu
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development