OpenNLP
  1. OpenNLP
  2. OPENNLP-193

Update POS Tagger cmd line trainer tool to use new xml tag dict format

    Details

      Description

      The POS Tagger trainer cmd line tool uses still the old tag dict format for backward compatibility reasons. The format was replaced by a new xml based dictionary.

      Update the POS Tagger trainer tool to only use the new xml based dictionary format.

        Activity

        Hide
        William Colen added a comment -

        Please check my revision #1148042 (OPENNLP-227). I changed the code related to this issue.
        Maybe I broke compatibility while removing a deprecated method warning.

        Show
        William Colen added a comment - Please check my revision #1148042 ( OPENNLP-227 ). I changed the code related to this issue. Maybe I broke compatibility while removing a deprecated method warning.
        Hide
        Joern Kottmann added a comment -

        Checked, this revision looks fine. As far as I can see it only changed code within the cmdline package, which is not public API, so any changes are fine there.

        Sorry, I could not see the deprecation mentioned in your comment above. Maybe I we have a misunderstanding here?

        Show
        Joern Kottmann added a comment - Checked, this revision looks fine. As far as I can see it only changed code within the cmdline package, which is not public API, so any changes are fine there. Sorry, I could not see the deprecation mentioned in your comment above. Maybe I we have a misunderstanding here?
        Hide
        William Colen added a comment -

        Sorry, it wasn't a deprecated method, but a deprecated constructor. Before it was using the constructor POSDictionary(String file) that is marked as deprecated. Now it is using the create method.

        Show
        William Colen added a comment - Sorry, it wasn't a deprecated method, but a deprecated constructor. Before it was using the constructor POSDictionary(String file) that is marked as deprecated. Now it is using the create method.
        Hide
        Joern Kottmann added a comment -

        Ok, now I also understand why you added the comment to this issue. That something we should discuss on the mailing list.

        Can we do such changes in a revision release?

        If the answer is yes, we should think about doing more changes, e.g. remove iterations and cutoff params, etc.

        Show
        Joern Kottmann added a comment - Ok, now I also understand why you added the comment to this issue. That something we should discuss on the mailing list. Can we do such changes in a revision release? If the answer is yes, we should think about doing more changes, e.g. remove iterations and cutoff params, etc.
        Hide
        William Colen added a comment -

        I don't know if we can do it, we should discuss. It wasn't on purpose. I was reviewing the command line open issues and found this one.

        Show
        William Colen added a comment - I don't know if we can do it, we should discuss. It wasn't on purpose. I was reviewing the command line open issues and found this one.
        Hide
        William Colen added a comment -

        I thought a little about it and, in my opinion, we should not remove cutoff and iteration parameters. Also I should undo this change and let the dictionary be loaded the way it was.
        It will make things easier for users now, and we can not change the API anyway, so I think it won't make thinks much easier for us only changing the command line tools. It will simplify a lot latter in 1.6.0 that we will be able to remove deprecated methods.

        Show
        William Colen added a comment - I thought a little about it and, in my opinion, we should not remove cutoff and iteration parameters. Also I should undo this change and let the dictionary be loaded the way it was. It will make things easier for users now, and we can not change the API anyway, so I think it won't make thinks much easier for us only changing the command line tools. It will simplify a lot latter in 1.6.0 that we will be able to remove deprecated methods.
        Hide
        Joern Kottmann added a comment -

        Why do you think the cutoff and iterations parameters should not be removed? We replaced them with this properties file which can also do many more settings needed for the training.

        Show
        Joern Kottmann added a comment - Why do you think the cutoff and iterations parameters should not be removed? We replaced them with this properties file which can also do many more settings needed for the training.
        Hide
        William Colen added a comment -

        I mean we should not remove it for 1.5.2, only in 1.6 (in my opinion)
        It will not remove a lot of code anyway because we can not remove methods in a revision release. But maybe we don't need to apply the revision rule to the command line interface and we should go ahead and remove cutoff and iterations now.

        Show
        William Colen added a comment - I mean we should not remove it for 1.5.2, only in 1.6 (in my opinion) It will not remove a lot of code anyway because we can not remove methods in a revision release. But maybe we don't need to apply the revision rule to the command line interface and we should go ahead and remove cutoff and iterations now.
        Hide
        Joern Kottmann added a comment -

        Ok, yeah. We then should mark it as deprecated.

        Show
        Joern Kottmann added a comment - Ok, yeah. We then should mark it as deprecated.
        Hide
        Joern Kottmann added a comment -

        William, was is your opinion. Should we make the xml dictionary mandatory, or should we stick to the old format?

        The XML dictionary has the advantage that there can't be any encoding issues.
        If we stick to the old format, we could specify that the dictionary has to be UTF-8.

        Show
        Joern Kottmann added a comment - William, was is your opinion. Should we make the xml dictionary mandatory, or should we stick to the old format? The XML dictionary has the advantage that there can't be any encoding issues. If we stick to the old format, we could specify that the dictionary has to be UTF-8.
        Hide
        William Colen added a comment -

        I think we should remove it. Now we have ways to use any TagDictionary implementation, and I believe this old format is not compatible. In fact I couldn't even find the part of the code that deals with the old dictionary format.

        Show
        William Colen added a comment - I think we should remove it. Now we have ways to use any TagDictionary implementation, and I believe this old format is not compatible. In fact I couldn't even find the part of the code that deals with the old dictionary format.
        Hide
        Joern Kottmann added a comment -

        Looks like the change was already done. The Pos Tagger expects as default the xml based dictionary.

        I think we can close this issue.

        Show
        Joern Kottmann added a comment - Looks like the change was already done. The Pos Tagger expects as default the xml based dictionary. I think we can close this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Joern Kottmann
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development