OpenNLP
  1. OpenNLP
  2. OPENNLP-239

Case Sensitivie Flag & Custom Tag Dictionary

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: tools-1.5.1-incubating
    • Fix Version/s: tools-1.5.2-incubating
    • Component/s: Parser
    • Labels:
      None

      Description

      Unable to set case sensitive flag as per TreebankParser 1.3.1 or use a custom tag dictionary

        Activity

        Hide
        James Kosin added a comment -

        Yes, that along with why POSDictionary is needed. I didn't see much that was different from the current Dictionary.

        Show
        James Kosin added a comment - Yes, that along with why POSDictionary is needed. I didn't see much that was different from the current Dictionary.
        Hide
        Joern Kottmann added a comment -

        I fixed the POSDictionary, or did I miss something? Can you please review the changes I made there?

        You pointed out an issue in the Dictionary method which simply returns a string set, I believe that still needs fixing, right?

        Show
        Joern Kottmann added a comment - I fixed the POSDictionary, or did I miss something? Can you please review the changes I made there? You pointed out an issue in the Dictionary method which simply returns a string set, I believe that still needs fixing, right?
        Hide
        James Kosin added a comment -

        Jorn,

        Will close this one. We do need to open a new discussion on the purpose of the POSDictionary and its usage of the caseSensitivity flag. And if needed JIRA issues created to re-work that code.

        Also, I'm not sure if the changes for saving the case sensitivity flag actaully fix the problem/issue here or that I've only addressed one aspect of the issue.

        James

        Show
        James Kosin added a comment - Jorn, Will close this one. We do need to open a new discussion on the purpose of the POSDictionary and its usage of the caseSensitivity flag. And if needed JIRA issues created to re-work that code. Also, I'm not sure if the changes for saving the case sensitivity flag actaully fix the problem/issue here or that I've only addressed one aspect of the issue. James
        Hide
        Joern Kottmann added a comment -

        James, are we done here? Can the issue be closed?

        Show
        Joern Kottmann added a comment - James, are we done here? Can the issue be closed?
        Hide
        James Kosin added a comment -

        I've gone over everything for the Dictionary class now. The tests are all working.
        The tests for case sensitive and insensitive dictionaries hints at a few areas we may want to work on. But, I'll defer them for other JIRA issues after discussing them on the development list.

        The POS Dictionary needs a lot of work. I'll open a new issue for that with details. Mostly because the POSDictionary class tries to implement its own case sensitivity flag for other reasons... or maybe the same reasons. I'll post that as a bigger re-write for a later release.

        Can we close this issue?

        Show
        James Kosin added a comment - I've gone over everything for the Dictionary class now. The tests are all working. The tests for case sensitive and insensitive dictionaries hints at a few areas we may want to work on. But, I'll defer them for other JIRA issues after discussing them on the development list. The POS Dictionary needs a lot of work. I'll open a new issue for that with details. Mostly because the POSDictionary class tries to implement its own case sensitivity flag for other reasons... or maybe the same reasons. I'll post that as a bigger re-write for a later release. Can we close this issue?
        Hide
        James Kosin added a comment - - edited

        Jorn,

        Sorry, I've been busy... The default was because I didn't know the
        default state for many of the models. Most of the time it is based on
        how they are created. I can fix that easily; so that it gets set to
        true if not present.

        I found a way to implement without being static.

        I'm also not familiar with the UMIA branch of OpenNLP or what the StringDictionary's requirements are... it appears to not be using any case sensitive flags; so... I wrote a default of true (to keep the flag the default).

        James

        Show
        James Kosin added a comment - - edited Jorn, Sorry, I've been busy... The default was because I didn't know the default state for many of the models. Most of the time it is based on how they are created. I can fix that easily; so that it gets set to true if not present. I found a way to implement without being static. I'm also not familiar with the UMIA branch of OpenNLP or what the StringDictionary's requirements are... it appears to not be using any case sensitive flags; so... I wrote a default of true (to keep the flag the default). James
        Hide
        Joern Kottmann added a comment -

        One more comment, the case sensitive flag in DictionarySerializer must not be static, because this causes a thread safety issue.
        We need to find another way to implement this.

        Show
        Joern Kottmann added a comment - One more comment, the case sensitive flag in DictionarySerializer must not be static, because this causes a thread safety issue. We need to find another way to implement this.
        Hide
        Joern Kottmann added a comment -

        There is also a compile error in the uima integration code, there is a class which tries to create a dictionary, but the method now requires also a boolean flag. So obviously that was not a backward compatible change.

        Can we do that in the dictionary serializer code? I am not sure, lets speak about it on the ML.
        Would be nice to get a few opinons about this and then decide based on that.

        Show
        Joern Kottmann added a comment - There is also a compile error in the uima integration code, there is a class which tries to create a dictionary, but the method now requires also a boolean flag. So obviously that was not a backward compatible change. Can we do that in the dictionary serializer code? I am not sure, lets speak about it on the ML. Would be nice to get a few opinons about this and then decide based on that.
        Hide
        Joern Kottmann added a comment -

        James, it looks like you flipped the default, right? Before we had this flag, the dictionary was always case sensitive after it was deserialized from a model, now it is case insensitive after it is deserialized from a model. It is not a big deal, but it can be very confusing in certain cases. I suggest that the default is case sensitive if the attribute in the xml does not exist.

        What do you think?

        Show
        Joern Kottmann added a comment - James, it looks like you flipped the default, right? Before we had this flag, the dictionary was always case sensitive after it was deserialized from a model, now it is case insensitive after it is deserialized from a model. It is not a big deal, but it can be very confusing in certain cases. I suggest that the default is case sensitive if the attribute in the xml does not exist. What do you think?
        Hide
        Joern Kottmann added a comment -

        POSDictionary is doing based on the flag a String.toLowerCase(). The current implementation works correctly. To speed up things we would need to create another wrapper and use a hashCode method which also ignores the case.

        Show
        Joern Kottmann added a comment - POSDictionary is doing based on the flag a String.toLowerCase(). The current implementation works correctly. To speed up things we would need to create another wrapper and use a hashCode method which also ignores the case.
        Hide
        James Kosin added a comment -

        Okay, brushed up on hashing.... and don't see any problems with the current implementation.
        I'll need to get the POSDictionary a wrapper though for the String compares to be case sensitive/insensitive as set by the flag.

        Show
        James Kosin added a comment - Okay, brushed up on hashing.... and don't see any problems with the current implementation. I'll need to get the POSDictionary a wrapper though for the String compares to be case sensitive/insensitive as set by the flag.
        Hide
        James Kosin added a comment -

        Jorn,
        The problem is the POSDictionary isn't like the normal Dictionary so; it doesn't use the StringWrapper. The POSDictionary also uses a HashTable... So the case flag is just there; however, even if it is set doesn't do anything. What we need to do is decide if we want to use the normal Dictionary class or create another wrapper class around the strings to get the correct compares. That aside, I'm not sure how the hash() function comes into play; but, we may have problems with the Dictionary class already, because the hash() is always with lower-case letters regardless of the case sensitivity setting. This maybe wrong if Java uses the hash() to determine if two items are identical.
        I have to dig a bit more into this to determine if we need any more changes.

        Show
        James Kosin added a comment - Jorn, The problem is the POSDictionary isn't like the normal Dictionary so; it doesn't use the StringWrapper. The POSDictionary also uses a HashTable... So the case flag is just there; however, even if it is set doesn't do anything. What we need to do is decide if we want to use the normal Dictionary class or create another wrapper class around the strings to get the correct compares. That aside, I'm not sure how the hash() function comes into play; but, we may have problems with the Dictionary class already, because the hash() is always with lower-case letters regardless of the case sensitivity setting. This maybe wrong if Java uses the hash() to determine if two items are identical. I have to dig a bit more into this to determine if we need any more changes.
        Hide
        Joern Kottmann added a comment -

        The POSDictionary.create method needs to set the case sensitive flag. The code looks like ti will work when this is fixed. I already changed that in my local copy, should I check in this change James?

        The comments said it does not work, is there any other reason for that?

        Show
        Joern Kottmann added a comment - The POSDictionary.create method needs to set the case sensitive flag. The code looks like ti will work when this is fixed. I already changed that in my local copy, should I check in this change James? The comments said it does not work, is there any other reason for that?
        Hide
        Joern Kottmann added a comment -

        Inside DictionarySerializer you define a string constant ATTRIBUTE_CASE. It would be more expressive if it is named ATTRIBUTE_CASE_SENSITIVE, and the same goes for the value. "Is it case sensitive?" makes more sense than "Is it case?".

        Show
        Joern Kottmann added a comment - Inside DictionarySerializer you define a string constant ATTRIBUTE_CASE. It would be more expressive if it is named ATTRIBUTE_CASE_SENSITIVE, and the same goes for the value. "Is it case sensitive?" makes more sense than "Is it case?".
        Hide
        James Kosin added a comment -

        Okay, the code is saving the case sensitivity now. To support the older dictionaries, the code will treat any older dictionary as a 'false' for the case sensitivity.
        The original poster seems to have a problem with the POSDictionary I believe. So, I'm off to fix that.

        Show
        James Kosin added a comment - Okay, the code is saving the case sensitivity now. To support the older dictionaries, the code will treat any older dictionary as a 'false' for the case sensitivity. The original poster seems to have a problem with the POSDictionary I believe. So, I'm off to fix that.
        Hide
        James Kosin added a comment -

        I also fixed a bug. No effect to the code; because the flag wasn't being used outside of setting true/false.
        There was an
        <code>
        if () {
        } else if () {
        } else if() {
        }
        </code>
        that was testing the same token string in the first and last if () cases. Of course, the last one would NEVER get triggered; so, the flag would never get set false... once set.

        Show
        James Kosin added a comment - I also fixed a bug. No effect to the code; because the flag wasn't being used outside of setting true/false. There was an <code> if () { } else if () { } else if() { } </code> that was testing the same token string in the first and last if () cases. Of course, the last one would NEVER get triggered; so, the flag would never get set false... once set.
        Hide
        James Kosin added a comment -

        I've made changes; however, I have to review 3 other classes that use the DictionarySerializer to serialize other Dictionary Types.

        NGramModel.java – looks like this is really just serializing the data. No refrences to caseSensitivity.
        POSDictionary.java – uses caseSensitivity; however, comments say it doesn't work and is always set to true.
        DetokenizationDictionary.java – doesn't seem to use and is just serializing.

        Please REVIEW these classes as well and make any needed changes. I'm going to keep digging for answers.

        Show
        James Kosin added a comment - I've made changes; however, I have to review 3 other classes that use the DictionarySerializer to serialize other Dictionary Types. NGramModel.java – looks like this is really just serializing the data. No refrences to caseSensitivity. POSDictionary.java – uses caseSensitivity; however, comments say it doesn't work and is always set to true. DetokenizationDictionary.java – doesn't seem to use and is just serializing. Please REVIEW these classes as well and make any needed changes. I'm going to keep digging for answers.
        Hide
        James Kosin added a comment -

        I'll do the changes, unless you already have. I'm not familiar on how to roll back the changes; but, they are very easy to do.

        Show
        James Kosin added a comment - I'll do the changes, unless you already have. I'm not familiar on how to roll back the changes; but, they are very easy to do.
        Hide
        James Kosin added a comment -

        I'm fine either way. If we remove the case; do we have to review the methods that use the dictionary to verify the proper usages?

        James

        Show
        James Kosin added a comment - I'm fine either way. If we remove the case; do we have to review the methods that use the dictionary to verify the proper usages? James
        Hide
        Joern Kottmann added a comment -

        The new put method should be removed again, since it makes it unclear how to handle the case flag. This add the semantics of a case flag per entry to the Dictionary, which we did not have previously.

        What do you think James? Should we go back to the revision at which you started, and then simply the code to only rely on one final case flag?

        Show
        Joern Kottmann added a comment - The new put method should be removed again, since it makes it unclear how to handle the case flag. This add the semantics of a case flag per entry to the Dictionary, which we did not have previously. What do you think James? Should we go back to the revision at which you started, and then simply the code to only rely on one final case flag?
        Hide
        Joern Kottmann added a comment -

        Just looked a little at the code, the StringListWrapper.equals is then also a bit simpler, because it only needs to use one flag to determine if the compare is case sensitive or not.

        Show
        Joern Kottmann added a comment - Just looked a little at the code, the StringListWrapper.equals is then also a bit simpler, because it only needs to use one flag to determine if the compare is case sensitive or not.
        Hide
        Joern Kottmann added a comment - - edited

        I understand the comparison issue and that is why we need to write the case flag at the begin of the dictionary. The Dictionary class never supported switching of the case after it was created. So just having one final flag per dictionary is fine.

        StringListWarpper will be made non-static, and the case sensitive flag final. We need to change the serialization API a little to support to read and store the case flag. And then our issues are solved.

        Since Dictionary should be immutable we should think about making the put method deprecated.

        Show
        Joern Kottmann added a comment - - edited I understand the comparison issue and that is why we need to write the case flag at the begin of the dictionary. The Dictionary class never supported switching of the case after it was created. So just having one final flag per dictionary is fine. StringListWarpper will be made non-static, and the case sensitive flag final. We need to change the serialization API a little to support to read and store the case flag. And then our issues are solved. Since Dictionary should be immutable we should think about making the put method deprecated.
        Hide
        James Kosin added a comment -

        Maybe... the issue is also with comparisons. The Dictionary class takes a flag for the case sensitivity and the equals() method is adjusted based on this flag for all the classes and sub-classes that are returned from the Dictionary to allow the flag to persist. It is one reason why it would be difficult to remove the duplicate isCaseSensitive flag from the StringListWrapper class.

        If the dictionary were created with the flag to true, then all entries would use a case sensitive comparison. The down side is that some of the tools allow us to modify the case sensitivity after dictionary creation in the sense of chaning the meaning (when I say creation I mean when it was created from the original data and not when the user is using the dictionary).

        The original code however would always use the dictionary flag when created to determine the case comparison rules (when the user created the dictionary from the XML stream). This looses the original intent I believe of the flag and the dictionary may not work as intended.

        The changes I've done so far allows us to change the dictionary flag (to the users preference), if allowed, and allow the dictionary to be used to compare two different strings with the appropriate case sensitive or insensitive test.
        Unmutable also means the DictionarySerializer would need to return the Dictionary class created from the input stream.... which I was trying to avoid if possible.

        Show
        James Kosin added a comment - Maybe... the issue is also with comparisons. The Dictionary class takes a flag for the case sensitivity and the equals() method is adjusted based on this flag for all the classes and sub-classes that are returned from the Dictionary to allow the flag to persist. It is one reason why it would be difficult to remove the duplicate isCaseSensitive flag from the StringListWrapper class. If the dictionary were created with the flag to true, then all entries would use a case sensitive comparison. The down side is that some of the tools allow us to modify the case sensitivity after dictionary creation in the sense of chaning the meaning (when I say creation I mean when it was created from the original data and not when the user is using the dictionary). The original code however would always use the dictionary flag when created to determine the case comparison rules (when the user created the dictionary from the XML stream). This looses the original intent I believe of the flag and the dictionary may not work as intended. The changes I've done so far allows us to change the dictionary flag (to the users preference), if allowed, and allow the dictionary to be used to compare two different strings with the appropriate case sensitive or insensitive test. Unmutable also means the DictionarySerializer would need to return the Dictionary class created from the input stream.... which I was trying to avoid if possible.
        Hide
        Joern Kottmann added a comment -

        I believe it was a mistake to make the Dictionary class not immutable. The case sensitive flag should be final, and it should not be possible to change it. This way I think most of the problems explained above go away, right?

        Show
        Joern Kottmann added a comment - I believe it was a mistake to make the Dictionary class not immutable. The case sensitive flag should be final, and it should not be possible to change it. This way I think most of the problems explained above go away, right?
        Hide
        James Kosin added a comment -

        Okay,

        Here are some of the issues:
        (a) I didn't add a check for duplicates, it was only a statement of fact that there is only one. But, it does matter when building the dictionary that if case sensitive or not that a dictionary that is not case sensitive may want to keep the duplicates out, especially if when building the dictionary they are just adding them as they are originally done which may cause issues later.

        (b) I tried reviewing a method to remove the isCaseSensitive entry from the StringListWrapper... unfortunately, we have a few small problems: (1) the StringListWrapper is a static class (most likely to increase performance), this means the StringListWrapper doesn't have access to the Dictionary caseSensitive flag to follow the entries. (2) even if I change that, then a user will be allowed to change a Dictionary that was created to be case insensitive to a case sensitive type which will most likely produce undesireable results [what may be happening now].

        The current implementation, allows a bit more flexibility:
        Dictionary Entry Other Entry Result
        -------------- -------- ---------------- -----------
        true true true a case sensitive comparison is done with equals()
        false true false a case insensitive comparison is done with equals()
        xxxx false xxxx a case insensitive comparison is done with equals()

        This can be changed later if needed; but in English. If either one is case insensitive when doing a compare then the compare is done with case insensitive comparison. ie: 'Hello' and 'hello' are always the same equals(),

        I kept them separate for now in that we could create a Dictionary class later that allows a mixed comparison with other strings... ie: If we had a string we know always to be capitalized and case sensitive like 'ASCII' we could put the entry in a Dictionary that also contains entries that have case insensitivity, with other words containign false for the case attribute.

        The bad part about putting the flag only in the Dictionary flag is that then we only have one flag, and we loose the original Dictionary meaning and then having the flag also comes into question for building the Dictionary.

        Sorry, I don't really mean to sound ugly... I'm just trying to get some discussion on this and what we could do about the problem. I'm okay with moving the attribute to the dictionary attributes section instead of the entry attributes section. Just trying to keep options open; since, it doesn't look like moving the flag is easily done without causing more issues.

        Show
        James Kosin added a comment - Okay, Here are some of the issues: (a) I didn't add a check for duplicates, it was only a statement of fact that there is only one. But, it does matter when building the dictionary that if case sensitive or not that a dictionary that is not case sensitive may want to keep the duplicates out, especially if when building the dictionary they are just adding them as they are originally done which may cause issues later. (b) I tried reviewing a method to remove the isCaseSensitive entry from the StringListWrapper... unfortunately, we have a few small problems: (1) the StringListWrapper is a static class (most likely to increase performance), this means the StringListWrapper doesn't have access to the Dictionary caseSensitive flag to follow the entries. (2) even if I change that, then a user will be allowed to change a Dictionary that was created to be case insensitive to a case sensitive type which will most likely produce undesireable results [what may be happening now] . The current implementation, allows a bit more flexibility: Dictionary Entry Other Entry Result -------------- -------- ---------------- ----------- true true true a case sensitive comparison is done with equals() false true false a case insensitive comparison is done with equals() xxxx false xxxx a case insensitive comparison is done with equals() This can be changed later if needed; but in English. If either one is case insensitive when doing a compare then the compare is done with case insensitive comparison. ie: 'Hello' and 'hello' are always the same equals(), I kept them separate for now in that we could create a Dictionary class later that allows a mixed comparison with other strings... ie: If we had a string we know always to be capitalized and case sensitive like 'ASCII' we could put the entry in a Dictionary that also contains entries that have case insensitivity, with other words containign false for the case attribute. The bad part about putting the flag only in the Dictionary flag is that then we only have one flag, and we loose the original Dictionary meaning and then having the flag also comes into question for building the Dictionary. Sorry, I don't really mean to sound ugly... I'm just trying to get some discussion on this and what we could do about the problem. I'm okay with moving the attribute to the dictionary attributes section instead of the entry attributes section. Just trying to keep options open; since, it doesn't look like moving the flag is easily done without causing more issues.
        Hide
        Joern Kottmann added a comment -

        We should not add the flag per entry, it should be only once at the beginning of the dictionary. I think it should be removed from StringListWrapper as well, and only be in the Dictionary class.

        If we have duplicate entries it does not matter because the dictionary is a set, which contains all strings mentioned in the dictionary file, if there is one twice it will still only be one in the set.

        Show
        Joern Kottmann added a comment - We should not add the flag per entry, it should be only once at the beginning of the dictionary. I think it should be removed from StringListWrapper as well, and only be in the Dictionary class. If we have duplicate entries it does not matter because the dictionary is a set, which contains all strings mentioned in the dictionary file, if there is one twice it will still only be one in the set.
        Hide
        James Kosin added a comment -

        Could someone review the changes please. I've made the entry a lowercase "case" attribute instead of upper case as in the example above.
        Thanks,

        Show
        James Kosin added a comment - Could someone review the changes please. I've made the entry a lowercase "case" attribute instead of upper case as in the example above. Thanks,
        Hide
        James Kosin added a comment -

        Okay,
        (1) I'm going to implement an attribute for the case sensitivity setting for the entries being saved to the dictionary. I'm going to keep them separate for now and not try to combine them into the dictionary object. The reason being; I'd like to keep the created Dictionary object default the user's required default setting for the comparison and not for the original dictionary elements as they where added. You'll understand as things progress.

        This means the new dictionary output will be like this
        <code>
        <dictionary>
        <entry CASE="true">
        <token>Patrick</token>
        </entry>
        </code>

        (2) Case sensitivity will be dependant on how the dictionary is created; however, if the user selects case insensitive comparison from the command line, the false value will take precidence to allow for a case insensitive compare on equals(). This means a dictionary created with the false default will not compare case sensitve regardless of any true value specified by the user. So, create the dictionary accordingly to the true case sensitivity that is required.

        (3) I'm hoping the attributes in the XML will allow us to be backward compatible; since they will not exist otherwise. I'm also hoping the case attribute will be backward compatible as well in that the older inplementation never used an attribute value system.

        Show
        James Kosin added a comment - Okay, (1) I'm going to implement an attribute for the case sensitivity setting for the entries being saved to the dictionary. I'm going to keep them separate for now and not try to combine them into the dictionary object. The reason being; I'd like to keep the created Dictionary object default the user's required default setting for the comparison and not for the original dictionary elements as they where added. You'll understand as things progress. This means the new dictionary output will be like this <code> <dictionary> <entry CASE="true"> <token>Patrick</token> </entry> </code> (2) Case sensitivity will be dependant on how the dictionary is created; however, if the user selects case insensitive comparison from the command line, the false value will take precidence to allow for a case insensitive compare on equals(). This means a dictionary created with the false default will not compare case sensitve regardless of any true value specified by the user. So, create the dictionary accordingly to the true case sensitivity that is required. (3) I'm hoping the attributes in the XML will allow us to be backward compatible; since they will not exist otherwise. I'm also hoping the case attribute will be backward compatible as well in that the older inplementation never used an attribute value system.
        Hide
        James Kosin added a comment -

        Alright,
        (1) Even thought the StringListWrapper makes things complicated, there isn't an easier way to return a StringList with case sensitivity or insensitivity while using the equals() method. The other method would involve going through all the uses for the Dictionary and the StringList and determine how to allow for case sensitive and insensitive testing. Not really nice.
        (2) Saving the case sensitivity of the Dictionary is becoming a requirement; however, we need clairity on two options... (a) default setting and the priority... we can make case insensitive the default and the priority... ie: setting another to true and trying to compare will result still in a case insensitive or the otherway around. (b) saving the case sensitivity is a bit problematic in that the flag is required for the dictionary on creation and is used as entries are added... this means, it is difficult to restore the flag after the fact, or even before without creating a new dictionary class to store the entries.

        Show
        James Kosin added a comment - Alright, (1) Even thought the StringListWrapper makes things complicated, there isn't an easier way to return a StringList with case sensitivity or insensitivity while using the equals() method. The other method would involve going through all the uses for the Dictionary and the StringList and determine how to allow for case sensitive and insensitive testing. Not really nice. (2) Saving the case sensitivity of the Dictionary is becoming a requirement; however, we need clairity on two options... (a) default setting and the priority... we can make case insensitive the default and the priority... ie: setting another to true and trying to compare will result still in a case insensitive or the otherway around. (b) saving the case sensitivity is a bit problematic in that the flag is required for the dictionary on creation and is used as entries are added... this means, it is difficult to restore the flag after the fact, or even before without creating a new dictionary class to store the entries.
        Hide
        James Kosin added a comment -

        Thanks William,

        I see that now.

        Show
        James Kosin added a comment - Thanks William, I see that now.
        Hide
        William Colen added a comment -

        BTW there are unit tests:
        opennlp.tools.dictionary.DictionaryAsSetCaseInsensitiveTest
        opennlp.tools.dictionary.DictionaryAsSetCaseSensitiveTest

        Show
        William Colen added a comment - BTW there are unit tests: opennlp.tools.dictionary.DictionaryAsSetCaseInsensitiveTest opennlp.tools.dictionary.DictionaryAsSetCaseSensitiveTest
        Hide
        William Colen added a comment -

        The asStringSet returns an special implementation of Set. It relies on the dictionary field entrySet that is a Set<StringListWrapper>, so the hashcode and equals are override and the implementation use the flag. Also the returned Set<String> overrides the contains method:

        public boolean contains(Object obj) {
        boolean result = false;
        if (obj instanceof String)

        { String str = (String) obj; result = entrySet.contains(new StringListWrapper(new StringList(str), caseSensitive)); }

        return result;
        }

        it wraps the string with the StringListWrapper adding the flag so the entrySet.contains(..) should work.

        Show
        William Colen added a comment - The asStringSet returns an special implementation of Set. It relies on the dictionary field entrySet that is a Set<StringListWrapper>, so the hashcode and equals are override and the implementation use the flag. Also the returned Set<String> overrides the contains method: public boolean contains(Object obj) { boolean result = false; if (obj instanceof String) { String str = (String) obj; result = entrySet.contains(new StringListWrapper(new StringList(str), caseSensitive)); } return result; } it wraps the string with the StringListWrapper adding the flag so the entrySet.contains(..) should work.
        Hide
        James Kosin added a comment -

        Hmm... very interesting.

        I've actually got the Dictionary{} class replaced by a completely empty class and the caseSensitivity flag isn't really being used anywhere.

        ie: TokenizerME uses dictionary to get a asStringSet() to return a Set<String> for usage... loosing case flag.

        List goes on...
        I'll get a more complete list soon.

        Show
        James Kosin added a comment - Hmm... very interesting. I've actually got the Dictionary{} class replaced by a completely empty class and the caseSensitivity flag isn't really being used anywhere. ie: TokenizerME uses dictionary to get a asStringSet() to return a Set<String> for usage... loosing case flag. List goes on... I'll get a more complete list soon.
        Hide
        William Colen added a comment -

        James, are you following the conversation "Case sensitivity of a dictionary loaded from the model" in Dev list?

        In my opinion we should stop using this StringListWrapper. This class is a good idea but is making things complicated. It wraps the StringList to add the case sensitivity flag and overrides equals and hashcode methods according to the flag.
        Instead of it, in my opinion, in case of a case insensitive dictionary, we should lower case the entries during the dictionary creation and persist the flag to the serialized file.

        Show
        William Colen added a comment - James, are you following the conversation "Case sensitivity of a dictionary loaded from the model" in Dev list? In my opinion we should stop using this StringListWrapper. This class is a good idea but is making things complicated. It wraps the StringList to add the case sensitivity flag and overrides equals and hashcode methods according to the flag. Instead of it, in my opinion, in case of a case insensitive dictionary, we should lower case the entries during the dictionary creation and persist the flag to the serialized file.
        Hide
        William Colen added a comment -

        "Should case sensitivity or insensitivity be dominant?"

        It depends on the tool. We should be able to try both while creating a new model and decide it in terms of F1.

        Show
        William Colen added a comment - "Should case sensitivity or insensitivity be dominant?" It depends on the tool. We should be able to try both while creating a new model and decide it in terms of F1.
        Hide
        James Kosin added a comment -

        Should case sensitivity or insensitivity be dominant?

        Show
        James Kosin added a comment - Should case sensitivity or insensitivity be dominant?
        Hide
        James Kosin added a comment -

        Okay,

        I found one problem, outside of the multiple copies.
        Any new dictionary item setting the case sensitivity will be voided by the orignal dictionary setting.

        <code>
        if (isCaseSensitive)

        { result = this.stringlist.equals(other.getStringList()); }

        ...
        </code>

        Ignores the case setting for other and will result in different results depending on the order of the compare.

        It probably should read:
        <code>
        if (isCaseSensitive || other.isCaseSensitive) {
        ...
        </code>

        James

        Show
        James Kosin added a comment - Okay, I found one problem, outside of the multiple copies. Any new dictionary item setting the case sensitivity will be voided by the orignal dictionary setting. <code> if (isCaseSensitive) { result = this.stringlist.equals(other.getStringList()); } ... </code> Ignores the case setting for other and will result in different results depending on the order of the compare. It probably should read: <code> if (isCaseSensitive || other.isCaseSensitive) { ... </code> James
        Hide
        James Kosin added a comment -

        Review...

        The dictionary has the case sensitivity flag and implementes multiple copies throughout.... hmmm... this will make it hard to directly relate any new setting for the caseSensitivity.
        I'm going to compare with the older code to see what changed.

        Show
        James Kosin added a comment - Review... The dictionary has the case sensitivity flag and implementes multiple copies throughout.... hmmm... this will make it hard to directly relate any new setting for the caseSensitivity. I'm going to compare with the older code to see what changed.
        Hide
        James Kosin added a comment -

        I'll look into the case sensitivity flag and determine the best course of action.
        Currently, it looks like we will be adding a configuration item for this to the saved dictionary structure with a method to allow modification.

        Currently, I believe the case sensitivity is only used to determine if two entries that are different case will be added to the dictionary or not.
        I believe what we may be wanting is a flag that can do multiple things; (a) determine if an entry with different case is allowed in the dictionary, (b) be used in the use of the dictionary for case sensitivity, (c) be able to override the default value for the dictionary.

        Let me know if I'm wrong on any counts, as I review the code.

        Show
        James Kosin added a comment - I'll look into the case sensitivity flag and determine the best course of action. Currently, it looks like we will be adding a configuration item for this to the saved dictionary structure with a method to allow modification. Currently, I believe the case sensitivity is only used to determine if two entries that are different case will be added to the dictionary or not. I believe what we may be wanting is a flag that can do multiple things; (a) determine if an entry with different case is allowed in the dictionary, (b) be used in the use of the dictionary for case sensitivity, (c) be able to override the default value for the dictionary. Let me know if I'm wrong on any counts, as I review the code.

          People

          • Assignee:
            James Kosin
            Reporter:
            mark meiklejohn
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development