Commons IO / IO-167

Fix case-insensitive string handling

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 2.0
    • Component/s: None
    • Labels:
      None

      Description

      Case-insensitive operations are currently platform-dependent; please see Common Bug #3 for details.

      1. IO-167-checkIndexOf.patch
        7 kB
        Niall Pemberton
      2. IO-167.patch
        5 kB
        Benjamin Bentmann
      3. IO-167-a.patch
        5 kB
        Benjamin Bentmann
      4. IO-167.patch
        4 kB
        Benjamin Bentmann

        Activity

        Mark Thomas made changes -
        Workflow jira [ 12429833 ] Default workflow, editable Closed status [ 12601689 ]
        Henri Yandell made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Niall Pemberton made changes -
        Resolution Fixed [ 1 ]
        Assignee Niall Pemberton [ niallp ]
        Status Open [ 1 ] Resolved [ 5 ]
        Niall Pemberton added a comment - Fixed http://svn.apache.org/viewvc?view=rev&revision=661822
        Benjamin Bentmann added a comment -

        > add a new checkIndexOf() method to IOCase

        Extending IOCase in this way seems like the logical step to me. I have only a few minor wishes remaining:

        • Add a @since tag to the newly added method.
        • Extend the IOCaseTestCase to check for proper handling of case-insensitivity with non-ASCII characters and to guard against regressions of this issue. You should be able to simply copy the code from FilenameUtilsWildcardTestCase.testLocaleIndependence() (which will hopefully be integrated regardless) and change one or two lines to test checkIndexOf().

        With Niall's patch in place, IOCase.convertCase() seems unused and could be deleted.

        Niall Pemberton made changes -
        Attachment IO-167-checkIndexOf.patch [ 12383118 ]
        Niall Pemberton added a comment -

        Here's my suggestion for changing wildcard matching:

        • add a new checkIndexOf() method to IOCase
        • change the FilenameUtils wildcardMatch() method to use IOCase's checkRegionMatches() and checkIndexOf() methods, which use String's underlying regionMatches() method http://tinyurl.com/252m3k
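A case-insensitive index search built on String.regionMatches() might look roughly like the sketch below. It is an illustration only, not the code from the attached patch: the class name RegionIndexOf and the method signature are assumptions, and the only JDK call relied on is String.regionMatches(boolean, int, String, int, int).

```java
// Hypothetical sketch; not the actual IOCase implementation.
final class RegionIndexOf {

    /**
     * Returns the first index at or after fromIndex where search occurs in str,
     * or -1 if there is none. The comparison is delegated to String.regionMatches(),
     * which compares char by char and does not consult the default locale.
     */
    static int indexOf(String str, int fromIndex, String search, boolean ignoreCase) {
        int last = str.length() - search.length();
        for (int i = Math.max(fromIndex, 0); i <= last; i++) {
            if (str.regionMatches(ignoreCase, i, search, 0, search.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("README.TXT", 0, ".txt", true));  // 6
        System.out.println(indexOf("README.TXT", 0, ".txt", false)); // -1
    }
}
```

Because no lower-cased or upper-cased copy of either string is created, the result should not vary with the user's locale.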
        Niall Pemberton made changes -
        Fix Version/s 2.0 [ 12312961 ]
        Niall Pemberton added a comment -

        Benjamin, thanks for the explanation, I have applied the FileSystemUtils part of the patch:
        http://svn.apache.org/viewvc?view=rev&revision=661646

        > Is wildcardMatch() meant to be platform-dependent?

        I wasn't around when the IOCase functionality was developed, so I don't know the original intent, and I guess that the issue wasn't even considered. AFAIK, in all the other places it's used, it's used in conjunction with String's equalsIgnoreCase(), so IMO we should make it consistent with that.

        Benjamin Bentmann added a comment -

        > Wish I could think of a good example for a platform-dependent use-case, but I feel sure that someone will need it ...

        You're perfectly right, just because one cannot spontaneously think up a use-case doesn't mean there is none. My only wish is that the behavior of IOCase.INSENSITIVE is changed to be locale-independent as proposed, basically because I consider this the major use-case which people implicitly had in mind when they used this matching in existing code. To support the other use-case: What about simply adding a new IOCase.INSENSITIVE_LOCALE_AWARE or something?

        Sebb added a comment -

        > You could move this down into this method body, if it helps to catch the eye.

        I think the comment needs to remain in the Javadoc.
        The upcase/downcase line needs a separate comment to say that this is effectively how String.equalsIgnoreCase() has to do it to cope with odd Locales.

        > LICENSE use case

        Good example. Wish I could think of a good example for a platform-dependent use-case, but I feel sure that someone will need it ...

        Benjamin Bentmann added a comment - - edited

        > Might be an idea to add a comment to the patch explaining that this is necessary to agree with String.equalsIgnoreCase().

        I agree, these are the subtle things that are good to document. From the Javadoc of the method (latest patch):

        * <strong>Note:</strong> The return value of this method does not necessarily match
        * the return value from {@link String#toLowerCase()}. Instead, the return value is
        * constructed to guarantee the following condition: <code>str1.equalsIgnoreCase(str2)</code>
        * if and only if <code>convertCase(str1).equals(convertCase(str2))</code>.
        

        You could move this down into this method body, if it helps to catch the eye. But ultimately, the unit test covers this code path; omitting either the transformation to lower case or the one to upper case will fail the test.

        > As to whether wildcardMatch() should be platform-dependent or independent, there are probably use-cases for both.

        To advocate for platform-independence, this is the use case I have in mind: Consider an open-source project with a developer community spread around the world. Let's assume in particular that some Turkish developers participate. Let's say this project has a license file in its sources, named "LICENSE". This license should be picked up by some wildcard-based pattern, e.g. "license". As the case of the file name is usually quite irrelevant for the distro, people might want to do case-insensitive wildcard matching here. Now, for our Turkish team-mates the file name match fails, because lower-casing "LICENSE" under the Turkish locale yields "lıcense" with a dotless "ı".

        Sebb added a comment -

        It looks odd that the patch for convertCase() upcases and then downcases the characters.
        Might be an idea to add a comment to the patch explaining that this is necessary to agree with String.equalsIgnoreCase().

        I agree that FileSystemUtils needs to use Locale.ENGLISH for OS name comparisons.

        As to whether wildcardMatch() should be platform-dependent or independent, there are probably use-cases for both.
        But whatever is decided - maybe have two versions? - the Javadoc needs to make it clear (and it needs to work in Turkey!)

        Benjamin Bentmann made changes -
        Attachment IO-167.patch [ 12382733 ]
        Benjamin Bentmann added a comment -

        Is wildcardMatch() meant to be platform-dependent?

        If yes, my first patch still does not sufficiently fix the problem. To match the behavior of String.equalsIgnoreCase(), one needs to consider both the lower case and the upper case form of a character. New patch with extended unit test.
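Why both forms matter can be seen with the two Turkish "i" variants. The sketch below is a guess at the shape of such a normalisation, not the patch itself: the class name CaseFold is hypothetical, and the method merely applies Character.toUpperCase() followed by Character.toLowerCase() to each char.

```java
// Hypothetical sketch of an upcase-then-downcase normalisation.
final class CaseFold {

    /** Maps every char c to toLowerCase(toUpperCase(c)) using the locale-independent Character methods. */
    static String convertCase(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(Character.toUpperCase(chars[i]));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // U+0131 is the dotless small i, U+0130 is the capital I with dot above.

        // Lower-casing alone leaves U+0131 unchanged, so it would never equal 'i'.
        System.out.println(Character.toLowerCase('\u0131') == 'i'); // false
        // Upper-casing alone leaves U+0130 unchanged, so it would never equal 'I'.
        System.out.println(Character.toUpperCase('\u0130') == 'I'); // false

        // Applying both steps folds i, I, U+0131 and U+0130 to the same char.
        System.out.println(convertCase("\u0131").equals(convertCase("I"))); // true
        System.out.println(convertCase("\u0130").equals(convertCase("i"))); // true
    }
}
```

Dropping either step leaves one of the two characters unmatched, which is presumably why a test covering both fails when one transformation is omitted.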

        Benjamin Bentmann made changes -
        Attachment IO-167-a.patch [ 12382721 ]
        Benjamin Bentmann added a comment - - edited

        Here's a slightly extended version of the patch which can reveal the defect in FileSystemUtils. Although the test properly failed for me under "mvn test" with the unpatched FileSystemUtils, I would recommend running the test individually to make sure nothing triggers the static initialization of FileSystemUtils before the static initializer in the test class is run. This patch is not meant as a replacement for my previous patch but should merely serve for illustration purposes.

        Benjamin Bentmann added a comment -

        > I don't believe the FileSystemUtils changes will make any difference to their operation

        I'm not sure whether you did not read the mail post I mentioned or whether it just wasn't clear enough, so I will try to explain again. The correctness of FileSystemUtils depends on its ability to correctly detect the underlying OS. This detection is based on recognition of known OS names which, for resiliency, is intended to be case-insensitive. If you're familiar with the Unicode standard, you will remember that character casing for non-English languages is non-trivial. As just one example, the Turkish language defines the lower case form of "I" to be "ı" (dotless i). In other words, if a JVM runs under the Turkish locale and the system property "os.name" returns "IRIX", "UNIX", "MPE/IX" or "SOLARIS", the unpatched FileSystemUtils will not detect the OS. As a consequence, freeSpaceOs() fails with an exception.
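The failure mode can be reproduced in a few lines. This is a minimal sketch, not the actual FileSystemUtils detection code: the class OsNameCheck and its method are hypothetical, and the locale is passed explicitly so the effect does not depend on the machine's default (the no-argument String.toLowerCase() uses Locale.getDefault()).

```java
import java.util.Locale;

// Hypothetical sketch; the real FileSystemUtils detection logic differs in detail.
final class OsNameCheck {

    /** Lower-cases the OS name under the given locale and looks for a known name fragment. */
    static boolean looksLikeIrix(String osName, Locale locale) {
        return osName.toLowerCase(locale).contains("irix");
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Under Turkish rules "I" lower-cases to the dotless i, so the fragment is not found.
        System.out.println(looksLikeIrix("IRIX", turkish));        // false
        // A fixed locale without special casing rules gives the expected result.
        System.out.println(looksLikeIrix("IRIX", Locale.ENGLISH)); // true
    }
}
```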

        So when you doubt the patch will make a difference to the operation, is that because you believe the outlined preconditions will never occur or because an exception doesn't make a difference to you?

        > the package-private IOCase convertCase() method is only used by the FilenameUtils's wildcardMatch() method

        Just one question for my own understanding: Is wildcardMatch() meant to be platform-dependent? In other words, would it be considered correct for the method if a call with argument IOCase.INSENSITIVE returns different matches based on the user's locale?

        > it seems wrong to me to hard-code English in principle

        "believe", "seems"... with all respect, correctness is not a matter of gut feeling. I have no problem if somebody proves me wrong, but such a proof must be based on specs, APIs or otherwise authoritative materials.

        From the API docs for String.toLowerCase():

        > To obtain correct results for locale insensitive strings, use toLowerCase(Locale.ENGLISH)

        I believe that file names should be understood as locale insensitive strings, as a matter of interoperability, but that assumption might be wrong.

        Using the English locale for the case conversion will not limit the code to ASCII characters, if this was your concern. It will merely fix the behavior of String.to*erCase() to platform-independent conversion rules. If you look at the source code for to*erCase() you will notice that it has an if for the languages "tr", "az" and "lt". The selection of Locale.ENGLISH is quite arbitrary; Locale.GERMAN or Locale.FRENCH will work equally well. The key point is to avoid the if regardless of the user's locale.

        Back to Unicode: case conversions can be defined in terms of isolated 1:1 character mappings or context-sensitive m:n mappings matching some written language. In most cases (e.g. when you don't want to produce text for human consumption), Java code seeks platform-independence, which implies locale-independence. Unicode offers this via the 1:1 character mappings, available via Character.to*erCase() and String.equalsIgnoreCase(). If one wants to approximate this behavior using String.to*erCase(), one must lock the locale.
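The difference between the 1:1 and the m:n mappings shows up even with the locale locked, for instance with the German sharp s. The class and method names below are made up for the illustration; only the JDK calls are real.

```java
import java.util.Locale;

// Hypothetical helper names; illustrates full (1:n) versus simple (1:1) case mapping.
final class MappingKinds {

    /** Full mapping: String.toUpperCase() may change the length of the string. */
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    /** Simple mapping: Character.toUpperCase() always maps one char to one char. */
    static char simpleUpper(char c) {
        return Character.toUpperCase(c);
    }

    public static void main(String[] args) {
        // U+00DF is the sharp s.
        System.out.println(fullUpper("\u00DF"));               // SS
        System.out.println(simpleUpper('\u00DF') == '\u00DF'); // true
        // equalsIgnoreCase() works on 1:1 mappings, so strings of different length never match.
        System.out.println("\u00DF".equalsIgnoreCase("SS"));   // false
    }
}
```

So locking the locale only approximates the behavior of equalsIgnoreCase(); it removes the language-specific branch but keeps the length-changing mappings.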

        Niall Pemberton added a comment -

        I think we should close this as WONTFIX

        • I don't believe the FileSystemUtils changes will make any difference to their operation
        • the package-private IOCase convertCase() method is only used by the FilenameUtils's wildcardMatch() method - it seems wrong to me to hard-code English in principle
        Benjamin Bentmann made changes -
        Field Original Value New Value
        Attachment IO-167.patch [ 12380930 ]
        Benjamin Bentmann created issue -

          People

          • Assignee:
            Niall Pemberton
            Reporter:
            Benjamin Bentmann
          • Votes:
            0
            Watchers:
            0
