Lucene - Core
  LUCENE-6563

MockFileSystemTestCase.testURI should be improved to handle cases where OS/JVM cannot create non-ASCII filenames

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      ant test -Dtestcase=TestVerboseFS -Dtests.method=testURI -Dtests.file.encoding=UTF-8 fails (for example) with 'Oracle Corporation 1.8.0_45 (64-bit)' when the default sun.jnu.encoding system property is (for example) ANSI_X3.4-1968

      [details to follow]

      1. LUCENE-6563.patch
        3 kB
        Uwe Schindler
      2. LUCENE-6563.patch
        2 kB
        Uwe Schindler

        Activity

        ASF GitHub Bot added a comment -

        GitHub user cpoerschke opened a pull request:

        https://github.com/apache/lucene-solr/pull/152

        LUCENE-6563: sun.jnu.encoding to match file.encoding system property

        for https://issues.apache.org/jira/i#browse/LUCENE-6563

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/bloomberg/lucene-solr trunk-sun-jnu-encoding

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/lucene-solr/pull/152.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #152


        commit 58baae43edff917d46d1799d0e24b7f7ab828565
        Author: Christine Poerschke <cpoerschke@bloomberg.net>
        Date: 2015-06-12T19:00:25Z

        LUCENE-????: changed common-build.xml to ensure sun.jnu.encoding matches file.encoding system property

        changes:

        • add -Dtests.sun.jnu.encoding to common-build.xml (it takes the same value as -Dtests.file.encoding)
        • add the sun.jnu.encoding system property to RunListenerPrintReproduceInfo.java (for illustration purposes only)

        background:

        • ant test -Dtestcase=TestVerboseFS -Dtests.method=testURI -Dtests.file.encoding=UTF-8
          fails (for example) with 'Oracle Corporation 1.8.0_45 (64-bit)' when the
          default sun.jnu.encoding system property is (for example) ANSI_X3.4-1968

        related links/tickets:

        • https://netbeans.org/bugzilla/show_bug.cgi?id=246438#c24
        • http://happygiraffe.net/blog/2009/09/24/java-platform-encoding/
        • http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8003228

        test failure details:

        NOTE: reproduce with: ant test -Dtestcase=TestVerboseFS -Dtests.method=testURI -Dtests.seed=31F5C6E85DAF4E6B -Dtests.slow=true -Dtests.locale=no -Dtests.timezone=Australia/Melbourne -Dtests.asserts=true -Dtests.file.encoding=UTF-8
        0.12s | TestVerboseFS.testURI <<<
        Throwable #1: java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: ??
        at __randomizedtesting.SeedInfo.seed([31F5C6E85DAF4E6B:B847BD4395B0145A]:0)
        at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
        at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
        at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
        at sun.nio.fs.AbstractPath.resolve(AbstractPath.java:53)
        at org.apache.lucene.mockfile.FilterPath.resolve(FilterPath.java:156)
        at org.apache.lucene.mockfile.MockFileSystemTestCase.testURI(MockFileSystemTestCase.java:71)
        at java.lang.Thread.run(Thread.java:745)

        test case code snippet:

        MockFileSystemTestCase.testURI
        ...
        69 assumeTrue(Charset.defaultCharset().name() + " can't encode chinese",
        70 Charset.defaultCharset().newEncoder().canEncode("中国"));
        71 Path f3 = dir.resolve("中国");
        ...
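        The assumption guard in the snippet above relies on CharsetEncoder.canEncode. A minimal standalone sketch of that check (the CharsetCheck class name is ours, not Lucene's):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    /** True if the charset's encoder can represent every character of the text. */
    static boolean canEncode(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // UTF-8 can encode the Chinese test filename; US-ASCII cannot,
        // which is the situation under LANG=C / ANSI_X3.4-1968.
        System.out.println(canEncode(StandardCharsets.UTF_8, "中国"));    // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, "中国")); // false
    }
}
```

        Note that Charset.defaultCharset() depends on file.encoding, while the failing resolve call depends on the filename encoding, which is why the assume can pass and the resolve still throw.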


        Ramkumar Aiyengar added a comment -

        Uwe Schindler, Dawid Weiss, any opinions on this? I can pick this up if this looks good..

        Dawid Weiss added a comment -

        I would like to see a reproducible failure first. This is an internal JVM property – why is it causing problems?

        Ramkumar Aiyengar added a comment - - edited

        This reproducibly fails for me..

        LANG=C ant test -Dtestcase=TestVerboseFS -Dtests.method=testURI -Dtests.seed=31F5C6E85DAF4E6B -Dtests.slow=true -Dtests.locale=no -Dtests.timezone=Australia/Melbourne -Dtests.asserts=true -Dtests.file.encoding=UTF-8

        With my default locale (LANG=en_GB.UTF-8), things work fine..

        LANG=en_GB causes the test to fail as well..

        Dawid Weiss added a comment -

        This is indeed an interesting one.

        sun.jnu.encoding is a very deep, internal, JVM-specific and largely undocumented property that controls how unicode characters are mapped into bytes to encode file names. Looking at OpenJDK code I think it dates back to times when most filesystem-related APIs were byte-based (not unicode-based), so such a conversion was required to somehow translate Java's default string representation into operating system's defaults.

        OpenJDK tries to figure it out internally (on Windows and on Unix systems) with sun.jnu.encoding functioning as an override.

        I don't think the patch should be incorporated though. This is an attempt to dodge an odd behavior of the JVM itself; if you specify LANG=C then this means that the JVM won't be able to write any filenames with non-ASCII characters. The test is failing and I think it should be failing – it fails exactly at the check it was supposed to do. We shouldn't be trying to improve the defaults.

        I think there are two solutions – either we agree that all Lucene-related filesystem operations use only ASCII (and this can be verified at mock filesystem level), then the test can be corrected as well. Or we agree that filesystems with crippled unicode support are not supported. Then the test should keep failing (with a better message perhaps).

        Uwe Schindler added a comment -

        I think there are two solutions – either we agree that all Lucene-related filesystem operations use only ASCII (and this can be verified at mock filesystem level), then the test can be corrected as well. Or we agree that filesystems with crippled unicode support are not supported. Then the test should keep failing (with a better message perhaps).

        • "and this can be verified at mock filesystem level": we should only verify this below Lucene directories (so we dont create new filenames for segments in non-ASCII), but we should not assume that e.g. the test directory or the directory where a user has placed the index is ASCII-only. E.g. on a Windows system with Umlauts in User name you have both a whitespace and a Umlaut in the path name of the home directory.
        • I think the test should keep failing in this configuration. This is something that the test has setup in a wrong way.

        We should never ever change the sun.jnu.encoding property, because this may produce broken filenames on some systems (that you cannot even delete later from the command line...).

        Dawid Weiss added a comment -

        > but we should not assume that e.g. the test directory or the directory where a user has placed the index is ASCII-only.

        Yeah, I agree. I think this test should keep failing. When somebody has LANG=C and this causes the JVM not to be able to properly encode certain filenames then this is a problem of the test environment (and arguably a JVM bug if the filesystem can handle unicode properly).

        Dawid Weiss added a comment -

        Christine Poerschke could you shed more light on the software/ hardware platform where this originally occurred?

        Christine Poerschke added a comment -

        Dawid Weiss and Uwe Schindler - thanks for taking a look at this.

        It's consistently reproducible for -Dtests.file.encoding=UTF-8, both on the dev machines our team usually uses and within the jenkins instance that runs the test suite.

        The dev machines have LANG=C environment by default and junit reports Linux 2.6.32-358.11.1.el6.x86_64 amd64/Oracle Corporation 1.8.0_45 (64-bit) or Linux 2.6.32-358.41.1.el6.x86_64 amd64/Oracle Corporation 1.8.0_45 (64-bit) versions.

        On the machines where the jenkins job runs, including a 'locale' step shows LANG= and junit reports Linux 2.6.18-238.1.1.el5 amd64/Oracle Corporation 1.8.0_45 (64-bit) versions.

        Uwe Schindler added a comment -

        Thanks.

        I don't see this as a bug.

        Ramkumar Aiyengar added a comment -

        I see why we shouldn't modify the internal prop – it sounded hacky to begin with, but I still think our tests shouldn't fail for some supported locale on a supported OS (I was able to reproduce this on the latest Ubuntu on my laptop).

        To the best of my knowledge, Lucene doesn't need non-ASCII file names for itself – so if we are not able to create non-ASCII filenames to begin with, shouldn't the mock filesystem not try to use them? That way we continue to test that if the OS was able to create non-ASCII parent directories, Lucene should continue to work fine, but that shouldn't be a requirement for Lucene per se.

        Dawid Weiss added a comment -

        I honestly think this is a quirk/bug in the JVM... and perhaps should be reported. Setting LANG=C shouldn't be affecting how filenames are (mis)handled (and it currently does).

        so if we are not able to create non-ASCII filenames to begin with, shouldn't the mock filesystem not try to use them?

        Well, it's a good test, isn't it? Caught the nasty thing right away... I don't have a strong opinion. I think modifying the mock filesystem to enforce ASCII-only names and then removing that non-ASCII test section sounds all right. Could you provide a patch and check if it passes on your systems?

        Christine Poerschke added a comment -

        Sure. Will give that a go.

        Uwe Schindler added a comment -

        OK, I agree.

        But as I said, MockFileSystem should not validate all file name components, only those that are created by Lucene: e.g. if my checkout is named C:/Users/Uwe Schindler/trünk this must still work.

        Dawid Weiss added a comment -

        C:/Users/Uwe Schindler/trünk

        LOL. I think this should become the default for our svn as well...

        Robert Muir added a comment -

        But as I said, MockFileSystem should not validate all file name components, only those that are created by Lucene

        Then there is no point in doing anything. So please don't add unnecessary confusing code.

        The point of the current test logic is to throw an assumption if the system cannot encode unicode. The test just needs to be improved.

        ASF GitHub Bot added a comment -

        GitHub user cpoerschke opened a pull request:

        https://github.com/apache/lucene-solr/pull/182

        LUCENE-6563: tweak MockFileSystemTestCase.testURI assumptions

        for https://issues.apache.org/jira/i#browse/LUCENE-6563

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/bloomberg/lucene-solr trunk-lucene-6563

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/lucene-solr/pull/182.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #182


        commit 6d792bba409fe080efad30c1e03f060f3c66a039
        Author: Christine Poerschke <cpoerschke@bloomberg.net>
        Date: 2015-07-06T13:34:12Z

        LUCENE-6563: tweak MockFileSystemTestCase.testURI assumptions

        • testURI itself now covers only a plain ASCII file name
        • the Chinese file name case is now in testURIchinese
        • also added a testURIumlaute file name case
        • implTestURI factored out to hold the test logic itself (if resolve fails for non-ASCII file names then the toUri part of the test is skipped)
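        A rough sketch of the resolve-then-skip pattern the pull request describes (an illustration only, not the actual patch; ResolveProbe and tryResolve are hypothetical names):

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResolveProbe {
    /**
     * Attempt dir.resolve(fileName); return null instead of propagating
     * InvalidPathException, so a caller (e.g. a test) can skip rather than fail.
     */
    static Path tryResolve(Path dir, String fileName) {
        try {
            return dir.resolve(fileName);
        } catch (InvalidPathException e) {
            // Filename not representable in the JVM's filename encoding.
            return null;
        }
    }

    public static void main(String[] args) {
        // Plain ASCII names should always resolve.
        System.out.println(tryResolve(Paths.get("."), "file1"));
        // Non-ASCII names may yield null under e.g. LANG=C.
        System.out.println(tryResolve(Paths.get("."), "中国"));
    }
}
```

        In a JUnit test the null case would instead call Assume.assumeNoException, so the non-ASCII part is skipped rather than failed.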

        Uwe Schindler added a comment -

        I like this test more. Maybe Robert Muir can have a look, too. The assumeNoException is fine, but I think with that we can remove the assume for chinese - because it's subsumed by the assumeNoException!?

        Dawid Weiss added a comment -

        Don't want to be picky, but Charset.defaultCharset() isn't exactly what's used for filename encoding... most of the time it will be the same thing though so I think it's still an improvement.

        For the record, all of this charset-to-byte related stuff is a legacy headache... check out the interesting comments in openjdk if you're interested.

                        /* On windows the system locale may be different than the
                         * user locale. This is an unsupported configuration, [...]
        
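        To see the distinction on a concrete JVM one can print both properties; a small sketch (sun.jnu.encoding is internal and undocumented, and may be absent on non-HotSpot JVMs):

```java
import java.nio.charset.Charset;

public class EncodingProps {
    public static void main(String[] args) {
        // Charset.defaultCharset() derives from the file.encoding property...
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // ...while filename encoding on HotSpot follows sun.jnu.encoding,
        // which is initialized from the platform locale (e.g. LANG on Unix).
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("defaultCharset   = " + Charset.defaultCharset());
    }
}
```

        Under LANG=C the two can diverge: file.encoding may be forced to UTF-8 via -Dtests.file.encoding while sun.jnu.encoding stays ANSI_X3.4-1968.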
        Uwe Schindler added a comment -

        Don't want to be picky, but Charset.defaultCharset() isn't exactly what's used for filename encoding... most of the time it will be the same thing though so I think it's still an improvement.

        this is why I said:

        ...but I think with that we can remove the assume for chinese - because it's subsumed by the assumeNoException!?

        So I think we should just put the resolve into try/catch and if this fails, cancel test.

        Uwe Schindler added a comment -

        Also I think we should remove the true/false parameter. ASCII should always pass, so why add condition?

        Christine Poerschke added a comment -

        The boolean tryResolve flag was aiming to preserve existing logic, i.e. not catching any InvalidPathException that dir.resolve("file1") might throw. Happy to remove both it and the existing Charset.defaultCharset() check in favour of just try/catch-ing on the resolve.

        Uwe Schindler added a comment -

        Fine, thanks. Waiting for patch!

        Ramkumar Aiyengar added a comment -

        I honestly think this is a quirk/bug in the JVM... and perhaps should be reported. Setting LANG=C shouldn't be affecting how filenames are (mis)handled (and it currently does).

        I dug in a bit, and it appears that Java is kind of doing the right thing given its current API. Certain newer versions/filesystems of Windows and MacOSX guarantee that all filenames are in UTF-16/UTF-8 respectively. Linux/Solaris etc. (aka the more traditional Unix systems) tend not to care about the encoding at all. As a matter of fact, a filename is just a sequence of bytes; it's not even a string. How that byte array is displayed comes down to the locale. This is probably why this works:

        $ LANG=C touch 中国
        

        touch doesn't care about the input; the shell maps it into a set of UTF-8 bytes, which is stored as the filename. ls, then, in a UTF-8 locale shows the correct thing:

        $ ls
        build.xml  ivy.xml  lib  lucene-test-framework.iml  src  中国
        

        And if I try to read in the C locale, I get a bunch of unreadable chars:

        $ LANG=C ls
        build.xml  ivy.xml  lib  lucene-test-framework.iml  src  ??????
        

        Java, on the other hand, treats filenames as strings in all its APIs, and is therefore forced to pick an encoding even when it uses the name only for I/O and not for display. It goes with the locale. On platforms where the filename encoding is guaranteed to be UTF-8, it goes with that – see http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8003228 for MacOSX.

        Looks like this issue is not specific to Java – see https://bugs.python.org/issue19846 for example.
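        The byte-vs-string point above can be illustrated in a few lines (assuming a UTF-8 shell stored the name; the mis-decoding shown is the same kind of mojibake as in the test snippet earlier in this issue):

```java
import java.nio.charset.StandardCharsets;

public class FilenameBytes {
    public static void main(String[] args) {
        String name = "中国";
        // A byte-based API (POSIX open/creat) stores these 6 UTF-8 bytes verbatim.
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
        // An application in a different locale reinterprets the same bytes,
        // producing mojibake rather than the original characters.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}
```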

        Dawid Weiss added a comment -

        As a matter of fact, a filename is just a sequence of bytes, it's not even a string.

        In the end most things are just a sequence of bytes, Ramkumar. And seriously, standard C didn't have any unicode-related utilities for a looong time (because there was no unicode); strings were/are 0-byte-terminated byte regions. The interpretation of which characters they constitute is a higher-level concept.

        LANG=C touch 中国

        The question is how does the terminal know how to decode your input above into an argument (itself being a byte*)... and how did it know what you typed in (and which glyphs to pick in order to display it)... I'm guessing the terminal accepts unicode on input then if it sees C locale it blindly passes the input bytes without any conversion at all. The unicode is very likely UTF-8, which was specifically designed to be an identity conversion coding page (so that C "strings" just work with it) and it just happens to be the default filesystem encoding... That's why it works, it just performs no conversion at all...

        It's no magic, really. But trying to understand how and where character-to-byte(-to-glyph) conversions occur will drive you nuts because there is no consistency here.

        Ramkumar Aiyengar added a comment -

        My point is that the layer which stops caring about the sequence of bytes being a string differs between OSs. In MacOSX, it is understood as UTF-8 right up to the FS. In Linux, it's at the application layer.

        Actually in this case, my terminal emulator was still in UTF-8, so it accepted Chinese chars. These then got passed to touch as a stream of UTF-8 bytes (a simple char* to main()), and then touch, regardless of the locale, just created a filename with those bytes as input. Similarly, ls just read these bytes from the directory and only on display tried to interpret them using the locale. In both these cases, the interface used with the OS passed bytes in/out and nothing more. Java, on the other hand, uses Strings with an implicit encoding for these inputs and hence is forced to interpret these bytes even if they are not being displayed or input.
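(Editor's aside: Python makes this byte/string split explicit through `os.fsencode`/`os.fsdecode`, which decode with the filesystem encoding plus the `surrogateescape` error handler. This is one way a string-based runtime can round-trip arbitrary filename bytes; a neutral sketch, not how the JDK does it:)

```python
import os

# Raw bytes as touch would receive them: UTF-8 for 中国.
raw = "中国".encode("utf-8")

# A byte-oriented layer (touch, ls, the kernel) passes these through untouched.
# A string-oriented API (like Java's) must pick a charset to interpret them.
# Python's surrogateescape handler lets even bytes that are invalid in the
# chosen encoding survive the str round trip:
name = os.fsdecode(raw)       # bytes -> str via sys.getfilesystemencoding()
recovered = os.fsencode(name) # str -> bytes, restoring the originals
```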

        Dawid Weiss added a comment -

        In MacOSX, right up to the FS understands it as UTF-8.

        I doubt it does anything special. Try forming an incorrect UTF-8 sequence in a C application and creating a file on disk. If it fails, it means it is interpreted as such (but I don't think it will).

        Or mount a FAT-32 filesystem and create some UTF-8 files in there, then re-mount it under Windows. Good luck.

        As for Java, yes, the fact that the API uses UTF-16 has been a problem (or a blessing, depends how you look at it). Till this day I see ways to "fix" codepage conversion problems with horrible things like new String(bytesFromNetwork, "ISO-8859-1")...
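(Editor's aside: the reason the new String(bytesFromNetwork, "ISO-8859-1") trick "works" at all is that ISO-8859-1 maps every byte value to the code point of the same number, so decoding never fails and re-encoding is lossless. The same property, demonstrated in Python as a neutral sketch, not a recommendation:)

```python
# ISO-8859-1 maps each byte 0x00-0xFF to the Unicode code point of equal
# value, so decoding cannot fail and re-encoding restores the exact bytes.
data = bytes(range(256))
text = data.decode("iso-8859-1")
restored = text.encode("iso-8859-1")

# But the resulting characters are wrong for anything that was not
# actually Latin-1 text, which is why this is a horrible "fix":
mojibake = "中国".encode("utf-8").decode("iso-8859-1")
```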

        Christine Poerschke added a comment -

        patch updated. thanks.

        Uwe Schindler added a comment -

        Just reopening to apply that patch.

        Uwe Schindler added a comment -

        I like the current patch. Very clean and simple: if the Path does not resolve on the original filesystem, the test is cancelled.
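(Editor's aside: the pattern described here, probe the real filesystem first and cancel the test when the name cannot be represented, translates roughly as follows. This is a hypothetical Python sketch using unittest's skip mechanism, not the actual Lucene test code:)

```python
import os
import unittest

def path_or_skip(dirpath, name):
    """Join dirpath and name, skipping the calling test when the local
    filesystem encoding cannot represent the name (roughly analogous to
    JUnit's Assume.assumeNoException as used in the patch)."""
    try:
        path = os.path.join(dirpath, name)
        os.fsencode(path)  # raises if the fs encoding cannot encode it
        return path
    except (UnicodeEncodeError, ValueError) as e:
        raise unittest.SkipTest("filesystem cannot encode %r: %s" % (name, e))
```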

        Uwe Schindler added a comment -

        I attached Christine's patch. Tests pass for me, so I would like to commit this soon. Any comments from MacOSX users?

        Uwe Schindler added a comment -

        By the way, the Chinese test now passes on Windows with the default (8-bit) charset "windows-1252" (my machine), because the underlying filesystem can still encode Chinese chars... So this is an improvement; the previous assume was just wrong, because it relied on the default charset, not the one used by the filesystem internally. So the assumeNoException() is much better.
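(Editor's aside: the point that the default charset and the filesystem's internal encoding are independent can be checked directly: cp1252 cannot encode Chinese, yet NTFS stores names as UTF-16 and creates such files fine. A small illustration assuming Python's cp1252 codec:)

```python
# windows-1252 is an 8-bit charset: it cannot encode Chinese characters,
# so an assume based on the *default* charset would wrongly skip the test.
try:
    "中国".encode("cp1252")
    default_charset_can_encode = True
except UnicodeEncodeError:
    default_charset_can_encode = False

# The filesystem's own encoding is what matters; NTFS uses UTF-16,
# which covers all of Unicode:
utf16_can_encode = "中国".encode("utf-16-le") is not None
```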

        Ramkumar Aiyengar added a comment -

        Uwe Schindler, looks like this is the original patch and not the one without the Charset assume etc.?

        Uwe Schindler added a comment -

        Sorry uploaded wrong patch. I HATE GIT!!! Die, GIT, die!

        Christine Poerschke added a comment -

        Interesting, looks like https://github.com/apache/lucene-solr/pull/182 contains both commits (original and arg removal follow-up) but https://github.com/apache/lucene-solr/pull/182.patch still has just the original commit.

        Uwe Schindler added a comment -

        I had the same problem initially. I solved the whole thing by just downloading the raw files and copy-pasting them over mine. I just forgot to create a new patch from my Subversion checkout!

        Uwe Schindler added a comment -

        https://patch-diff.githubusercontent.com/raw/apache/lucene-solr/pull/182.diff works better! I have no idea why this is like that but this is just horrible!

        Ramkumar Aiyengar added a comment -

        I think it's one patch per commit, and the patch file has multiple commits in it. git tools probably understand this multi-patch format, and maybe svn doesn't.

        Christine Poerschke added a comment -

        Yeah, just noticed the same, very subtle! Never mind git or svn: as a human reader, would I wish to scroll through multi-patches?

        182.diff is much clearer in my opinion. The ASF GitHub Bot wording mentions 182.patch but not 182.diff. Is what it mentions configurable? Might a case be made for also including a 182.diff link?

        Dawid Weiss added a comment - edited

        The .diff is a (consolidated) unified diff from the branch your pull request goes into.

        The .patch contains all commits (diffs for each individual commit) so that you could recreate the full history of your pull request if you wanted to do it from command-line, for example.

        ASF subversion and git services added a comment -

        Commit 1689768 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1689768 ]

        LUCENE-6563: Improve MockFileSystemTestCase.testURI to check if a path can be encoded according to local filesystem requirements. Otherwise stop test execution

        ASF subversion and git services added a comment -

        Commit 1689769 from Uwe Schindler in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1689769 ]

        Merged revision(s) 1689768 from lucene/dev/trunk:
        LUCENE-6563: Improve MockFileSystemTestCase.testURI to check if a path can be encoded according to local filesystem requirements. Otherwise stop test execution

        Uwe Schindler added a comment -

        Thanks Christine!

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        ASF GitHub Bot added a comment -

        Github user cpoerschke closed the pull request at:

        https://github.com/apache/lucene-solr/pull/152

        ASF GitHub Bot added a comment -

        Github user cpoerschke closed the pull request at:

        https://github.com/apache/lucene-solr/pull/182


          People

          • Assignee:
            Uwe Schindler
          • Reporter:
            Christine Poerschke
          • Votes: 0
          • Watchers: 7