Harmony
  1. Harmony
  2. HARMONY-6290

BufferedReader.readLine() breaks at EBCDIC newline, violating the spec

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0M11
    • Component/s: Classlib
    • Labels:
      None
    • Environment:
      SVN Revision: 800827
    • Patch Info:
      Patch Available
    • Estimated Complexity:
      Moderate

      Description

      The spec says that BufferedReader.readLine() considers only "\r", "\n" and "\r\n" to be line separators. We must not permit additional separator characters. I admit that the RI's behaviour is surprising, and incompatible with it's own Pattern and Scanner classes. But this is the specified behaviour; the doc explicitly calls out which character sequences are used as newlines. It does not permit additional characters to break lines.

      For users reading EBCDIC-encoded files, a better practice is to read through the files using a Scanner. That way, the application will behave the same when executed on either Harmony or on the RI.

      #Android

        Activity

        Hide
        Jesse Wilson added a comment -

        Test case and fix. Also adding some extra tests for newline splitting - currently this coverage is lacking.

        Show
        Jesse Wilson added a comment - Test case and fix. Also adding some extra tests for newline splitting - currently this coverage is lacking.
        Hide
        Nathan Beyer added a comment -

        Anyone know what IBM's JRE does in this case?

        Show
        Nathan Beyer added a comment - Anyone know what IBM's JRE does in this case?
        Hide
        Nathan Beyer added a comment -

        I suspect it is in the IBM JREs, as it was specifically added by Oliver, which I presume is for support on zOS.

        https://svn.apache.org/viewvc?view=rev&revision=593201
        https://svn.apache.org/viewvc?view=rev&revision=593604

        Show
        Nathan Beyer added a comment - I suspect it is in the IBM JREs, as it was specifically added by Oliver, which I presume is for support on zOS. https://svn.apache.org/viewvc?view=rev&revision=593201 https://svn.apache.org/viewvc?view=rev&revision=593604
        Hide
        Oliver Deakin added a comment -

        Hi Jesse/Nathan,

        Running the following test:

        import java.io.*;
        class FileTest {
        public static void main(String[] args) throws Exception

        { BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("test.txt"), "IBM-1047")); System.out.println(br.readLine()); }

        }

        against an EBCDIC test.txt file containing:
        Hello<NEL>
        World<NEL>
        <EOF>

        on Windows produces the same output from Harmony, IBM and the RI:

        Hello

        In other words, even when we are not on zOS platforms the RI treats the NEL character as a newline character when reading a file in EBCDIC. If we remove the encoding specified to InputStreamReader, so we revert to reading the file in the native encoding for the Windows platform, then again all 3 jdks have matching behaviour. When we read the EBCDIC file in the native Windows encoding the NEL hex value (0x15) is not mapped to the unicode NEL character (\u0085) and is just treated as a normal character.

        So it appears that our code currently behaves the same as the RI, even if the spec does not mention this special case for the EBCDIC character set. I'm not sure if there is any way for a character to get mapped to the NEL unicode character (\u0085) when we are not working in EBCDIC, so it may be the case that the code we have right now has the correct logic.

        We could err on the safe side and add an encoding check (check we are reading in EBCDIC and then check if the character is \u0085) but since we already seem to match the RI behaviour I'm not sure if that is necessary. What are your thoughts?

        Regards,
        Oliver

        Show
        Oliver Deakin added a comment - Hi Jesse/Nathan, Running the following test: import java.io.*; class FileTest { public static void main(String[] args) throws Exception { BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("test.txt"), "IBM-1047")); System.out.println(br.readLine()); } } against an EBCDIC test.txt file containing: Hello<NEL> World<NEL> <EOF> on Windows produces the same output from Harmony, IBM and the RI: Hello In other words, even when we are not on zOS platforms the RI treats the NEL character as a newline character when reading a file in EBCDIC. If we remove the encoding specified to InputStreamReader, so we revert to reading the file in the native encoding for the Windows platform, then again all 3 jdks have matching behaviour. When we read the EBCDIC file in the native Windows encoding the NEL hex value (0x15) is not mapped to the unicode NEL character (\u0085) and is just treated as a normal character. So it appears that our code currently behaves the same as the RI, even if the spec does not mention this special case for the EBCDIC character set. I'm not sure if there is any way for a character to get mapped to the NEL unicode character (\u0085) when we are not working in EBCDIC, so it may be the case that the code we have right now has the correct logic. We could err on the safe side and add an encoding check (check we are reading in EBCDIC and then check if the character is \u0085) but since we already seem to match the RI behaviour I'm not sure if that is necessary. What are your thoughts? Regards, Oliver
        Hide
        Jesse Wilson added a comment -

        Oliver, I don't think your example shows us anything. The RI may read the entire file as a single line, and println the string "Hello<NEL>World", which may print up as two lines in whichever terminal emulator you're using.

        There are two opportunities for the NEL character to be interpretted by Java:
        1 - when the file is converted from a stream of bytes into a stream of chars by the InputStreamReader (which knows about the EBCDIC encoding)
        2 - when the stream of characters is converted to a stream of lines by the BufferedReader (which does not know any encoding)

        I'm not at all interested in #1. Support for converting streams of bytes into streams of characters works just fine. Perhaps the EBCDIC decoder should decode the byte 0x85 into the Java newline character, "\n".

        But according to the spec, using a BufferedReader to read through the String "Hello<NEL>World" should yield a single line. Please see the attached test case which demonstrates the inconsistency between Hy and the RI.

        Show
        Jesse Wilson added a comment - Oliver, I don't think your example shows us anything. The RI may read the entire file as a single line, and println the string "Hello<NEL>World", which may print up as two lines in whichever terminal emulator you're using. There are two opportunities for the NEL character to be interpretted by Java: 1 - when the file is converted from a stream of bytes into a stream of chars by the InputStreamReader (which knows about the EBCDIC encoding) 2 - when the stream of characters is converted to a stream of lines by the BufferedReader (which does not know any encoding) I'm not at all interested in #1. Support for converting streams of bytes into streams of characters works just fine. Perhaps the EBCDIC decoder should decode the byte 0x85 into the Java newline character, "\n". But according to the spec, using a BufferedReader to read through the String "Hello<NEL>World" should yield a single line. Please see the attached test case which demonstrates the inconsistency between Hy and the RI.
        Hide
        Oliver Deakin added a comment -

        Ok, I see where you're coming from now - I altered the test to read the file byte by byte rather than a line at a time, and BufferedReader is getting an LF character (0x0A) in place of NEL from the InputStreamReader on the RI, but not on Harmony. This behaviour is the same on both zOS and Windows, so I think you are correct in saying that the decoder for EBCDIC should map NEL to LF.

        I will take a look at InputStreamReader and see if I can fix the behaviour there. For now I will revert my original commit and add your test case to the suite.

        Show
        Oliver Deakin added a comment - Ok, I see where you're coming from now - I altered the test to read the file byte by byte rather than a line at a time, and BufferedReader is getting an LF character (0x0A) in place of NEL from the InputStreamReader on the RI, but not on Harmony. This behaviour is the same on both zOS and Windows, so I think you are correct in saying that the decoder for EBCDIC should map NEL to LF. I will take a look at InputStreamReader and see if I can fix the behaviour there. For now I will revert my original commit and add your test case to the suite.
        Hide
        Oliver Deakin added a comment -

        Fix and test case applied with minor change at repo revision r801230 - please check it applied as expected.

        Show
        Oliver Deakin added a comment - Fix and test case applied with minor change at repo revision r801230 - please check it applied as expected.
        Hide
        Jesse Wilson added a comment -

        Looks Good To Me. Thanks Oliver.

        Show
        Jesse Wilson added a comment - Looks Good To Me. Thanks Oliver.
        Hide
        Oliver Deakin added a comment -

        Verified - thanks Jesse.

        Show
        Oliver Deakin added a comment - Verified - thanks Jesse.

          People

          • Assignee:
            Oliver Deakin
            Reporter:
            Jesse Wilson
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 20m
              20m
              Remaining:
              Remaining Estimate - 20m
              20m
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development