Log4j 2 / LOG4J2-255

Multi-byte character strings are scrambled in log output

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta6
    • Fix Version/s: 2.0-beta7
    • Component/s: Appenders, Core
    • Labels: None

      Description

      When I tried to log a Japanese string the output was scrambled in both the Console and a log file.

      For example,
      logger.warn("日本語テスト"); // (Japanese test)

      came out as
      15:07:00.184 [main] WARN test.JapaneseTest - 譌・譛ャ隱槭ユ繧ケ繝?

      This is the log4j2.xml configuration:

      <?xml version="1.0" encoding="UTF-8"?>
      <configuration status="warn">
          <appenders>
              <Console name="Console" target="SYSTEM_OUT">
                  <PatternLayout>
                      <pattern>%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n
                      </pattern>
                  </PatternLayout>
              </Console>
              <File name="tracelog" fileName="trace-log.txt" immediateFlush="true" append="false">
                  <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
              </File>
          </appenders>
          
          <loggers>
              <root level="trace">
                  <appender-ref ref="Console"/>
                  <appender-ref ref="tracelog"/>
              </root>
          </loggers>
      </configuration>
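
      The garbled output is consistent with UTF-8 bytes being decoded in the platform encoding of a Japanese Windows machine (MS932). A minimal sketch that reproduces the effect outside of Log4j, assuming the JVM provides the MS932 alias:

      import java.nio.charset.Charset;

      public class MojibakeDemo {
          public static void main(String[] args) {
              String original = "日本語テスト";
              // The layout defaulted to UTF-8 when turning the message into bytes...
              byte[] utf8Bytes = original.getBytes(Charset.forName("UTF-8"));
              // ...but the console decoded those bytes in the platform encoding:
              String scrambled = new String(utf8Bytes, Charset.forName("MS932"));
              System.out.println(scrambled); // mojibake like the output shown above
          }
      }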
      


          Activity

          Remko Popma added a comment - edited

          Analysis:
          If no character set is specified in the PatternLayout, UTF-8 is used.
          However, in my experience programmers save their source code in the platform encoding.
          The default should be the platform encoding, not UTF-8.

          Fix: in o.a.l.l.c.helpers.Charsets#getSupportedCharset (line 48),
          replace
          charset = UTF_8;
          with
          charset = Charset.defaultCharset();
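
           In context, the method would then look something like this (a sketch; the actual body of Charsets.getSupportedCharset may differ):

           public static Charset getSupportedCharset(final String charsetName) {
               Charset charset = null;
               if (charsetName != null && Charset.isSupported(charsetName)) {
                   charset = Charset.forName(charsetName);
               }
               if (charset == null) {
                   charset = Charset.defaultCharset(); // was: charset = UTF_8;
               }
               return charset;
           }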

          I'll commit this when I get home tonight.

          Remko Popma added a comment -

          Fixed in revision 1482944.

          This is an important one for Asian users.

          Gary Gregory added a comment -

          The fix is not right IMO.

          The goal (unstated, I suppose, so it's just in my head) of getSupportedCharset is to return a known charset, the one that is passed in, or ... UTF-8 (IMO). UTF-8 is known, the platform default could be anything.

          The change "charset = Charset.defaultCharset()" just happened to make the test work because YOUR platform encoding supports JP chars, this will not always be the case, so the test will fail for some developers.

          This is a case where the test should be configured to write to a file (and Console) with a JP charset.

          Remko Popma added a comment -

          Gary, thanks for reviewing the changes!

          I did not fully understand your comments, let me know if I interpreted them correctly:
          1) Charsets.getSupportedCharset should not return the platform default encoding, but UTF-8, because then the method will return a constant value.
          2) The JUnit tests I wrote will fail for some developers
          3) The JUnit tests are not complete, an additional test is needed that actually writes to a file

          Let me reply to these one by one.

          1)
           I don't know about the spec for this method, but going on current usage,
           getSupportedCharset is used by all Layouts to (a) validate a specified encoding (replacing it if unsupported), or (b) provide a default encoding if the user did not specify one in the layout configuration. I guess (b) is most common.

          The key point is that to prevent scrambled messages in the log file, the encoding of the source code with the call to Logger.log must match the encoding used to write the message to the log file. If getSupportedCharset always returns a constant UTF-8 then the log file will always contain scrambled messages unless the user saves their source code in UTF-8.

          Most developers save their source code in the platform encoding. Eclipse and NetBeans save source code in the platform encoding by default (sorry, I don't know about IntelliJ). Hence the Layouts should use the platform encoding when converting chars to bytes (unless an encoding is specified in the config).

          2)
           I believe this is a misunderstanding. KOI8-R is an encoding for Russian, not JP. It is actually part of the Basic Encodings and included in lib/rt.jar.
          This test should work for everyone.

          3)
          You are probably right.

          Gary Gregory added a comment -

          I think I might have figured it out; the bottom line is that you should really specify an encoding for your appenders. Neither UTF-8 nor the platform default is right, but the platform default is likely worse.

           1) Let me walk us through it: You are saying that if I encode my .java source files (I will leave aside the scenario where messages are in an external file like a .properties file) in encoding X, then I should set the encoding for my appenders to match X. However, I can use Unicode escapes in a Java String to create any kind of Unicode String, no matter what encoding I am using in my .java source file. Because Java Strings are Unicode strings, it does not matter what the encoding of the source file is; the compiler and JVM use Unicode strings. If you use an encoding in your .java files that is not the platform encoding, then you have to tell the compiler about the encoding in order for it to read the source bytes correctly and convert them to String objects. As we all know, if I have a Java String and I want bytes, I need to use an encoding to convert the String to bytes. Therefore, the encoding of the source file is irrelevant. What matters is that the JVM has a Unicode String object and we need to give it an encoding to get bytes to write somewhere.
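
           A quick illustration of that point; both declarations produce the same Unicode String at runtime, and the second compiles identically no matter what encoding the .java file is saved in:

           String viaLiteral = "日本語";
           String viaEscapes = "\u65E5\u672C\u8A9E";
           System.out.println(viaLiteral.equals(viaEscapes)); // true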

          So, back to the Russians: If the JVM has a (Unicode) String with Cyrillic characters, I had better give it an encoding that knows what to do with these characters; ASCII will not do for example. If I rely on the platform encoding, who knows what I will get. On Windows for example, there are no Cyrillic characters in the default encoding Cp1252. If UTF-8 is the default – recall that UTF stands for Unicode Transformation Format – I should get better results.

          The bottom line is that for predictable results, I should always use an encoding in my configuration and not rely on the platform encoding. If I do not specify and default to the platform encoding, sometimes I will get expected results, sometimes I will get junk and other times different kinds of junk, all depending on the platform. On the other hand, if I do not specify and default to UTF-8, I’m likely to get better results and if I do get junk, it will be the same junk on all platforms.

          2) The only encodings you can count on are the six listed in http://docs.oracle.com/javase/6/docs/api/java/nio/charset/Charset.html. In practice, different JRE and JDK implementations provide many additional encodings, but these are not required. Therefore, in order to write portable tests, we should not expect them to be there, we should only rely on the required six. See my changes to CharsetTests.java.
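
           Those six are US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16. A portable test can rely on exactly this set:

           import java.nio.charset.Charset;

           public class RequiredCharsets {
               public static void main(String[] args) {
                   // Every Java platform is required to support these six charsets:
                   String[] required = { "US-ASCII", "ISO-8859-1", "UTF-8",
                                         "UTF-16BE", "UTF-16LE", "UTF-16" };
                   for (String name : required) {
                       System.out.println(name + ": " + Charset.isSupported(name)); // always true
                   }
               }
           }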
           Please help me find holes in this or support it.

          Remko Popma added a comment -

           1. I have to eat humble pie here and admit you are completely right about the source file encoding being irrelevant. The JVM has a Unicode string in memory. I think I confused myself when I was doing some tests with changing the editor encoding on a source file with Japanese in Eclipse. My bad.

          I understand your point that using UTF-8 as the default would be predictable. This is especially useful when a log file is read in an environment with a different platform encoding than where the log file was written.

          It just does not seem right that if I log a Japanese string to the Console in my Japanese environment it comes out scrambled.
          Let me check how log4j-1.2 and logback handle this.

          2. I was looking at http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html but I agree these are only for the Oracle JVM. Thanks for the pointer. I'll remove the Russian encoding from the JUnit test.

          Nick Williams added a comment -

           Yes, Gary is completely right about Java's strings always being Unicode.

           I'm a bit confused here. My (admittedly limited) understanding of character sets was that UTF-8 takes care of everything. English, Cyrillic, Korean, Japanese, etc. can all be properly represented in UTF-8. That's why I'm a bit uncertain about why we can't always use UTF-8 for everything. The only exception I can think of is /reading/ files, which would have been created by something else (text editor, other program) and need to have their encoding detected/specified. But why wouldn't UTF-8 work for everything that Log4j writes/transmits?

          Remko Popma added a comment -

          If the default for writing is UTF-8, then if I log a Japanese string to the Console in my Japanese environment it comes out scrambled, because the Console is in the platform default encoding.

          (Same with reading log files written in UTF-8, unless I tell the editor/viewer explicitly to use UTF-8. Most Japanese editors use the platform encoding by default, and often have an option to switch to another encoding.)

          Ralph Goers added a comment -

           Nick, while UTF-8 is capable of representing characters in many languages, most computers don't display characters on the screen in Unicode. They use what IBM calls code pages. For example, Gary mentioned cp 1252 - cp stands for code page. http://en.wikipedia.org/wiki/Code_page gives a simple explanation of what they are. So the problem is that although you may have data in Unicode, to display it on the screen so that it is viewable it must be converted to the proper code page. Since Strings in Java are always UTF-8, when you call getBytes() on the string, passing in a charset allows Java to convert the UTF-8 into the target code page, provided that the OS has the definition for the code page installed. This is why Layouts accept a charset parameter. The charset is Java's name for a code page.

           What I don't understand here is that if Remko is generating logs in UTF-8 that contain Japanese characters and is specifying the proper Japanese code page for the host computer, why is it generating unreadable stuff? If no charset is specified then it is perfectly understandable why this would be happening.

           Note that this is actually the proper way to perform internationalization/localization - the Strings should be manipulated in UTF-8 and passed from client to server that way and only converted to the target code page when they are actually displayed.
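
           To make the getBytes() conversion concrete, a small sketch (availability of charsets beyond the required six depends on the JRE):

           import java.nio.charset.Charset;

           String s = "Привет"; // a Unicode String holding Cyrillic characters
           // A Cyrillic-capable code page preserves the characters:
           byte[] koi8 = s.getBytes(Charset.forName("KOI8-R"));
           // Cp1252 cannot represent them; getBytes() substitutes '?' (0x3F):
           byte[] cp1252 = s.getBytes(Charset.forName("windows-1252"));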

          Remko Popma added a comment -

          Ralph, when you say "Remko ... is specifying the proper Japanese code page for the host computer", what do you mean?
          Do you mean specifying a charset in the Layout section of the log4j2.xml?

          Ralph Goers added a comment -

          Yes. Are you specifying the target code page on your Layout? If not, Charsets.getSupportedCharset() will return Charset.defaultCharset(), which may or may not be what you want. You would probably need to write a simple test case to see what it returns (or stop it in the debugger).

          Remko Popma added a comment - edited

          Ralph, ok. I understand what you mean now.

          The configuration I used does not specify any charset encoding. This gave the problem described in the summary.

          I fixed this in trunk, so now Charsets.getSupportedCharset() will return Charset.defaultCharset().
          This fixes the issue and logging a Japanese string to the console is no longer scrambled.

          My earlier comment referred to Gary's suggestion that the fix is wrong and Charsets.getSupportedCharset() should return "UTF-8".
          In that case logging a Japanese string to the console will give scrambled output by default.

          It is true that users could fix that by putting their platform encoding in the Layout configuration, but it is my opinion that logging to the console should just work out of the box.

          Ralph Goers added a comment -

          FWIW - Log4j 1.x uses a WriterAppender as the base for most String based appenders that write to an OutputStream. You specify the charset on the Appender and the OutputStreamWriter is passed the encoding. It will convert the Strings it is passed into the target charset. Log4j 2 just moves the encoding into the Layouts so that Appenders can handle any kind of data.
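
           Roughly, the 1.x mechanism looks like this (a sketch, not the actual WriterAppender code):

           import java.io.FileOutputStream;
           import java.io.OutputStreamWriter;
           import java.io.Writer;

           Writer w = new OutputStreamWriter(new FileOutputStream("app.log"), "Shift_JIS");
           w.write("日本語テスト"); // the writer converts the String to Shift_JIS bytes
           w.close();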

          Remko Popma added a comment -

          I see. What encoding does Log4j 1.x use if no charset is specified on the Appender?

          Ralph Goers added a comment -

          Yes - I think Gary's example has one thing incorrect. I would expect that a Windows computer in Russia would have been localized to that environment and would have a Cyrillic code page as the default for the OS. That is why Windows and most other operating systems ask where you are during the OS installation. I seem to recall that the internationalized version of Windows also gave the option to install additional code pages - but I may be remembering incorrectly since it has been a while since I last installed Windows.

          Ralph Goers added a comment -

           As for Log4j 1.x - if the encoding is null, it will use the system default.

          Gary Gregory added a comment -

           I do not understand how using UTF-8 scrambles the text. This is an issue that needs explaining; could there be a different bug?

           Here is where using the platform default breaks when no charset is defined in the config: you set up an app on your JP system and you get some logging. You give me your config and I run it on my US Windows system and it comes out scrambled. If the config specifies the charset, all output on all systems is OK.

           If no charset is set in the config, I claim that if UTF-8 is used in the right places, all should be well. But it sounds like Remko is seeing otherwise. So either I do not understand how UTF-8 works, there is a bug, or we are missing some info from Remko. Arg.

           Gary

          Remko Popma added a comment - edited

          Gary, you are describing the different use case of sharing log files between environments that use different encodings.
          Fair enough, you need to tell the recipient what encoding the file is in.

          In the use case I am describing, writing and reading is happening on the same machine.
          If you write UTF-8 bytes to the Console, the Console will interpret these bytes as being in the platform encoding.

          This works for English text because the platform encoding (I assume US-ASCII) is a subset of UTF-8, so you cannot test this on your machine.
          But for Japanese text, if you take UTF-8 bytes for "日本語" and you interpret these bytes as MS932 (the platform encoding for Japanese Windows), you end up with scrambled text.

          Ralph Goers added a comment - edited

           Gary, that is absolutely incorrect. If I specify characters in the Japanese encoding range of UTF-8 and you are running on a non-Japanese computer, odds are you are going to end up with garbage. Now what you are saying would be true on my Mac because it has a LANG setting of en_US.UTF-8 - so the display IS apparently actually capable of displaying all the characters. However, note that in this case the default encoding would also be correct. But if my computer was set to en_US.cp1252 the default encoding wouldn't work correctly, but neither would the configured code page if I don't have that installed on my machine, and if my computer screen can't render UTF-8 then there is no way to display any Japanese characters.

          Nick Williams added a comment -

          Okay, I think I understand all of this better. The 100% correct solution that will work all of the time is to change all of the computers in the world to have a default UTF-8 platform encoding. Too bad we don't have the power to do that...

          Here's what I think should be happening:

           Internally, absolutely everything should be handled in UTF-8 for consistency's sake. However, when dealing with external resources:

          • Data transmitted over the wire or interprocess (such as net, flume, etc.) should use UTF-8 exclusively.
          • XML written to a file or other non-network output stream should use UTF-8 exclusively.
           • Data read from files or other non-network input streams should detect the file encoding (is this possible? do we have to just rely on the platform default here?) and read in that file encoding, converting to Unicode upon reading (which should happen automatically, since all Strings in Java are Unicode). My understanding of XML is that you SHOULD always encode it in a Unicode variant such as UTF-8, UTF-16, etc., but not everybody does.
          • Data written to files or other output streams (including the Console) should use the platform default encoding if no explicit encoding is specified. Every AbstractStringLayout should provide a way to specify an encoding that overrides the platform default encoding. AbstractStringLayout already does this by having a mandatory constructor that takes a Charset. However, it doesn't account for the possibility that it is constructed with a null Charset. IMO, it should be setting the Charset to the platform default if it's constructed with a null Charset. Furthermore, every class that extends AbstractStringLayout should use this Charset /except/ XMLLayout, which should ALWAYS use UTF-8. The `@PluginAttr("charset") String charsetName` parameter for XMLLayout#createLayout should be removed, the `Charset charset` parameter for XMLLayout#XMLLayout should be removed, and UTF-8 should be hardcoded as the value for super(). (In fact, right now the XMLLayout is broken, because it accepts a user-supplied Charset but the header is hard-coded to <?xml version="1.0" encoding="UTF-8"?>.)

          (Side note: Strings in Java are Unicode, not UTF-8. Some of the people commenting here have used these terms interchangeably, but they are not interchangeable. Unicode is the system of assigning decimal numbers to characters. UTF-8, UTF-16, UTF-32, etc. are different systems for interpreting bytes as these decimal, Unicode numbers. http://stackoverflow.com/questions/643694/utf-8-vs-unicode)
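
           On the AbstractStringLayout point above, a sketch of the null-charset fallback (constructor shape and field name assumed, not the actual source):

           // Fall back to the platform default when the configuration supplied no charset:
           protected AbstractStringLayout(final Charset charset) {
               this.charset = (charset == null) ? Charset.defaultCharset() : charset;
           }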

          Remko Popma added a comment - edited

          Nick, nice find on the header being hard-coded to UTF-8 in XMLLayout!

          I agree that the default encoding for XMLLayout should be UTF-8 if the user did not specify a charset in the config, but I would prefer to fix the hard-coded header to use the specified charset instead. I think it is ok to use user-specified encodings for XML (http://www.w3schools.com/xml/xml_encoding.asp) as long as the header correctly reflects that.

          Otherwise what you're saying sounds reasonable.

          This made me think: perhaps the default encoding depends on the layout?

          • UTF-8 for XMLLayout, RFC5424Layout (actually RFC5424 seems to require UTF-8 or US-ASCII)
          • Platform default for BasicLayout, PatternLayout

          Not sure about HTMLLayout (leaning towards UTF-8) and SyslogLayout (no clue... What software would be used to view these log records?).

          Regardless of what default we choose for HTMLLayout, the header is currently missing the encoding name.
          We should add something like this between the <head> and the <title> elements:
          <meta http-equiv="Content-Type" content="text/html; charset=XXXXXX"> <-- getCharset().name()
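
           A sketch of building that header line from the layout's configured charset (the surrounding HTMLLayout code is assumed; only getCharset() is from the discussion above):

           StringBuilder head = new StringBuilder();
           head.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
               .append(getCharset().name())
               .append("\">");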

          Remko Popma added a comment - edited

          Yay! Fancy wiki formatting!! Yippee! Awesome!

          Nick Williams added a comment -

          I like that plan:

          • XMLLayout and HTMLLayout default to UTF-8 if none specified, but allow customization
           • RFC5424 hard-coded to UTF-8 and does not allow customization (US-ASCII cannot represent Japanese and other characters, therefore we should just stick with UTF-8 here to avoid any possibility of problems)
          • Other AbstractStringLayouts default to platform encoding if none specified

          The SyslogLayout confuses me. RFC5424 /is/ "The Syslog Protocol," so why would we need both Layouts? Either way, it appears that Syslog, being RFC5424, requires either UTF-8 or US-ASCII, so I think it also should be hard-coded to UTF-8.

          Nick Williams added a comment -

          It's about time! Finally I can <code> when I need to.

          Gary Gregory added a comment -

           What encoding should I use in the layout for the example in the description? The obvious ones I've tried give me junk output (Shift_JIS, MS932).

          Remko Popma added a comment -

          You mean, in order to reproduce the issue?
          Perhaps you have reproduced the issue! What is your platform encoding and where are you seeing the garbage? Console, or text editor/viewer?

          Gary Gregory added a comment - edited

           I want to see the correct chars on Windows, so I want to plug in the right charset. I know that using Cp1252 (my default) will yield junk.

          My misunderstanding was that I could use UTF-8 and have the right thing happen, which it does not, because... the Windows CMD console always displays in the platform encoding? So how can I see the right characters on Windows in the console?

          Gary Gregory added a comment -

           "Data read from files or other non-network input streams should detect the file encoding (is this possible? do we have to just rely on the platform default here?)"

          XML parsers can auto-detect file encodings because the XML standard defines Byte Order Mark (BOM) bytes and the xml processing instruction with its encoding attribute.

          For all other kinds of files, you have to know the encoding.

          Remko Popma added a comment - edited

          Gary, I actually think you are hitting the same problem that made me file this JIRA.
          Windows console always displays the platform encoding and cannot display UTF-8 bytes for multi-byte characters.

          To answer your question how you can see the right characters on Windows in the console, I don't see any other way than to change your platform encoding. http://www.emeditor.com/help/glossary/systemdefaultencoding.htm
           I tried this and changed the language setting to US English. After the required reboot, the console used a different font and was unable to display Japanese. You can try changing your setting to Japanese and reproducing the issue again, specifying one of these encodings as the Layout charset: UTF-8 (I suspect this won't work), and Shift_JIS or MS932 (MS932 should work if your machine has that code page and you successfully switched to Japanese).

          Gary Gregory added a comment -

          A couple of data points for testing and the record.

          • If I set the file appender charset to Shift-JIS, the file has the correct data when I view it in Notepad++ and Eclipse with the encoding to view the file also set to Shift-JIS. Using UTF-8 in the viewer looks like junk.
          • If I set the file appender charset to UTF-8, the file has the correct data when I view it in Eclipse with the encoding to view the file also set to UTF-8. Using Shift-JIS in the viewer looks like junk.

          The above is good, no bug there.

          Windows lets you change code pages in a CMD console with the chcp command. For me though, chcp 932 does not work, where 932 is the Windows code page number for Shift-JIS.

           Another tip is that you have to use a font in the CMD console, such as Lucida Console, to display certain chars.

          I tried to output and view UTF-8 on the CMD console without success. I tried, chcp 65001, no go. I tried cmd /U, no go.

          Remko Popma added a comment -

          Looks like this was a fruitful discussion: 5 new JIRAs, 4 already fixed (thanks, Gary!)

          I did another test and I am satisfied that the current solution (using platform default in BasicLayout, PatternLayout if no charset was specified in configuration) is correct for Japanese users (and Chinese users, based on conversations with Chinese colleagues).

          Gary, do you still have doubts?
          If testing on your PC is a concern, you can give me commands to execute on my Japanese environment and I can give you back the results.

          Gary Gregory added a comment -

           I think I am resigned to the fact that we have to let the platform default kick in when the charset is not specified. This makes it simpler to use.

          A best-practice, which we could document if we agree, is to always specify a charset in your configs. This could be enforced by making the charset required but that seems overzealous.
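
           For example, something along these lines in the configuration from the description (PatternLayout accepts a charset attribute):

           <File name="tracelog" fileName="trace-log.txt" append="false">
               <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"
                              charset="UTF-8"/>
           </File>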

           We do have write-once, run anywhere, but differently.

          Remko Popma added a comment -

          That is great, thanks for the confirmation!

          I think specifying a charset is especially valuable when sharing log files between different environments.
           When I researched whether log4j-1.x and logback had any encoding issues, I only found questions for that scenario.
          So this is definitely worth mentioning in the FAQ page.

          I think for normal usage not specifying a charset is fine. All Japanese and Chinese users I've talked to about this said they never used the option to specify a charset in log4j-1.x, but also never experienced any problems.

          That said, I'm not opposed to using a charset in a few examples to show that this is possible, as long as we avoid giving the impression that this is mandatory now with log4j-2.0.


  People

  • Assignee: Remko Popma
  • Reporter: Remko Popma
  • Votes: 0
  • Watchers: 4