Axis / AXIS-2342

Reopen issue: Character entities are escaped too aggressively

    Details

    • Type: Bug
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All

      Description

      We are using SOAP to send XML documents from client to server and back. The
      documents contain a lot of non-ASCII data, which we encode as UTF-8.
      However, when the documents are retrieved from an Axis server, Axis escapes almost all of our
      characters into character entities (so &#... This means messages become about
      three times as big as they have to be for 'international' documents, which for us
      is a large performance problem. I narrowed the problem down to
      XMLUtils::xmlEncodeString,
      which has the code:
      if (((int)chars[i]) > 127) {
          strBuf.append("&#");
          strBuf.append((int)chars[i]);
          strBuf.append(";");
      }
      This seems unnecessary to me, as Axis will send all messages in UTF-8 anyway,
      for which no encoding is necessary (and should encoding be configurable, I feel
      this should be escaped elsewhere).

      Is there any reason for this code? I commented it out and it seemed to have no
      adverse effect on our application (apart from reduced network traffic).

      Tested with 1.0, also looked up in the sources of 1.1-rc2.
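
      The size claim above can be checked with a small, self-contained sketch. It is not Axis code: the class EntityOverhead and its methods are hypothetical names, and escapeNonAscii only imitates the quoted loop (every char above 127 becomes a decimal character reference). Under that assumption, it compares the byte count of raw UTF-8 with the byte count of the escaped form for three German umlauts.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Axis.
final class EntityOverhead {

    // Imitates the escaping quoted in the report: chars above 127
    // are replaced by decimal character references such as &#228;
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Number of bytes the string occupies on the wire when sent as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u00e4\u00f6\u00fc"; // the three characters a-umlaut, o-umlaut, u-umlaut
        System.out.println(utf8Length(text));                 // prints 6
        System.out.println(escapeNonAscii(text));             // prints &#228;&#246;&#252;
        System.out.println(utf8Length(escapeNonAscii(text))); // prints 18
    }
}
```

      For this input the escaped form is three times the size of the raw UTF-8 bytes, which is consistent with the factor the reporter mentions. The exact ratio depends on the characters involved.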

      1. TEST_2342.diff (5 kB, Rodrigo Ruiz)
      2. AXIS_2342.diff (3 kB, Rodrigo Ruiz)
      3. PATCH_2342.txt (2 kB, Christian Müller)
      4. TESTCASE_2342.txt (3 kB, Christian Müller)

          Activity

          Rajani Gundimeda added a comment -

          We were facing this issue in our product; below is the workaround that I found.
          I hope this helps. Let me know if there is anything better that I can do.

          We have a web service operation where the response parameter is an XML string (data type xs:string).
          When the response XML string contains a multi-byte character, Axis escapes it as a sequence of separate hex values.

          For example:
          1) The response contains the Japanese character 功, whose UTF-8 bytes are E5 8A 9F.
          2) I checked the Java code just before we return the response string, and its hex byte representation was E5 8A 9F.
          3) But Axis 1.3 (its RPCProvider) was escaping the 功 character to E5 160 178, i.e. 功 (which was not expected).
          4) When I set the Java VM argument -Dfile.encoding=UTF-8, Axis escaped the character to 0x529F, which was not expected either.
          5) But when I set the Java VM argument -Dfile.encoding=ISO-8859-1 and started the server, Axis escaped the 功 character to E5 8A 9F, i.e. 功

          I used -Dfile.encoding=ISO-8859-1 as a workaround.

          But I wonder why the UTF-8 setting didn't work.
          Is there anything else I can do or configure to make Axis support, or escape the bytes to, UTF-8 encoding only?

          For the above testing I used: JBoss, Axis 1.3, Windows 7 (the default Windows code page was Cp1252).

          Thanks
          Rajani
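
          One plausible reading of the three results above is that the UTF-8 bytes E5 8A 9F are turned into a Java String using the JVM default charset at some point before Axis escapes each char. That is an assumption, not something the comment confirms. The sketch below (the class DefaultCharsetDemo is a hypothetical name) decodes the same three bytes with each charset mentioned and prints the resulting code units in hex.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Axis.
final class DefaultCharsetDemo {

    // Decodes the bytes with the given charset and lists the resulting
    // UTF-16 code units as upper-case hex, separated by spaces.
    static String codeUnits(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < decoded.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(decoded.charAt(i)).toUpperCase());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+529F, the character from the comment above.
        byte[] utf8 = {(byte) 0xE5, (byte) 0x8A, (byte) 0x9F};

        System.out.println(codeUnits(utf8, Charset.forName("windows-1252"))); // prints E5 160 178
        System.out.println(codeUnits(utf8, StandardCharsets.UTF_8));          // prints 529F
        System.out.println(codeUnits(utf8, StandardCharsets.ISO_8859_1));     // prints E5 8A 9F
    }
}
```

          The three outputs match the values reported for Cp1252, UTF-8 and ISO-8859-1. If this reading is right, 0x529F is simply the code point of 功, so the UTF-8 setting decodes the character correctly and Axis then escapes it as a single reference, while the ISO-8859-1 setting yields one escaped char per original byte.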

          Arseny S added a comment -

          This is a major problem for our project: we need to send Russian text through Axis to Axis/C and .NET.
          How can this encoding be called UTF8 if it encodes all symbols after 0x7f in a special way?? It should be called HTML encoding then.
          The right way of encoding is shown in the previous patch.

          For now we are searching for some kind of workaround, maybe using some generic type with our own Serializer/Deserializer classes.

          Wilbert Pol added a comment -

          We ran into this issue with Axis 1.4 in a hybrid Java/Perl/.NET environment trying to communicate a euro sign (Unicode U+20AC, UTF-8 E2 82 AC). The Axis 1.4 service advertised itself as outputting UTF-8, but the euro sign got encoded as the character reference &#x20AC;, which imo looks more like a dirty hack.

          What actually helped was removing all the special encoding code from the default case in the writeEncoded method in org.apache.axis.components.encoding.UTF8Encoder. This made Axis output a nice UTF-8 euro sign. It looks like there's some final encoding going on at a higher level in Axis, but I didn't bother to look into it further.

          The relevant section of UTF8Encoder becomes:

          case '\t':
              writer.write(TAB);
              break;
          default:
              if (character < 0x20) {
                  throw new IllegalArgumentException(Messages.getMessage(
                          "invalidXmlCharacter00",
                          Integer.toHexString(character),
                          xmlString.substring(0, i)));
              } else {
                  writer.write(character);
              }
              break;
          }
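
          The idea in the snippet above can be tried outside Axis with a minimal sketch. MinimalXmlEscaper, writeEscaped and toUtf8 are hypothetical names, and the set of escaped characters is an assumption chosen for element text. It escapes only markup characters, rejects illegal control characters, and leaves every other character to the Writer, so the byte encoding is decided by the OutputStreamWriter alone.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an encoder's writeEncoded(); not Axis source code.
final class MinimalXmlEscaper {

    // Escapes only the markup characters and rejects illegal control characters.
    // Everything else, including non-ASCII, is handed to the Writer unchanged.
    static void writeEscaped(Writer writer, String text) throws IOException {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  writer.write("&amp;");  break;
                case '<':  writer.write("&lt;");   break;
                case '>':  writer.write("&gt;");   break;
                case '"':  writer.write("&quot;"); break;
                case '\t':
                case '\n':
                case '\r':
                    writer.write(c); // legal XML whitespace, passed through in this sketch
                    break;
                default:
                    if (c < 0x20) {
                        throw new IllegalArgumentException(
                                "Invalid XML character 0x" + Integer.toHexString(c));
                    }
                    writer.write(c);
                    break;
            }
        }
    }

    // Runs the escaper through a UTF-8 OutputStreamWriter and returns the bytes produced.
    static byte[] toUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writeEscaped(writer, text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : toUtf8("\u20AC")) { // the euro sign
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // prints E2 82 AC
    }
}
```

          The euro sign comes out as the three UTF-8 bytes quoted in the comment, with no character reference involved. Passing tab, newline and carriage return through unchanged is a simplification that may not suit attribute values.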

          Rodrigo Ruiz added a comment -

          Unit test for the patch in AXIS_2342.diff

          Rodrigo Ruiz added a comment -

          This patch modifies the DefaultXMLEncoder and XMLEncoderFactory classes as specified in my last comments.

          It seems to work. At least, it passes most functional-tests (those not relying on unavailable remote services). I have also tested it with SoapUI with success.

          Hope it helps

          Rodrigo Ruiz added a comment -

          I am a bit puzzled with this bug.

          In principle, I agree with Thiago. If the output writer is created with the correct encoding (and it seems it is), there should be no need to "re-encode" characters above 0x7F in UTF-8, or above 0xFFFF in UTF-16.

          It seems the class org.apache.axis.components.encoding.AbstractXMLEncoder fixes this issue in its "encode" method. The problem is that none of its subclasses uses the same strategy in its writeEncoded() method. Why is that?

          In fact, looking at the code, once the "entities replacement" code is removed from the subclasses, they are all the same! It seems we could live with only a single XMLEncoder implementation for all encodings! Please, can anybody confirm or correct this?
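
          One way a single implementation could serve every encoding is to ask the target charset whether it can represent each character, and fall back to a character reference only when it cannot. This is a different technique from the one in the attached patch, offered only as a sketch: CharsetAwareEscaper and escapeFor are hypothetical names, and escaping of markup characters is left out to keep the example short.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of Axis or of the attached patch.
final class CharsetAwareEscaper {

    // Keeps every code point the target charset can encode and replaces
    // the rest with hexadecimal character references.
    static String escapeFor(Charset charset, String text) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String piece = text.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                sb.append(piece);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(codePoint).toUpperCase())
                  .append(';');
            }
            i += length;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00e4\u20ac"; // a-umlaut followed by the euro sign
        // UTF-8 can encode both characters, so nothing is escaped.
        System.out.println(escapeFor(StandardCharsets.UTF_8, text).equals(text)); // prints true
        // ISO-8859-1 has a-umlaut but no euro sign.
        System.out.println(escapeFor(StandardCharsets.ISO_8859_1, text).endsWith("&#x20AC;")); // prints true
        // US-ASCII can encode neither.
        System.out.println(escapeFor(StandardCharsets.US_ASCII, text)); // prints &#xE4;&#x20AC;
    }
}
```

          With this approach UTF-8 and UTF-16 output would never need character references for valid text, while narrower encodings such as ISO-8859-1 would still produce well-formed output for characters outside their repertoire.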

          Vinod Kumar added a comment -

          Hi,

          Is there any version of axis.jar where the patch is already applied? We are facing a similar issue in our application with Axis 1.3.

          Regards
          Vinod

          Tom Gansor added a comment -

          Hi All!

          We also ran into this one (again with German characters), again with a service not supporting
          character entities properly.

          Thanks to Christian for the fix.

          Btw, I was surprised when I saw the original source code.
          I assumed a UTF8Encoder would produce UTF-8-encoded XML rather than forcing
          character entities for any character beyond 0x7f... very funny.

          Regards, Tom

          Christian Müller added a comment -

          Hi All!

          We ran into the same problems... :o(
          I have fixed this issue and am providing the patch and the test (separately). The patch was tested with the special German characters "ä ö ü ß Ä Ö Ü".

          Regards,
          Christian

          Thiago Jung Bauermann added a comment -

          I am opening this issue again because it appears that the fix to this problem was removed from the source code. From what I could tell looking at the subversion repository, the revision 257917 restored the old, buggy code.

          This is affecting me because my application must talk to a webservice which doesn't understand XML character entities (I know, it should, but fixing the webservice is not an option). The only way I can send non-ASCII characters is using UTF-8 or ISO-8859-1, which is not possible with Axis.

          I tested with Axis 1.2.1 and 1.3. I didn't test with the trunk version, but looking at the code with ViewCVS, the problem is still there (class UTF8Encoder).


            People

            • Assignee:
              Unassigned
            • Reporter:
              Thiago Jung Bauermann
            • Votes:
              5
            • Watchers:
              6
