
Key: AXIS-2342
Type: Bug
Status: Open
Assignee: Axis Developers Mailing List
Reporter: Thiago Jung Bauermann
Votes: 5
Watchers: 5
Axis

Reopen issue: Character entities are escaped too aggressively

Created: 16/Dec/05 01:41 AM   Updated: 22/Feb/08 09:23 AM
Component/s: Serialization/Deserialization
Affects Version/s: 1.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  AXIS_2342.diff      2007-04-23 03:10 PM  Rodrigo Ruiz      3 kB
  PATCH_2342.txt      2006-01-27 07:39 AM  Christian Müller  2 kB
  TEST_2342.diff      2007-04-24 08:40 AM  Rodrigo Ruiz      5 kB
  TESTCASE_2342.txt   2006-01-27 07:39 AM  Christian Müller  3 kB
Environment:
Operating System: All
Platform: All
Issue Links:
Cloners
 

Bugzilla Id: 19327


Description:
We are using SOAP to send XML documents from client to server and back. The
documents contain a lot of non-ASCII data, which we encode as UTF-8.
However, when the documents are retrieved from an Axis server, Axis escapes almost
all of our non-ASCII characters into character entities (so &#...;). This means
messages become about three times as big as they have to be for 'international'
documents, which for us is a large performance problem. I narrowed down the problem to
  XMLUtils::xmlEncodeString
that has the code:
                if (((int)chars[i]) > 127) {
                        strBuf.append("&#");
                        strBuf.append((int)chars[i]);
                        strBuf.append(";");
This seems unnecessary to me, as Axis sends all messages in UTF-8 anyway,
for which no escaping is necessary (and if the encoding should be configurable, I feel
this escaping should be done elsewhere).
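
To illustrate the size claim, here is a minimal sketch that compares the UTF-8 byte length of a string with its length after every character above 127 is escaped the way the quoted loop does it. The class and method names (EntitySizeDemo, escapeAbove127, utf8Length) are hypothetical and are not part of Axis.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Axis.
class EntitySizeDemo {
    // Mimics the quoted loop: every char above 127 becomes a decimal entity.
    static String escapeAbove127(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u00e4\u00f6\u00fc"; // the three characters ä, ö, ü
        System.out.println("raw UTF-8 bytes: " + utf8Length(text));
        System.out.println("escaped bytes:   " + utf8Length(escapeAbove127(text)));
    }
}
```

For these three characters the escaped form takes 18 bytes against 6 bytes of raw UTF-8, which matches the roughly threefold growth described above.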

Is there any reason for this code? I commented it out and it seemed to have no
adverse effect on our application (apart from reduced network traffic).

Tested with 1.0, also looked up in the sources of 1.1-rc2.

Comments:
Thiago Jung Bauermann added a comment - 16/Dec/05 01:50 AM
I am opening this issue again because it appears that the fix for this problem was removed from the source code. From what I could tell by looking at the Subversion repository, revision 257917 restored the old, buggy code.

This is affecting me because my application must talk to a webservice which doesn't understand XML character entities (I know, it should, but fixing the webservice is not an option). The only way I can send non-ASCII characters is using UTF-8 or ISO-8859-1, which is not possible with Axis.

I tested with Axis 1.2.1 and 1.3. I didn't test with the trunk version, but looking at the code with ViewCVS, the problem is still there (class UTF8Encoder).

Christian Müller added a comment - 27/Jan/06 07:39 AM
Hi All!

We ran into the same problem... :o(
I have fixed this issue and am providing the patch and the test (separately). The patch is tested with the special German characters "ä ö ü ß Ä Ö Ü".

Regards,
Christian

Tom Gansor added a comment - 22/May/06 09:37 PM
Hi All!

We also ran into this one (again with German characters, and again with a service
that does not support character entities properly).

Thanks to Christian for the fix.

By the way, I was surprised when I saw the original source code.
I assumed a UTFEncoder would produce UTF-8-encoded XML rather than forcing
character entities for any character beyond 0x7f... very funny.

Regards, Tom

Vinod Kumar added a comment - 16/Nov/06 10:01 PM
Hi,

Is there any version of axis.jar where the patch is already applied? We are facing a similar issue in our application with Axis 1.3.

Regards
Vinod

Rodrigo Ruiz added a comment - 23/Apr/07 10:19 AM
I am a bit puzzled by this bug.

In principle, I agree with Thiago. If the output writer is created with the correct encoding (and it seems it is), there should be no need to "re-encode" characters above 0x7F in UTF-8, or above 0xFFFF in UTF-16.

It seems the class org.apache.axis.components.encoding.AbstractXmlEncoder fixes this issue in its "encode" method. The problem is that none of its subclasses uses the same strategy for their writeEncoded() methods. Why is it so?

In fact, looking at the code, once the "entities replacement" code is removed from the subclasses, they are all the same! It seems we could live with only a single XMLEncoder implementation for all encodings! Please, can anybody confirm or correct this?
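
As a minimal sketch of the point about the output writer (not Axis code), the following shows a standard java.io.OutputStreamWriter created with UTF-8 turning a non-ASCII character into its UTF-8 bytes with no entity replacement beforehand. The class and method names (WriterEncodingDemo, writeAsUtf8, toHex) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Axis.
class WriterEncodingDemo {
    // Writes the text through a UTF-8 writer and returns the resulting bytes.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Renders bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00DF (ß) is written as the two UTF-8 bytes c3 9f.
        System.out.println(toHex(writeAsUtf8("\u00df")));
    }
}
```

With the writer doing the byte-level encoding, the encoder itself would only need to escape characters that are significant in XML markup.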

Rodrigo Ruiz added a comment - 23/Apr/07 03:10 PM
This patch modifies the DefaultXMLEncoder and XMLEncoderFactory classes as described in my last comment.

It seems to work. At least, it passes most functional-tests (those not relying on unavailable remote services). I have also tested it with SoapUI with success.

Hope it helps

Rodrigo Ruiz added a comment - 24/Apr/07 08:40 AM
Unit test for the patch in AXIS_2342.diff

Wilbert Pol added a comment - 19/Jul/07 11:02 AM
We ran into this issue with Axis 1.4 in a hybrid Java/Perl/.NET environment while trying to communicate a euro sign (Unicode 20ac, UTF-8 e282ac). The Axis 1.4 service advertised itself as outputting UTF-8, but the euro sign got encoded as a numeric character entity instead of as raw UTF-8 bytes, which IMO looks more like a dirty hack.

What actually helped was removing all the special encoding code from the default case in the writeEncoded method in org.apache.axis.components.encoding.UTF8Encoder. This made Axis output a nice UTF-8 euro sign. It looks like there's some final encoding going on at a higher level in Axis, but I didn't bother to look into it further.

The relevant section of UTF8Encoder becomes:

                case '\t':
                    writer.write(TAB);
                    break;
                default:
                    if (character < 0x20) {
                        throw new IllegalArgumentException(Messages.getMessage(
                                "invalidXmlCharacter00",
                                Integer.toHexString(character),
                                xmlString.substring(0, i)));
                    } else {
                        writer.write(character);
                    }
                    break;
            }
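
For readers who want to try the idea outside Axis, here is a self-contained sketch in the same spirit: it escapes only the characters that matter in XML markup, rejects other control characters, and passes everything else (including non-ASCII) through to the writer. The class name PassThroughXmlEncoder and the encode helper are hypothetical, and writing tab, line feed and carriage return unchanged is an assumption of this sketch, not necessarily what the Axis constants do.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-alone sketch; not the actual Axis UTF8Encoder.
class PassThroughXmlEncoder {
    static void writeEncoded(Writer writer, String xmlString) throws IOException {
        for (int i = 0; i < xmlString.length(); i++) {
            char character = xmlString.charAt(i);
            switch (character) {
                case '&':  writer.write("&amp;");  break;
                case '"':  writer.write("&quot;"); break;
                case '<':  writer.write("&lt;");   break;
                case '>':  writer.write("&gt;");   break;
                case '\n':
                case '\r':
                case '\t':
                    // Assumption: allowed whitespace is written unchanged.
                    writer.write(character);
                    break;
                default:
                    if (character < 0x20) {
                        throw new IllegalArgumentException(
                                "Invalid XML character 0x"
                                + Integer.toHexString(character)
                                + " after: " + xmlString.substring(0, i));
                    }
                    // Everything else, including non-ASCII, goes out as-is;
                    // the writer's charset decides the bytes.
                    writer.write(character);
                    break;
            }
        }
    }

    // Convenience wrapper that encodes into a String.
    static String encode(String xmlString) {
        StringWriter writer = new StringWriter();
        try {
            writeEncoded(writer, xmlString);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for StringWriter
        }
        return writer.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("5 \u20ac < 6 \u20ac & more"));
    }
}
```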


Arseny S added a comment - 22/Feb/08 09:23 AM
This is a major problem for our project: we need to send Russian text through Axis to Axis/C and .NET.
How can this encoding be called UTF8 if it encodes all symbols after 0x7f in a special way?? It should be called HTML encoding then.
The right way of encoding is shown in the previous patch.

For now we are searching for some kind of workaround, maybe using some generic type with our own Serializer/Deserializer classes.