Key: AXIS-840
Type: Bug
Status: Resolved
Resolution: Fixed
Assignee: Axis Developers Mailing List
Reporter: Sander Bos
Votes: 0
Watchers: 0
Axis

Character entities are escaped too aggressively

Created: 26/Apr/03 12:29 AM   Updated: 16/Dec/05 01:41 AM
Component/s: Serialization/Deserialization
Affects Version/s: 1.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
encodepatch.txt (Text File), 2003-07-18 01:01 AM, Jens Schumann, 15 kB
encoder.tar (File), 2003-07-16 09:35 AM, Jens Schumann, 20 kB
encoder.tar.gz (GZip Archive), 2003-07-18 01:00 AM, Jens Schumann, 3 kB
Environment:
Operating System: All
Platform: All
Issue Links:
Cloners
 

Bugzilla Id: 19327


Description
We are using SOAP to send XML documents from client to server and back. The
documents contain a lot of non-ASCII data, which we encode as UTF-8.
However, when the documents are retrieved from an Axis server, Axis escapes
almost all of our characters into character entities (that is, &#...;). This
means messages become about three times as big as they have to be for
'international' documents, which for us is a large performance problem. I
narrowed down the problem to
  XMLUtils::xmlEncodeString
that has the code:
    if (((int)chars[i]) > 127) {
        strBuf.append("&#");
        strBuf.append((int)chars[i]);
        strBuf.append(";");
    }
This seems unnecessary to me, as Axis will send all messages in UTF-8 anyway,
for which no such escaping is necessary (and should the encoding become
configurable, I feel this should be handled elsewhere).

Is there any reason for this code? I commented it out and it seemed to have no
adverse effect on our application (apart from reduced network traffic).

Tested with 1.0, also looked up in the sources of 1.1-rc2.
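
For illustration, a minimal sketch of the size difference described above. It is not the Axis source: the class and method names are hypothetical, and escapeNonAscii only mimics the quoted "> 127" branch.

```java
import java.nio.charset.StandardCharsets;

class EscapeSizeDemo {
    // Mimics the quoted behaviour: every char above 127 becomes a decimal character reference.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Number of bytes the string occupies on the wire when written as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "\u00dc\u00e4\u00f6\u00fc"; // four umlauts
        System.out.println(utf8Length(text));                 // prints 8
        System.out.println(utf8Length(escapeNonAscii(text))); // prints 24
    }
}
```

Each of these umlauts takes two bytes as raw UTF-8 but six bytes as a reference such as &#252;, which matches the factor of three reported here.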

Comments
Davanum Srinivas added a comment - 29/Jun/03 05:21 AM
Fixed.

Thanks,
dims

Jens Schumann added a comment - 29/Jun/03 07:46 AM
Davanum,

I haven't been able to follow all changes post 1.1 release - did you already implement a new
encoding mechanism? Both versions (1.82 & 1.83) will still not work when leaving the Java world
(e.g. nusoap), see #15133.

To me it looks like the current xmlEncodeString method is way too simplified. See Steve's comment
at #15133 too.

Jens

Davanum Srinivas added a comment - 29/Jun/03 09:06 AM
Jens,

Did you try the latest cvs???

Thanks,
dims

Davanum Srinivas added a comment - 29/Jun/03 09:12 AM
Can you please send in a test case against, say, the nusoap interop service
(http://marc.theaimsgroup.com/?l=axis-dev&m=105543299708747&w=2,
http://dietrich.ganx4.com/nusoap/)?

Thanks,
dims

Davanum Srinivas added a comment - 29/Jun/03 09:29 PM
Here's a test against
http://dietrich.ganx4.com/nusoap/testbed/round2_base.wsdl. It works fine.
Closing this bug again.

-- dims

=============================================================================
import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        org.soapinterop.InteropLabLocator locator = new org.soapinterop.InteropLabLocator();
        // Test string made of Latin-1 characters above 127 (plus a plain 'O')
        String s1 = "\u00dc\u00cb\u00cf\u00d6O\u00e4\u00eb\u00ef\u00f6\u00fc\u00ff";
        org.soapinterop.InteropTestPortType port = locator.getinteropTestPort();
        // Write to the console as CP850 so the umlauts print legibly there
        PrintWriter out = new PrintWriter(
                new BufferedWriter(new OutputStreamWriter(System.out, "CP850")), true);
        out.println(s1);                 // string as sent
        String s = port.echoString(s1);  // round trip through the nusoap echo service
        out.println(s);                  // string as returned
    }
}
=============================================================================

Davanum Srinivas added a comment - 30/Jun/03 01:47 AM
Reopening bug. Maybe we should have a flag to aggressively encode character
entities (see Jens' note at
http://marc.theaimsgroup.com/?l=axis-dev&m=105689994130473&w=2). Shouldn't this
be raised with the nusoap folks as well, since Axis's client works fine?

-- dims

Jens Schumann added a comment - 03/Jul/03 11:24 PM
I have taken a look into the Xerces XMLSerializer class. printText() is pretty much what we are
looking for. Would it be OK if we migrate the serializer code and dependent classes AND add
support for other encodings at the same time?

Jens Schumann added a comment - 03/Jul/03 11:30 PM
Oh, and btw.

I don't think this is a nusoap issue. If we declare something as being UTF-8 we should encode strings
as UTF-8, whether using the two-, three-, or four-byte form or the &#... representation.
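
To illustrate why both representations are legitimate in a document declared as UTF-8, here is a small sketch using the JDK's built-in JAXP parser. The class and method names are hypothetical; it only shows that a conforming parser yields the same text for the raw UTF-8 bytes and for the character reference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EquivalentForms {
    // Serializes the XML text as UTF-8 bytes, parses it, and returns the root element's text.
    static String parseText(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // Wrapped so callers in this sketch need not declare checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String raw = header + "<s>\u00fc</s>";   // u-umlaut as two UTF-8 bytes
        String ref = header + "<s>&#252;</s>";   // same character as a reference
        System.out.println(parseText(raw).equals(parseText(ref)));
    }
}
```

A receiver that handles only one of the two forms is therefore not reading the XML correctly, whichever form the sender chose.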

Davanum Srinivas added a comment - 03/Jul/03 11:40 PM
Jens,

Please go ahead and send in a patch. See patch guidelines at
(http://nagoya.apache.org/wiki/apachewiki.cgi?AxisProjectPages/SubmitPatches).

Thanks,
dims

Davanum Srinivas added a comment - 03/Jul/03 11:41 PM
Please don't forget to send in test case(s) as well.

-- dims

Steve Loughran added a comment - 03/Jul/03 11:46 PM
Jens, this Xerces serializer code you mention: would it make Axis dependent on
Xerces? Or do we have to cut and paste the relevant portions, thus creating
maintenance woes further down the line?

-steve

Jens Schumann added a comment - 03/Jul/03 11:51 PM
I was about to choose the most intuitive design pattern: Copy & Paste.
This will include a signature change of XMLUtils, I guess.

Jens Schumann added a comment - 04/Jul/03 12:06 AM
Steve:

I forgot to ask: Why should encoding create maintenance problems? Encoding is pretty much a
fixed topic, as far as I know.


Steve Loughran added a comment - 04/Jul/03 03:47 AM
Usually. But anywhere you cut and paste code, the copies diverge and your
costs/effort increase.

For example, we will need to throw exceptions for \000 and other illegal chars;
that may change the code, and then we are off on our own little branch.

NB: what assumptions are we making about encoding? Does Axis assume UTF-8
everywhere it creates/reads XML? Or should the encoding methods be told what
encoding to expect and do the right thing for the locale?

Jens Schumann added a comment - 04/Jul/03 05:54 AM
Steve:

As always (in the last weeks) I agree with your concerns. And indeed, illegal/unexpected error
handling is one problem area. But I would always consider using well established / tested sources
instead of reinventing the wheel.

I have moved a bunch of Xerces classes to the Axis tree, and this implementation would allow us to
support a lot more than UTF-8/ISO-8859-1. However, this implementation is pretty expensive if
we use it the same way we do right now (by calling a static member for every string). If
the Xerces-based encoding is of any interest, we should use one XMLStringEncoder
instance per request, which uses the incoming request encoding or, for Axis clients, a
preconfigured encoding.



  

Jens Schumann added a comment - 07/Jul/03 11:35 PM
I have started to migrate (copy and paste) a few Xerces classes to ensure proper umlaut encoding
within Axis (EncodingInfo, EncodingMap, Encodings, Printer). Apart from migrating those classes I
tried to achieve the following goals:

1. Support encodings other than UTF-8/ISO-8859-1.

2. Make the encoding configurable through server-config.wsdd settings, with UTF-8 as the default.

3. Provide static, request-dependent (ThreadLocal) access to the current encoding name/String
encoder within Axis.

4. Support two encoding strategies:
 a) Fixed encoding.
 b) Client-call-dependent response encoding.

Apart from some questions, implementing 1 to 4a is pretty straightforward.
A few questions:

Do you think 4b could be useful at all? It looks like a major change for Axis.

The Xerces Encoder makes use of an internal writer which inherits usage of an OutputStream. This
is pretty expensive. Also, the current implementation is stateful and not thread safe. Therefore I
was about to provide a dedicated StringEncoder for every request using ThreadLocals, similar to
AxisEngine.getCurrentMessageContext(). Do you think this is OK? If yes, is a static method in
AxisEngine a proper location to access the current StringXmlEncoder?

The Xerces encoder uses sun.io.CharToByteConverter!?

Currently I do have to deal with UnsupportedEncodingExceptions and low level IOExceptions within
the XML Encoder. In case of an UnsupportedEncodingException I use UTF-8 as fallback (and
complain about the wrong encoding name). This would happen once during initialization. However
it is possible to run into IOExceptions for every String->XML encode. During SOAP Response
Envelope Encoding I could throw a SOAP Fault. What should I do while encoding SOAP Request or
AxisFault elements?
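
For readers following the ThreadLocal idea, a minimal sketch of per-thread access to a stateful, non-thread-safe encoder. Everything here is hypothetical (CurrentEncoder, StringEncoder and their methods are illustrative names, not Axis or Xerces API), and the encoder only escapes markup characters to keep the example short.

```java
class CurrentEncoder {
    /** Hypothetical stateful encoder: reuses a buffer, so it must not be shared across threads. */
    static final class StringEncoder {
        private final String encodingName;
        private final StringBuilder buffer = new StringBuilder(); // reused scratch state

        StringEncoder(String encodingName) { this.encodingName = encodingName; }

        String encodingName() { return encodingName; }

        String encode(String s) {
            buffer.setLength(0);
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&': buffer.append("&amp;"); break;
                    case '<': buffer.append("&lt;"); break;
                    case '>': buffer.append("&gt;"); break;
                    default:  buffer.append(c);
                }
            }
            return buffer.toString();
        }
    }

    // One encoder per thread; UTF-8 is assumed as the default encoding.
    private static final ThreadLocal<StringEncoder> CURRENT =
            ThreadLocal.withInitial(() -> new StringEncoder("UTF-8"));

    static StringEncoder get() { return CURRENT.get(); }

    static void set(String encodingName) { CURRENT.set(new StringEncoder(encodingName)); }

    static void clear() { CURRENT.remove(); }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            CurrentEncoder.set("UTF-16");   // pretend this request arrived as UTF-16
            System.out.println("worker: " + CurrentEncoder.get().encodingName());
            CurrentEncoder.clear();         // avoid leaking state into pooled threads
        });
        worker.start();
        worker.join();
        System.out.println("main:   " + CurrentEncoder.get().encodingName());
        System.out.println(CurrentEncoder.get().encode("a < b & c"));
    }
}
```

The clear() call matters in a servlet container, where request threads are pooled and a stale encoder could otherwise carry over to the next request.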

Rick Kellogg added a comment - 07/Jul/03 11:48 PM
The WS-I Basic Profile requirement R1012 states:

A MESSAGE MUST be serialized as either UTF-8 or UTF-16.


Jens Schumann added a comment - 08/Jul/03 12:25 AM
I remember reading somewhere that the axis team wasn't sure about supporting WS-I Basic Profile.
What is the current state there?

Rick Kellogg added a comment - 08/Jul/03 12:28 AM
Support for WS-I is a requirement for JAX-RPC 1.1 compliance. As such, we
definitely do plan on supporting all of the requirements listed in the WS-I
Basic Profile.

The work you are doing looks very positive. Hopefully in six months we will
not need to support anything but UTF-8 and UTF-16, since it will be mandated.
This in turn will help with interop.

Jens Schumann added a comment - 08/Jul/03 12:37 AM
OK. Got it.

Should we still be able to support non-compliant encodings afterwards? If not, I could certainly
clean out the Xerces classes a lot.

Rick Kellogg added a comment - 08/Jul/03 12:41 AM
I will leave this decision to others more qualified to answer.

Compliance with WS-I does not preclude Axis from supporting other encodings.
It only states we must use UTF-8 or UTF-16. I think we might have a standards
compliance mode or something along those lines. We have yet to discuss it.

Eric Friedman added a comment - 08/Jul/03 12:59 AM
I think we should not preclude other encodings from being used. 32 bit unicode
is a reality, and in some places (China) government regulations stipulate that
software must support certain non-unicode encodings.

Glen Daniels added a comment - 08/Jul/03 01:19 AM
+1

Steve Loughran added a comment - 08/Jul/03 05:30 AM
I've just bounced a q. off to SOAPBuilders to see what they think.

To date, Axis does what? UTF-8 only? So moving to UTF-8 and UTF-16 only is an
improvement, and brings us in line with WS-I.

But if we add support for arbitrary encodings, then we complicate SOAP for
everyone. Someone could write clients that post requests in, say, Sanskrit, have
it all work on an Axis impl, and then complain when the service implementors
moved to gSOAP.

For the sake of interop, therefore, keeping the number of encodings we support
constant, and matching those everyone else supports, would seem to be a good thing.

-Steve (typing on a UK keyboard)

Rick Kellogg added a comment - 08/Jul/03 08:02 PM
Agreed. We should stick with the WS-I guidelines.

Davanum Srinivas added a comment - 08/Jul/03 08:05 PM
WS-I testing tools are at: http://www.ws-i.org/implementation.aspx

Note: They use Axis too :)

-- dims

Jens Schumann added a comment - 09/Jul/03 05:44 AM
OK.

I will provide a patch which supports UTF-8 and UTF-16 only, with the ability to extend supported
encodings in the future. An active Axis instance will always use one fixed encoding, UTF-8 will be
default, UTF-16 may be enabled through configuration.

This may take a few days (not that this is really complicated ;)
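
A minimal sketch of the configuration rule described in this comment: one fixed encoding per instance, UTF-8 by default, UTF-16 on request. The class name and the resolve method are hypothetical, and the fallback to UTF-8 for unknown names follows the behaviour mentioned earlier in this thread rather than any confirmed Axis code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class EncodingConfig {
    // The only encodings accepted; adding an entry here extends the supported set.
    private static final Map<String, Charset> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("UTF-8", StandardCharsets.UTF_8);
        SUPPORTED.put("UTF-16", StandardCharsets.UTF_16);
    }

    /** Maps a configured encoding name to a charset, defaulting to UTF-8. */
    static Charset resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return StandardCharsets.UTF_8; // nothing configured: use the default
        }
        Charset cs = SUPPORTED.get(configured.trim().toUpperCase(Locale.ROOT));
        if (cs == null) {
            // Complain once and fall back instead of failing at start-up.
            System.err.println("Unsupported encoding '" + configured + "', falling back to UTF-8");
            return StandardCharsets.UTF_8;
        }
        return cs;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("utf-16"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```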

Davanum Srinivas added a comment - 09/Jul/03 07:18 AM
sure. thanks.

-- dims

Jens Schumann added a comment - 16/Jul/03 09:32 AM
Since we will support UTF-8 and UTF-16 only (for now), the Xerces-based implementation was way
too heavy. Therefore I searched for an alternative and found http://czyborra.com/utf/. I have
implemented the two encoders based on the algorithms presented there. See the attachment for a
proof of concept.

Steve: You said in #15133 we need to handle chars < 32. Do you have any further details for me?
How should we treat ASCII 0? Throw a runtime exception?

Jens Schumann added a comment - 16/Jul/03 09:35 AM
Created an attachment (id=7320)
UTF-8/UTF-16 encoder - proof of concept

Jens Schumann added a comment - 16/Jul/03 09:36 AM
Sorry, attachment is a simple .tar.

Steve Loughran added a comment - 16/Jul/03 11:43 PM
Jens, see bug ID 15494 regarding handling of zeroes.

Essentially any char < 32 other than tab, cr and lf is illegal, and we should
throw a runtime exception to state that fact.
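
A minimal sketch of that rule, with hypothetical names (XmlCharCheck, requireLegal). It covers only the "below 32" case discussed here; XML 1.0 excludes a few other code points as well, which this sketch does not check.

```java
class XmlCharCheck {
    /** True for characters below 32 other than tab, LF and CR. */
    static boolean isIllegalControl(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    /** Returns the string unchanged, or throws an unchecked exception on the first illegal char. */
    static String requireLegal(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isIllegalControl(c)) {
                throw new IllegalArgumentException(
                        "Illegal XML character 0x" + Integer.toHexString(c) + " at index " + i);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(requireLegal("tab\tand newline are fine"));
        try {
            requireLegal("bad" + (char) 0 + "char"); // embedded NUL
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```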

Jens Schumann added a comment - 18/Jul/03 01:00 AM
Created an attachment (id=7348)
New UTF8/UTF16 XMLEncoder - fixes #15133, #15494, #19327 (tar.gz)

Jens Schumann added a comment - 18/Jul/03 01:01 AM
Created an attachment (id=7349)
Possible patch to use new XML Encoder

Jens Schumann added a comment - 18/Jul/03 01:04 AM
Added a new encoder. TestCase included.

Tested it using nusoap and a local Axis client (Mac OS X). Maybe we should ask Sascha (see #15133)
whether those changes still work for him.

Davanum Srinivas added a comment - 18/Jul/03 07:51 PM

Serge Knystautas made changes - 24/Feb/04 04:12 PM
Field: issue.field.bugzillaimportkey | Original Value: 19327 | New Value: 14781
Thiago Jung Bauermann made changes - 16/Dec/05 01:41 AM
Link: This issue is cloned as AXIS-2342