[XERCESC-2054] Grammar serialization not portable (integer size / alignment issue) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.1.4
Fix Version/s: None
Component/s: None
Labels: None
Environment: Linux CentOS-7 (64bit), Windows 7 (64bit)

Description

Apologies if this is a known issue, but I have not found it by conventional
means (i.e., googling and searching through the bug database here).

I found that the serialisation/deserialisation (here: of grammars) is not as portable as it (IMHO) should be.

The problem happens in XSerializeEngine::readString() when
the length of the string is taken from the associated BinInputStream as
"unsigned long":
    /***
     * Check if any data written
     ***/
    unsigned long tmp;
    *this>>tmp;

On Windows 7 x64 (MSVS 2012), this takes 4 bytes off the head of the stream,
but on CentOS 7 x64 (g++ 4.8.3), it takes 8 bytes.
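The difference comes from the data model: 64-bit Windows uses LLP64 (a 32-bit "long"), while 64-bit Linux uses LP64 (a 64-bit "long"). A minimal sketch to check this on a given compiler follows; the helper names bytesConsumed and printLengthFieldWidths are made up for this illustration and are not part of Xerces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical helper: the number of bytes a reader consumes if it simply
// copies the in-memory representation of T out of the stream.
template <typename T>
constexpr std::size_t bytesConsumed() { return sizeof(T); }

// Prints the width of both candidate length types on the current platform.
// The first line differs between LLP64 and LP64; the second does not.
void printLengthFieldWidths() {
    std::cout << "unsigned long: " << bytesConsumed<unsigned long>() << " bytes\n";
    std::cout << "std::uint32_t: " << bytesConsumed<std::uint32_t>() << " bytes\n";
    // On platforms with 8-bit bytes, uint32_t always occupies exactly 4 bytes.
    assert(bytesConsumed<std::uint32_t>() == 4);
}
```

Calling printLengthFieldWidths() from main on both machines should show the mismatch for "unsigned long" only.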

As a consequence, a BinInputStream carefully encoded on Windows (e.g. by putting
it into a char array with
examples/cxx/tree/embedded/grammar-input-stream.cxx,
which is a common xsd example)
will fail when "reading" it on the Linux box, because everything from the first
string onwards is garbage.

Moreover, this will (probably) give no meaningful error message, just a
thrown "XSerializationException", because at some point it will (probably)
misinterpret wchar data as length information and try to read a next string
that is millions of bytes long (according to the misread BinInputStream).
The BinInputStream will then run out of bytes.
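To illustrate how the bogus length comes about, here is a small sketch under simplifying assumptions: the stream layout (a 4-byte length directly followed by 16-bit characters, little-endian) is invented for the example and is simpler than what XSerializeEngine really writes, and decodeLE, kStream and demoMisreadLength are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode `width` bytes (at most 8) as a little-endian unsigned integer.
std::uint64_t decodeLE(const unsigned char* p, std::size_t width) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < width; ++i)
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    return v;
}

// What a writer with a 4-byte length field would emit for "hello":
// the length 5, then five 16-bit code units.
const unsigned char kStream[] = {
    0x05, 0x00, 0x00, 0x00,                              // length = 5
    'h', 0x00, 'e', 0x00, 'l', 0x00, 'l', 0x00, 'o', 0x00
};

void demoMisreadLength() {
    // A reader with a 4-byte length field sees the intended value.
    assert(decodeLE(kStream, 4) == 5);
    // A reader with an 8-byte length field swallows 'h' and 'e' as well
    // and ends up with a length far beyond the size of the stream.
    assert(decodeLE(kStream, 8) == 0x0065006800000005ULL);
    assert(decodeLE(kStream, 8) > sizeof(kStream));
}
```

With such a length the reader asks for far more bytes than the stream holds, which matches the "runs out of bytes" behaviour described above.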

A similar issue concerns the alignment of the data according to its data type,
which happens for all >> operations: this is (necessarily) very
platform dependent.
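A rough sketch of why the padding differs, assuming the engine rounds the read position up to a multiple of the size of the type being read (that is my reading of the behaviour; the actual Xerces code is not quoted here). The names alignUp, nextOffsetFor and demoPadding are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Assumed policy: a value of type T is placed at a multiple of sizeof(T).
template <typename T>
constexpr std::size_t nextOffsetFor(std::size_t offset) {
    return alignUp(offset, sizeof(T));
}

void demoPadding() {
    // After one byte of data, a 4-byte integer starts at offset 4,
    // but an 8-byte integer starts at offset 8.
    assert(alignUp(1, 4) == 4);
    assert(alignUp(1, 8) == 8);
    // Single-byte values never need padding.
    assert(nextOffsetFor<char>(3) == 3);
}
```

So even if the integer widths were made equal afterwards, a stream written with one padding rule cannot be read with the other.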

It would be a big improvement if Xerces encoded the (de)serialization
in a platform/compiler independent manner. The purpose after all IS to be portable, right?

E.g., the serialisation engine could always use integers of known byte width
(e.g.: #include <inttypes.h> -> use uint32_t) instead of "unsigned long".
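Something along these lines is what I have in mind; it is only a sketch and not a patch against XSerializeEngine. The functions writeU32, readU32 and demoRoundTrip are hypothetical, and fixing the byte order to little-endian is an extra choice made for this example that goes beyond the width issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `v` as exactly 4 bytes, least significant byte first.
void writeU32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
}

// Read exactly 4 bytes starting at `pos` and advance `pos`.
// vector::at throws std::out_of_range if the stream is too short.
std::uint32_t readU32(const std::vector<unsigned char>& in, std::size_t& pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in.at(pos + i)) << (8 * i);
    pos += 4;
    return v;
}

void demoRoundTrip() {
    std::vector<unsigned char> buf;
    writeU32(buf, 123456u);
    assert(buf.size() == 4);          // same size on every platform
    std::size_t pos = 0;
    assert(readU32(buf, pos) == 123456u);
    assert(pos == 4);
}
```

Because both sides work byte by byte, the number of bytes consumed no longer depends on what "unsigned long" happens to be.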

Also, the alignment issue should be addressed; it is hard to predict
what restrictions apply for the compiler (or even processor) in use; some cannot read an integer from a memory address that is not 4-byte aligned.
E.g., the data could be copied (to a properly aligned item initialized with 0s)
before doing the cast to an integer type.
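The copy approach can be sketched as follows, using std::memcpy into a local variable instead of casting the buffer pointer. The names loadU32 and demoUnalignedLoad are hypothetical, and the value is read in the native byte order of the machine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Read a 32-bit integer from an arbitrary (possibly unaligned) address.
std::uint32_t loadU32(const unsigned char* p) {
    std::uint32_t v = 0;             // properly aligned, zero-initialized
    std::memcpy(&v, p, sizeof v);    // byte-wise copy, no alignment demand on p
    return v;
}

void demoUnalignedLoad() {
    unsigned char buf[9] = {};
    const std::uint32_t x = 0xDEADBEEFu;
    std::memcpy(buf + 1, &x, sizeof x);   // store at a deliberately odd offset
    assert(loadU32(buf + 1) == x);
}
```

With this, the stream would not need any type-dependent padding at all.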

In any case, the number of bytes to be read next from the BinInputStream should always be platform-independent.
(Of course, the write operations have to follow the same logic.)

Issue Links

is duplicated by

XERCESC-1959 serializeGrammars does not work between 32 and 64 bit systems

Open

People

Assignee: Unassigned
Reporter: Oliver Moeller
Votes: 0
Watchers: 2

Dates

Created: 02/Nov/15 15:54
Updated: 12/Jul/17 22:55
Resolved: 12/Jul/17 22:55