Xerces-C++ / XERCESC-2054

Grammar serialization not portable (integer size / alignment issue)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.1.4
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Linux CentOS-7 (64bit), Windows 7 (64bit)

    Description

      Apologies if this is a known issue, but I have not found it by conventional
      means (i.e., Google and searching through the bug database here).

      I found that the serialisation/deserialisation (here: of grammars) is not as portable as it (IMHO) should be.

      The problem happens in XSerializeEngine::readString() when
      the length of the string is taken from the associated BinInputStream as
      "unsigned long":

        /***
         * Check if any data written
         ***/
        unsigned long tmp;
        *this>>tmp;

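      On Windows 7 x64 (MSVS 2012), this takes 4 bytes off the head of the stream,
      but on CentOS 7 x64 (g++ 4.8.3) it takes 8 bytes, because "unsigned long" is
      32 bits wide in the Windows LLP64 data model and 64 bits wide in the Linux
      LP64 data model.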
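
The size difference can be checked directly on each platform. This is a minimal sketch in standard C++11; lengthFieldBytes and printLengthFieldSizes are hypothetical helper names for illustration and are not part of Xerces-C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Number of bytes that a raw binary read of type T consumes from a stream.
template <typename T>
constexpr std::size_t lengthFieldBytes() { return sizeof(T); }

// A fixed-width type has the same size on every platform that provides it.
static_assert(lengthFieldBytes<std::uint32_t>() == 4, "uint32_t is 4 bytes");

void printLengthFieldSizes() {
    // unsigned long: 4 on 64-bit Windows (LLP64), 8 on 64-bit Linux (LP64).
    std::printf("unsigned long: %zu bytes\n", lengthFieldBytes<unsigned long>());
    std::printf("uint32_t:      %zu bytes\n", lengthFieldBytes<std::uint32_t>());
}
```

Calling printLengthFieldSizes() on both machines should show the first line differing while the second stays the same.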

      As a consequence, a BinInputStream carefully encoded on Windows (e.g. by
      putting it into a char array with
      examples/cxx/tree/embedded/grammar-input-stream.cxx,
      which is a common xsd example)
      fails when it is read on the Linux box, because everything from the first
      string on is garbage.

      Moreover, this will (probably) give no meaningful error message, just a
      thrown "XSerializationException", because at some point the engine will
      (probably) misinterpret wchar data as length information and try to read a
      string that is millions of bytes long (according to the misread
      BinInputStream). The BinInputStream will then run out of bytes.
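
The misread can be reproduced on a toy buffer. The sketch below is a simplified model and not the actual Xerces stream layout: it assumes a little-endian 4-byte length followed by 16-bit characters, ignores alignment padding, and uses the hypothetical helpers decodeLE and makeStream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode an n-byte little-endian unsigned integer starting at data[pos].
std::uint64_t decodeLE(const std::vector<unsigned char>& data,
                       std::size_t pos, std::size_t n) {
    assert(n <= 8 && pos + n <= data.size());
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v |= static_cast<std::uint64_t>(data[pos + i]) << (8 * i);
    return v;
}

// Toy stream as a writer with a 4-byte "unsigned long" might produce it:
// the length 2, then the string "hi" as two 16-bit code units.
std::vector<unsigned char> makeStream() {
    return { 0x02, 0x00, 0x00, 0x00,   // length = 2 (4 bytes)
             0x68, 0x00,               // 'h'
             0x69, 0x00 };             // 'i'
}
```

Reading the length with decodeLE(stream, 0, 4) gives 2, as written. Reading it with decodeLE(stream, 0, 8), as a platform with an 8-byte "unsigned long" would, swallows the character data and gives 0x0069006800000002, a length far beyond what the stream can supply.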

      A similar issue concerns the alignment of the data according to its data
      type, which happens for all >> operations: this is (necessarily) very
      platform dependent.
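
To illustrate why type-dependent alignment changes the layout, here is a small sketch. It assumes that the read position is rounded up to a multiple of the size of the value about to be read; alignUp is a hypothetical helper and not the Xerces implementation.

```cpp
#include <cstddef>

// Offset at which a value of the given size starts when the current
// offset is first rounded up to a multiple of that size.
constexpr std::size_t alignUp(std::size_t offset, std::size_t size) {
    return (offset + size - 1) / size * size;
}

// At offset 4, a 4-byte integer is read in place ...
static_assert(alignUp(4, 4) == 4, "no padding for a 4-byte value");
// ... while an 8-byte integer first skips 4 bytes of padding.
static_assert(alignUp(4, 8) == 8, "4 bytes of padding for an 8-byte value");
```

Under this assumption, the same logical value lands at different offsets depending on its size on the writing and the reading platform.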

      It would be a big improvement if Xerces encoded the (de)serialization
      in a platform/compiler-independent manner. After all, the purpose IS to be portable, right?

      E.g., the serialisation engine could always use integers of known byte width
      (e.g., #include <inttypes.h> and use uint32_t) instead of "unsigned long".
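
A minimal sketch of this suggestion, assuming a fixed 4-byte little-endian encoding for lengths. The names writeU32 and readU32 are hypothetical and do not exist in Xerces-C++; the byte order is also an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append v as exactly 4 bytes, least significant byte first.
void writeU32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFF));
}

// Read exactly 4 bytes from in at pos and advance pos past them.
std::uint32_t readU32(const std::vector<unsigned char>& in, std::size_t& pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    pos += 4;
    return v;
}
```

Because the value is assembled byte by byte, the result depends neither on sizeof(unsigned long) nor on the byte order of the host.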

      Also, the alignment issue should be addressed; it is hard to predict
      what restrictions apply for the compiler (or even processor) in use, and some cannot read an integer from a memory address that is not 4-byte aligned.
      E.g., the data could be copied (to a properly aligned item initialized with 0s)
      before it is interpreted as an integer type.
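
The copy-before-interpret idea can be sketched with std::memcpy, which is the standard way to read a value from an address that may be unaligned. loadU32Unaligned is a hypothetical name; the sketch keeps the host byte order and only addresses alignment.

```cpp
#include <cstdint>
#include <cstring>

// Copy 4 bytes from a possibly unaligned address into a properly
// aligned, zero-initialized local, then return that local.
std::uint32_t loadU32Unaligned(const unsigned char* p) {
    std::uint32_t v = 0;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

This avoids casting the byte pointer to an integer pointer, which is what can fault on processors with strict alignment requirements.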

      In any case, the number of bytes to be read next from the BinInputStream should always be platform-independent.
      (Of course, the write operations have to follow the same logic.)

            People

              Assignee: Unassigned
              Reporter: Oliver Moeller (omoeller)
              Votes: 0
              Watchers: 2