Spark / SPARK-31703

Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)



    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.5, 3.0.0
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: Spark Core
    • Environment:

      AIX 7.2
      LinuxPPC64 with RedHat.


      Trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) so that we can read data stored in Parquet format, we noticed that values of DOUBLE and DECIMAL types are parsed incorrectly.

      According to the Parquet documentation, values are always stored using a little-endian representation:

      The plain encoding is used whenever a more efficient encoding can not be used. It
      stores the data in the following format:
      BOOLEAN: Bit Packed, LSB first
      INT32: 4 bytes little endian
      INT64: 8 bytes little endian
      INT96: 12 bytes little endian (deprecated)
      FLOAT: 4 bytes IEEE little endian
      DOUBLE: 8 bytes IEEE little endian
      BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
      FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
      For native types, this outputs the data as little endian. Floating
      point types are encoded in IEEE.
      For the byte array type, it encodes the length as a 4 byte little
      endian, followed by the bytes.
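Since the on-disk bytes are always little endian, a reader must force the byte order explicitly instead of using the platform's native order, which is big endian on AIX and ppc64 and produces the garbled values described above. A minimal sketch of decoding a PLAIN-encoded DOUBLE (the class and method names here are illustrative, not Spark's actual reader code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PlainDoubleDecoder {
    // Decode an 8-byte Parquet PLAIN-encoded DOUBLE starting at `offset`.
    // The order is forced to LITTLE_ENDIAN; relying on
    // ByteOrder.nativeOrder() would silently byte-swap the value on
    // big-endian platforms such as AIX 7.2 or Linux on ppc64.
    public static double readDouble(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getDouble();
    }

    public static void main(String[] args) {
        // 1.0 encoded as a little-endian IEEE double:
        // 00 00 00 00 00 00 f0 3f
        byte[] raw = {0, 0, 0, 0, 0, 0, (byte) 0xf0, 0x3f};
        System.out.println(readDouble(raw, 0)); // prints 1.0
    }
}
```

On a little-endian x86 machine the native order happens to match the file format, which is why a missing explicit order only surfaces as a bug on big-endian hardware.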


    • Attachments:
      Data_problem_Spark.gif (2.37 MB, Michail Giannakopoulos)

    • Assignee: Tin Hang To
    • Reporter: Michail Giannakopoulos
    • Votes: 0
    • Watchers: 7