SPARK-31703: Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.5, 3.0.0
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: Spark Core
    • Environment: AIX 7.2
      LinuxPPC64 with RedHat.

    Description

      While upgrading to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) in order to read data stored in parquet format, we noticed that values of DOUBLE and DECIMAL types are parsed incorrectly.

      According to the parquet documentation, values are always stored using a little-endian representation:
      https://github.com/apache/parquet-format/blob/master/Encodings.md

      The plain encoding is used whenever a more efficient encoding can not be used. It
      stores the data in the following format:
      
      BOOLEAN: Bit Packed, LSB first
      INT32: 4 bytes little endian
      INT64: 8 bytes little endian
      INT96: 12 bytes little endian (deprecated)
      FLOAT: 4 bytes IEEE little endian
      DOUBLE: 8 bytes IEEE little endian
      BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
      FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
      
      For native types, this outputs the data as little endian. Floating
      point types are encoded in IEEE.
      For the byte array type, it encodes the length as a 4 byte little
      endian, followed by the bytes.
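      Because the on-disk bytes are always little-endian, a reader on a big-endian platform (AIX, Linux on POWER) must decode them with an explicit byte order instead of the platform default. A minimal sketch in Java illustrating the correct approach (the class and method names here are illustrative, not Spark's actual code):

      ```java
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      public class LittleEndianDecode {
          // Decode an 8-byte little-endian Parquet DOUBLE. Forcing
          // LITTLE_ENDIAN makes the result identical on big-endian and
          // little-endian hosts; relying on the platform's native order
          // is exactly what breaks on AIX / LinuxPPC64.
          static double readDouble(byte[] bytes) {
              return ByteBuffer.wrap(bytes)
                               .order(ByteOrder.LITTLE_ENDIAN)
                               .getDouble();
          }

          public static void main(String[] args) {
              // 1.0 is 0x3FF0000000000000 in IEEE 754; little-endian on
              // disk the bytes appear reversed: 00 00 00 00 00 00 F0 3F.
              byte[] le = {0, 0, 0, 0, 0, 0, (byte) 0xF0, 0x3F};
              System.out.println(readDouble(le)); // prints 1.0
          }
      }
      ```

      A buffer obtained without the `order(...)` call defaults to big-endian in Java's NIO, so the same bytes would decode to a garbage value on any platform; the bug described here is the platform-dependent variant of that mistake.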

      Attachments

        1. Data_problem_Spark.gif
          2.37 MB
          Michail Giannakopoulos


            People

              tinhto-000 Tin Hang To
              miccagiann Michail Giannakopoulos
              Votes: 0
              Watchers: 7
