  1. Spark
  2. SPARK-48460

Spark ORC writer generates incorrect meta information (min, max)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 3.4.2, 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.4.3
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      We found that Hive cannot concatenate some ORC files written by Spark 3.2.1 and later when those files contain long strings.

      Steps to reproduce the issue:
      1. Create two DataFrames: one with a 1024-character string (valid) and one with a 1025-character string (invalid)

      val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;")
      val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;")

      valid.withColumn("len", length($"string")).show()
      +---+----+--------------------+----+
      | id|null|              string| len|
      +---+----+--------------------+----+
      |  1|null|AAAAAAAAAAAAAAAAA...|1024|
      +---+----+--------------------+----+

      invalid.withColumn("len", length($"string")).show()
      +---+----+--------------------+----+
      | id|null|              string| len|
      +---+----+--------------------+----+
      |  1|null|AAAAAAAAAAAAAAAAA...|1025|
      +---+----+--------------------+----+

      2. Write each DataFrame in ORC format to S3 (shown for valid; invalid is written the same way)

      valid.write.format("orc")
            .option("path", "s3://bucket/test/test_orc/")
            .option("compression", "zlib")
            .mode("overwrite")
            .save()

      3. Check the ORC metadata with the hive --orcfiledump command

      [hadoop@ip ~]$ hive --orcfiledump s3://bucket/tets/test_orc/

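      For the 1025-character value, the statistics for column string (Column 3) are incorrect: min and max are null even though the column holds one non-null value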

      Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025

      Processing data file s3://bucket-dev/tets/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]
      Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc
      File Version: 0.12 with FUTURE
      Rows: 1
      Compression: ZLIB
      Compression size: 262144
      Type: struct<id:int,null:string,string:string>

      Stripe Statistics:
        Stripe 1:
          Column 0: count: 1 hasNull: false
          Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
          Column 2: count: 0 hasNull: true bytesOnDisk: 5
          Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025

      File Statistics:
        Column 0: count: 1 hasNull: false
        Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
        Column 2: count: 0 hasNull: true bytesOnDisk: 5
        Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025

      Stripes:
        Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108
          Stream: column 0 section ROW_INDEX start: 3 length 11
          Stream: column 1 section ROW_INDEX start: 14 length 24
          Stream: column 2 section ROW_INDEX start: 38 length 19
          Stream: column 3 section ROW_INDEX start: 57 length 54
          Stream: column 1 section DATA start: 111 length 6
          Stream: column 2 section PRESENT start: 117 length 5
          Stream: column 2 section DATA start: 122 length 0
          Stream: column 2 section LENGTH start: 122 length 0
          Stream: column 2 section DICTIONARY_DATA start: 122 length 0
          Stream: column 3 section DATA start: 122 length 16
          Stream: column 3 section LENGTH start: 138 length 7
          Encoding column 0: DIRECT
          Encoding column 1: DIRECT_V2
          Encoding column 2: DICTIONARY_V2[0]
          Encoding column 3: DIRECT_V2

      File length: 488 bytes
      Padding length: 0 bytes
      Padding ratio: 0%

      User Metadata:
        org.apache.spark.version=3.4.1

      For the DataFrame whose string is exactly 1024 characters long, the statistics are valid (min and max for Column 3 hold the full 1024-character string, shown truncated below as AAAAAAAAAAAAAAAAA...)

      hive --orcfiledump s3://bucket/test/test_orc

      Processing data file s3://bucket/test/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]
      Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc
      File Version: 0.12 with FUTURE
      Rows: 1
      Compression: ZLIB
      Compression size: 262144
      Type: struct<id:int,null:string,string:string>

      Stripe Statistics:
        Stripe 1:
          Column 0: count: 1 hasNull: false
          Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
          Column 2: count: 0 hasNull: true bytesOnDisk: 5
          Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: AAAAAAAAAAAAAAAAA... max: AAAAAAAAAAAAAAAAA... sum: 1024

      File Statistics:
        Column 0: count: 1 hasNull: false
        Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
        Column 2: count: 0 hasNull: true bytesOnDisk: 5
        Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: AAAAAAAAAAAAAAAAA... max: AAAAAAAAAAAAAAAAA... sum: 1024

      Stripes:
        Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 107
          Stream: column 0 section ROW_INDEX start: 3 length 11
          Stream: column 1 section ROW_INDEX start: 14 length 24
          Stream: column 2 section ROW_INDEX start: 38 length 19
          Stream: column 3 section ROW_INDEX start: 57 length 53
          Stream: column 1 section DATA start: 110 length 6
          Stream: column 2 section PRESENT start: 116 length 5
          Stream: column 2 section DATA start: 121 length 0
          Stream: column 2 section LENGTH start: 121 length 0
          Stream: column 2 section DICTIONARY_DATA start: 121 length 0
          Stream: column 3 section DATA start: 121 length 16
          Stream: column 3 section LENGTH start: 137 length 7
          Encoding column 0: DIRECT
          Encoding column 1: DIRECT_V2
          Encoding column 2: DICTIONARY_V2[0]
          Encoding column 3: DIRECT_V2

      File length: 485 bytes
      Padding length: 0 bytes
      Padding ratio: 0%

      User Metadata:
        org.apache.spark.version=3.4.1
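
      The visual check in step 3 can also be automated. The following is a minimal sketch in plain Scala, assuming the dump text has the "Column N: count: ... min: ... max: ... sum: ..." shape shown above. OrcDumpCheck, ColumnStat, parse and nullMinMax are hypothetical names introduced here for illustration; they are not part of Hive, Spark, or ORC. The helper only flags columns that report min: null and max: null while count is greater than zero. Whether that pattern indicates a real defect is not settled by this report, which was resolved as Not A Bug.

```scala
// Hypothetical helper for scanning `hive --orcfiledump` text output.
// Assumption: min/max values contain no whitespace (true for the dumps in
// this report; values such as timestamps with spaces would need a stricter parser).
final case class ColumnStat(
    column: Int,
    count: Long,
    min: Option[String],
    max: Option[String],
    sum: Option[Long])

object OrcDumpCheck {
  // Matches one "Column N: ..." entry. bytesOnDisk, min/max and sum are optional
  // because the root struct and all-null columns omit them.
  private val ColumnEntry =
    """Column\s+(\d+):\s+count:\s+(\d+)\s+hasNull:\s+(?:true|false)(?:\s+bytesOnDisk:\s+\d+)?(?:\s+min:\s+(\S+)\s+max:\s+(\S+))?(?:\s+sum:\s+(\d+))?""".r

  // Extracts every column entry, whether the dump has one entry per line
  // or several entries flattened onto a single line.
  def parse(dump: String): List[ColumnStat] =
    ColumnEntry.findAllMatchIn(dump).map { m =>
      ColumnStat(
        column = m.group(1).toInt,
        count = m.group(2).toLong,
        min = Option(m.group(3)), // unmatched optional groups come back as null
        max = Option(m.group(4)),
        sum = Option(m.group(5)).map(_.toLong))
    }.toList

  // Columns that hold at least one value but print the literal "null" for min and max.
  def nullMinMax(stats: List[ColumnStat]): List[ColumnStat] =
    stats.filter(s => s.count > 0 && s.min.contains("null") && s.max.contains("null"))
}
```

      Passing the "Stripe Statistics" text of the 1025-character file to OrcDumpCheck.parse and then to OrcDumpCheck.nullMinMax should single out Column 3, while the all-null Column 2 is skipped because its count is 0.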

       
       

      People

        Assignee: Unassigned
        Reporter: Volodymyr T (tatianchuk)
        Votes: 0
        Watchers: 2