Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Bug
Affects Version/s: 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.3, 3.4.2, 3.3.2, 3.4.0, 3.4.1, 3.5.0, 3.5.1, 3.3.4, 3.4.3
-
None
-
None
Description
We found that Hive cannot concatenate some ORC files written by Spark 3.2.1 and later when those files contain long strings.
Steps to reproduce the issue:
1. Create two DataFrames: one with a string of exactly 1024 characters (valid) and one with a string longer than 1024 characters (invalid)
val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;")
val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;")
valid.withColumn("len", length($"string")).show()
+---+----+--------------------+----+
| id|null|              string| len|
+---+----+--------------------+----+
|  1|null|AAAAAAAAAAAAAAAAA...|1024|
+---+----+--------------------+----+
invalid.withColumn("len", length($"string")).show()
+---+----+--------------------+----+
| id|null|              string| len|
+---+----+--------------------+----+
|  1|null|AAAAAAAAAAAAAAAAA...|1025|
+---+----+--------------------+----+
2. Write each DataFrame in ORC format to S3 (shown here for valid)
valid.write.format("orc")
  .option("path", "s3://bucket/test/test_orc/")
  .option("compression", "zlib")
  .mode("overwrite")
  .save()
3. Check the ORC metadata with the hive --orcfiledump command
[hadoop@ip ~]$ hive --orcfiledump s3://bucket/test/test_orc/
We can see incorrect statistics for the column named string (Column 3): min and max are null
Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
Processing data file s3://bucket-dev/tets/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]
Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc
File Version: 0.12 with FUTURE
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<id:int,null:string,string:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
    Column 2: count: 0 hasNull: true bytesOnDisk: 5
    Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
  Column 2: count: 0 hasNull: true bytesOnDisk: 5
  Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025

Stripes:
  Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 2 section ROW_INDEX start: 38 length 19
    Stream: column 3 section ROW_INDEX start: 57 length 54
    Stream: column 1 section DATA start: 111 length 6
    Stream: column 2 section PRESENT start: 117 length 5
    Stream: column 2 section DATA start: 122 length 0
    Stream: column 2 section LENGTH start: 122 length 0
    Stream: column 2 section DICTIONARY_DATA start: 122 length 0
    Stream: column 3 section DATA start: 122 length 16
    Stream: column 3 section LENGTH start: 138 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[0]
    Encoding column 3: DIRECT_V2

File length: 488 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.4.1
For the DataFrame whose string is exactly 1024 characters long, we can see valid statistics
hive --orcfiledump s3://bucket/test/test_orc
Processing data file s3://bucket/test/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc [length: 485]
Structure for s3://timmedia-dev/volodymyr/test_orc/part-00000-e395cc4d-9e2a-4ef0-9adb-640ed41dd2b7-c000.zlib.orc
File Version: 0.12 with FUTURE
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<id:int,null:string,string:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
    Column 2: count: 0 hasNull: true bytesOnDisk: 5
    Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: AAAAAAAAAAAAAAAAAAAA... ('A' repeated 1024 times) max: AAAAAAAAAAAAAAAAAAAA... ('A' repeated 1024 times) sum: 1024

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
  Column 2: count: 0 hasNull: true bytesOnDisk: 5
  Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: AAAAAAAAAAAAAAAAAAAA... ('A' repeated 1024 times) max: AAAAAAAAAAAAAAAAAAAA... ('A' repeated 1024 times) sum: 1024

Stripes:
  Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 107
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 2 section ROW_INDEX start: 38 length 19
    Stream: column 3 section ROW_INDEX start: 57 length 53
    Stream: column 1 section DATA start: 110 length 6
    Stream: column 2 section PRESENT start: 116 length 5
    Stream: column 2 section DATA start: 121 length 0
    Stream: column 2 section LENGTH start: 121 length 0
    Stream: column 2 section DICTIONARY_DATA start: 121 length 0
    Stream: column 3 section DATA start: 121 length 16
    Stream: column 3 section LENGTH start: 137 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[0]
    Encoding column 3: DIRECT_V2

File length: 485 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.4.1
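The two dumps differ in the min and max fields of Column 3, so affected files can be spotted by scanning the dump text. Below is a minimal sketch in plain Scala (no Spark dependency) that assumes the statistics line layout shown in the dumps above. The object name OrcStatsCheck, the method name columnsMissingMinMax and the regular expression are hypothetical helpers written for this report, not part of Hive, ORC or Spark. A string column whose real minimum and maximum are the literal text "null" would be flagged as well, so treat a hit as a hint rather than proof.

```scala
// Sketch only: flags columns in `hive --orcfiledump` text that report
// values (count > 0) but print both min and max as "null".
// The pattern is an assumption derived from the dump lines in this report.
object OrcStatsCheck {
  // Example of a matching line:
  // "Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025"
  private val NullMinMax =
    """Column (\d+): count: (\d+) hasNull: \w+(?: bytesOnDisk: \d+)? min: null max: null""".r

  /** Returns the sorted, distinct column ids that have values but no min/max. */
  def columnsMissingMinMax(dump: String): Seq[Int] =
    NullMinMax.findAllMatchIn(dump)
      .filter(m => m.group(2).toLong > 0) // ignore columns without any values
      .map(m => m.group(1).toInt)
      .toSeq
      .distinct
      .sorted
}
```

Passing the Column 3 line from the 1025-character dump to OrcStatsCheck.columnsMissingMinMax yields a sequence containing only column 3. The statistics lines from the 1024-character dump yield an empty sequence, because every min and max there is populated. The same column appears under both Stripe Statistics and File Statistics, which is why the result is de-duplicated.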