Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Not A Problem
- Affects Version/s: 2.1.0
- Fix Version/s: None
Description
While loading some data with Spark 2.1, I realized that decimal(12,2) columns written to Parquet by Spark are not readable by Hive or Impala.
Repro
CREATE TABLE customer_acctbal (c_acctbal decimal(12,2)) STORED AS PARQUET;
INSERT INTO customer_acctbal VALUES (7539.95);
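For reference, a minimal spark-shell (Scala) sketch of the same repro, assuming Spark 2.1 defaults and that the insert goes through Spark's native Parquet writer. With the default spark.sql.parquet.writeLegacyFormat=false, Spark stores decimal(12,2) as an int64-backed Parquet decimal, which is what trips up Hive and Impala below; setting the flag to true before the insert makes Spark write the older fixed_len_byte_array layout that both readers understand.

// spark-shell (Scala) sketch, Spark 2.1.
// Workaround: write decimals in the legacy (Hive/Impala-compatible)
// fixed_len_byte_array layout instead of the default int64 encoding.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("CREATE TABLE customer_acctbal (c_acctbal decimal(12,2)) STORED AS PARQUET")
spark.sql("INSERT INTO customer_acctbal VALUES (7539.95)")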
Error from Hive
Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
Time taken: 0.122 seconds
Error from Impala
File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: DECIMAL(12,2), Parquet schema:
optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
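The mismatch is visible in that schema line: the Parquet format allows decimals with precision 10 through 18 to be stored as int64, which Spark uses by default, while the Hive and Impala versions here appear to read decimals only from fixed_len_byte_array. A sketch to confirm the physical type by dumping the footer of one of the written files, using the parquet-hadoop classes already on spark-shell's classpath (the file path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read only the footer and print the physical schema; the path below
// stands in for one of the part files written by the insert above.
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000.snappy.parquet"))
println(footer.getFileMetaData.getSchema)
// Expect something like: optional int64 c_acctbal (DECIMAL(12,2))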
Table info
hive> describe formatted customer_acctbal;
OK
# col_name            	data_type           	comment
c_acctbal             	decimal(12,2)

# Detailed Table Information
Database:             	tpch_nested_3000_parquet
Owner:                	mmokhtar
CreateTime:           	Mon Apr 10 17:47:24 PDT 2017
LastAccessTime:       	UNKNOWN
Protect Mode:         	None
Retention:            	0
Location:             	hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
Table Type:           	MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	numFiles            	1
	numRows             	0
	rawDataSize         	0
	totalSize           	120
	transient_lastDdlTime	1491871644

# Storage Information
SerDe Library:        	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:          	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:         	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed:           	No
Num Buckets:          	-1
Bucket Columns:       	[]
Sort Columns:         	[]
Storage Desc Params:
	serialization.format	1
Time taken: 0.032 seconds, Fetched: 31 row(s)