SPARK-20297: Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      While loading some data with Spark 2.1, I realized that decimal(12,2) columns written to Parquet by Spark are not readable by Hive or Impala.

      Repro

      CREATE TABLE customer_acctbal(
        c_acctbal decimal(12,2))
      STORED AS Parquet;
      
      insert into customer_acctbal values (7539.95);
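
      The problem only shows up when the INSERT is executed through Spark (for example spark-sql or spark-shell), because it is Spark's Parquet writer that picks the physical encoding of the decimal column; the same statements run in Hive write decimals in Hive's own encoding. A minimal spark-shell sketch of that write path, assuming a Hive-enabled session (the session setup and app name here are illustrative, not from the report):

      import org.apache.spark.sql.SparkSession

      // Hive-enabled session so the table lands in the Hive warehouse
      // (in spark-shell this just returns the existing `spark` session).
      val spark = SparkSession.builder()
        .appName("decimal-parquet-repro")
        .enableHiveSupport()
        .getOrCreate()

      spark.sql("CREATE TABLE customer_acctbal (c_acctbal DECIMAL(12,2)) STORED AS PARQUET")
      spark.sql("INSERT INTO customer_acctbal VALUES (7539.95)")

      // The logical schema still reports decimal(12,2); the incompatibility is
      // only visible at the Parquet level (e.g. `parquet-tools schema <file>`),
      // where Spark's default writer stores the column as int64.
      spark.table("customer_acctbal").printSchema()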
      

      Error from Hive

      Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
      Time taken: 0.122 seconds
      

      Error from Impala

      File 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet' has an incompatible Parquet schema for column 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: DECIMAL(12,2), Parquet schema:
      optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
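
      The schema line above shows what Spark actually wrote: for decimals with precision up to 18, Spark's default (non-legacy) Parquet writer uses an int32/int64 physical type, which the Parquet format allows but which the Hive and Impala versions involved here could not read, since they expected the older fixed_len_byte_array encoding; that mismatch is consistent with the "Not A Problem" resolution. A workaround sketch, assuming the data can be re-written from a Spark session such as the one above, is to switch Spark to the legacy (Hive/Impala-compatible) decimal encoding before writing:

      // Apply before the write; it only affects files written afterwards,
      // so existing Parquet files still need to be regenerated.
      spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

      // Rows written from this point on store c_acctbal as
      // fixed_len_byte_array instead of int64.
      spark.sql("INSERT INTO customer_acctbal VALUES (7539.95)")

      The same setting can be applied in spark-sql with SET spark.sql.parquet.writeLegacyFormat=true before the INSERT.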
      

      Table info

      hive> describe formatted customer_acctbal;
      OK
      # col_name              data_type               comment
      
      c_acctbal               decimal(12,2)
      
      # Detailed Table Information
      Database:               tpch_nested_3000_parquet
      Owner:                  mmokhtar
      CreateTime:             Mon Apr 10 17:47:24 PDT 2017
      LastAccessTime:         UNKNOWN
      Protect Mode:           None
      Retention:              0
      Location:               hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
      Table Type:             MANAGED_TABLE
      Table Parameters:
              COLUMN_STATS_ACCURATE   true
              numFiles                1
              numRows                 0
              rawDataSize             0
              totalSize               120
              transient_lastDdlTime   1491871644
      
      # Storage Information
      SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
      OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
      Compressed:             No
      Num Buckets:            -1
      Bucket Columns:         []
      Sort Columns:           []
      Storage Desc Params:
              serialization.format    1
      Time taken: 0.032 seconds, Fetched: 31 row(s)
      

          People

            Assignee: Unassigned
            Reporter: Mostafa Mokhtar (mmokhtar)
            Votes: 0
            Watchers: 4
