Spark / SPARK-20697

MSCK REPAIR TABLE resets the Storage Information for bucketed Hive tables

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0, 2.2.0, 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      When MSCK REPAIR TABLE is run from Spark to recover partitions of a partitioned+bucketed Hive table, the bucketing information in the table's storage descriptor in the metastore is reset instead of being preserved.

      Steps to reproduce:
      1) Create a partitioned+bucketed table in Hive:

      CREATE TABLE partbucket(a int) PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

      2) In Hive-CLI, run desc formatted on the table. Note that Num Buckets is 10 and Bucket Columns is [a]:

      # col_name data_type comment

      a int

      # Partition Information
      # col_name data_type comment

      b int

      # Detailed Table Information
      Database: sparkhivebucket
      Owner: devbld
      CreateTime: Wed May 10 10:31:07 PDT 2017
      LastAccessTime: UNKNOWN
      Protect Mode: None
      Retention: 0
      Location: hdfs://localhost:8020/user/hive/warehouse/partbucket
      Table Type: MANAGED_TABLE
      Table Parameters:
        transient_lastDdlTime 1494437467

      # Storage Information
      SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat: org.apache.hadoop.mapred.TextInputFormat
      OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Compressed: No
      Num Buckets: 10
      Bucket Columns: [a]
      Sort Columns: []
      Storage Desc Params:
        field.delim ,
        serialization.format ,

      3) In spark-shell, run:

      scala> spark.sql("MSCK REPAIR TABLE partbucket")

      4) Back in Hive-CLI, run desc formatted again. Num Buckets is now -1 and Bucket Columns is empty:

      desc formatted partbucket;

      # col_name data_type comment

      a int

      # Partition Information
      # col_name data_type comment

      b int

      # Detailed Table Information
      Database: sparkhivebucket
      Owner: devbld
      CreateTime: Wed May 10 10:31:07 PDT 2017
      LastAccessTime: UNKNOWN
      Protect Mode: None
      Retention: 0
      Location: hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket
      Table Type: MANAGED_TABLE
      Table Parameters:
        spark.sql.partitionProvider catalog
        transient_lastDdlTime 1494437647

      # Storage Information
      SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat: org.apache.hadoop.mapred.TextInputFormat
      OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      Compressed: No
      Num Buckets: -1
      Bucket Columns: []
      Sort Columns: []
      Storage Desc Params:
        field.delim ,
        serialization.format ,

      With Num Buckets reset to -1 and Bucket Columns emptied, further inserts into this table through Hive can no longer be made in bucketed fashion.
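
To make the regression easier to spot, the two desc formatted dumps can be compared mechanically. The following is only a sketch: DescDiff, fields, changed, and bucketKeys are hypothetical names (not part of Spark or Hive), and it assumes the dumps use the "Key: value" layout shown in this report. The sample strings contain only the bucketing rows copied from the Storage Information sections above.

```scala
// Hypothetical helper, not part of Spark or Hive: compares selected
// "Key: value" rows of two Hive `desc formatted` dumps.
object DescDiff {
  // Collect "Key: value" rows. Header lines such as "# Storage Information"
  // and labels such as "Table Parameters:" contain no ": " and are skipped.
  def fields(desc: String): Map[String, String] =
    desc.split("\n").map(_.trim).filter(_.contains(": ")).map { line =>
      val idx = line.indexOf(": ")
      line.substring(0, idx) -> line.substring(idx + 2).trim
    }.toMap

  // For each requested key, report (before, after) when the values differ.
  def changed(before: String, after: String,
              keys: Seq[String]): Map[String, (String, String)] = {
    val (b, a) = (fields(before), fields(after))
    keys.flatMap { k =>
      val (old, cur) = (b.getOrElse(k, "<missing>"), a.getOrElse(k, "<missing>"))
      if (old != cur) Some(k -> ((old, cur))) else None
    }.toMap
  }

  // Rows of the Storage Information section that describe bucketing.
  val bucketKeys: Seq[String] = Seq("Num Buckets", "Bucket Columns", "Sort Columns")

  // Bucketing rows as reported before MSCK REPAIR TABLE (step 2).
  val before: String =
    """Num Buckets: 10
      |Bucket Columns: [a]
      |Sort Columns: []""".stripMargin

  // Bucketing rows as reported after MSCK REPAIR TABLE (step 4).
  val after: String =
    """Num Buckets: -1
      |Bucket Columns: []
      |Sort Columns: []""".stripMargin
}
```

Calling DescDiff.changed(DescDiff.before, DescDiff.after, DescDiff.bucketKeys) on these samples reports two changed rows: Num Buckets from 10 to -1, and Bucket Columns from [a] to []. Sort Columns is unchanged and therefore omitted. The full text of each dump can be passed in the same way, since rows without a "Key: value" shape are ignored.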

            People

              Assignee: Unassigned
              Reporter: Abhishek Madav (abhimadav)