Spark / SPARK-32281

Spark wipes out SORTED spec in metastore when DESC is used


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When altering a Hive bucketed table or updating its statistics, Spark wipes out the table's SORTED BY specification in the metastore if that specification uses DESC.

       For example:

      0: jdbc:hive2://localhost:10000> -- in beeline
      0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
      No rows affected (0.045 seconds)
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | SORTED BY (                                        |
      |   c ASC,                                           |
      |   d DESC)                                          |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488043')            |
      +----------------------------------------------------+
      21 rows selected (0.042 seconds)
      0: jdbc:hive2://localhost:10000> 
      -
      -
      -
      
      scala> // in spark
      
      scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
      20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> 
      -
      -
      -
      
      0: jdbc:hive2://localhost:10000> -- back in beeline
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'foo'='bar',                                     |
      |   'spark.sql.partitionProvider'='catalog',         |
      |   'transient_lastDdlTime'='1594488098')            |
      +----------------------------------------------------+
      20 rows selected (0.038 seconds)
      0: jdbc:hive2://localhost:10000> 
      

      Note that the SORTED BY specification disappears.
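For context, Hive's metastore models the SORTED BY clause as a list of `Order` entries on the table's `StorageDescriptor` (`sortCols`), where each entry pairs a column name with a direction flag (1 for ASC, 0 for DESC). The sketch below uses simplified stand-ins for the real `org.apache.hadoop.hive.metastore.api` classes (not Spark's actual code) to illustrate the failure mode: any alter path that rebuilds the descriptor without copying `sortCols` persists a table with no sort spec at all.

```java
import java.util.ArrayList;
import java.util.List;

public class SortColsSketch {
    // Simplified stand-in for org.apache.hadoop.hive.metastore.api.Order
    static class Order {
        final String col;
        final int order; // Hive's convention: 1 = ASC, 0 = DESC
        Order(String col, int order) { this.col = col; this.order = order; }
    }

    // Simplified stand-in for StorageDescriptor: only the fields we need here
    static class StorageDescriptor {
        List<String> bucketCols = new ArrayList<>();
        int numBuckets;
        List<Order> sortCols = new ArrayList<>();
    }

    // Hypothetical alter path that rebuilds the descriptor but never copies sortCols
    static StorageDescriptor rebuildWithoutSortCols(StorageDescriptor in) {
        StorageDescriptor out = new StorageDescriptor();
        out.bucketCols = new ArrayList<>(in.bucketCols); // bucketing survives
        out.numBuckets = in.numBuckets;
        // sortCols is never copied -> the SORTED BY spec is wiped on alter
        return out;
    }

    public static void main(String[] args) {
        StorageDescriptor sd = new StorageDescriptor();
        sd.bucketCols.add("c");
        sd.numBuckets = 10;
        sd.sortCols.add(new Order("c", 1)); // c ASC
        sd.sortCols.add(new Order("d", 0)); // d DESC

        StorageDescriptor altered = rebuildWithoutSortCols(sd);
        System.out.println("bucketCols=" + altered.bucketCols
                + " numBuckets=" + altered.numBuckets
                + " sortCols=" + altered.sortCols.size()); // sortCols=0 after the alter
    }
}
```

This matches what the transcripts show: CLUSTERED BY and INTO 10 BUCKETS survive the round trip, while SORTED BY vanishes.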

      Another example, this time using insert:

      0: jdbc:hive2://localhost:10000> -- in beeline
      0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
      No rows affected (0.055 seconds)
      0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
      WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      No rows affected (1.689 seconds)
      0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
      WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      No rows affected (1.516 seconds)
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | SORTED BY (                                        |
      |   c ASC,                                           |
      |   d DESC)                                          |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488191')            |
      +----------------------------------------------------+
      21 rows selected (0.078 seconds)
      0: jdbc:hive2://localhost:10000> 
      -
      -
      -
      
      scala> // in spark
      
      scala> sql("set hive.enforce.sorting=false")
      20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.sorting) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
      res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> sql("set hive.enforce.bucketing=false")
      20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
        toDF("a", "b", "c", "d").createOrReplaceTempView("df")
      
           | 
      scala> 
      scala> sql("insert into bucketed select * from df")
      20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table `default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive. Inserting data anyways since both hive.enforce.bucketing and hive.enforce.sorting are set to false.
      20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      res3: org.apache.spark.sql.DataFrame = []
      
      scala> 
      -
      -
      -
      
      0: jdbc:hive2://localhost:10000> -- back in beeline
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488259')            |
      +----------------------------------------------------+
      18 rows selected (0.041 seconds)
      0: jdbc:hive2://localhost:10000> 
      

      Again, the SORTED BY specification disappears.
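One plausible reason the spec cannot round-trip (an assumption on my part, not confirmed in this ticket): Spark's catalog-side bucket spec (`BucketSpec` in `org.apache.spark.sql.catalyst.catalog`) records only `sortColumnNames`, with no per-column direction, so a SORTED BY clause containing DESC has no lossless representation on the Spark side. A hypothetical converter with that shape is forced to either misreport the direction or drop the spec entirely:

```java
import java.util.*;

public class LossyBucketSpec {
    // Simplified stand-in for Spark's BucketSpec: sort columns are names only,
    // so there is nowhere to record ASC vs DESC
    static class BucketSpec {
        final int numBuckets;
        final List<String> bucketColumnNames;
        final List<String> sortColumnNames;
        BucketSpec(int n, List<String> b, List<String> s) {
            numBuckets = n; bucketColumnNames = b; sortColumnNames = s;
        }
    }

    // Hypothetical converter: an all-ascending Hive sort spec maps cleanly,
    // but any DESC column makes the spec unrepresentable, so it returns
    // empty (and a caller writing the table back persists no sort spec)
    static Optional<BucketSpec> fromHive(int numBuckets, List<String> bucketCols,
                                         Map<String, Integer> sortCols /* 1=ASC, 0=DESC */) {
        boolean allAscending = sortCols.values().stream().allMatch(o -> o == 1);
        if (!allAscending) return Optional.empty();
        return Optional.of(new BucketSpec(numBuckets, bucketCols,
                new ArrayList<>(sortCols.keySet())));
    }

    public static void main(String[] args) {
        Map<String, Integer> sortCols = new LinkedHashMap<>();
        sortCols.put("c", 1); // c ASC
        sortCols.put("d", 0); // d DESC
        System.out.println(fromHive(10, List.of("c"), sortCols).isPresent()); // false: spec dropped
    }
}
```

Under this model the title's qualifier makes sense: a table sorted only ASC converts losslessly, while the `(c asc, d desc)` table from the examples has no Spark-side representation to write back.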

          People

            Assignee: Unassigned
            Reporter: Bruce Robbins (bersprockets)
            Votes: 0
            Watchers: 4
