Spark / SPARK-32281

Spark wipes out SORTED spec in metastore when DESC is used


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When altering a Hive bucketed table or updating its statistics, Spark wipes out the table's SORTED BY specification in the metastore if that specification uses DESC.

       For example:

      0: jdbc:hive2://localhost:10000> -- in beeline
      0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
      No rows affected (0.045 seconds)
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | SORTED BY (                                        |
      |   c ASC,                                           |
      |   d DESC)                                          |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488043')            |
      +----------------------------------------------------+
      21 rows selected (0.042 seconds)
      0: jdbc:hive2://localhost:10000> 
      -
      -
      -
      
      scala> // in spark
      
      scala> sql("alter table bucketed set tblproperties ('foo'='bar')")
      20/07/11 10:21:36 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:21:38 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      res0: org.apache.spark.sql.DataFrame = []
      
      scala> 
      -
      -
      -
      
      0: jdbc:hive2://localhost:10000> -- back in beeline
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'foo'='bar',                                     |
      |   'spark.sql.partitionProvider'='catalog',         |
      |   'transient_lastDdlTime'='1594488098')            |
      +----------------------------------------------------+
      20 rows selected (0.038 seconds)
      0: jdbc:hive2://localhost:10000> 
      

      Note that the SORTED BY specification disappears.
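For context, Hive's metastore models the SORTED BY clause as a list of `Order` entries on the table's `StorageDescriptor` (`sortCols`), where each entry pairs a column name with a direction flag (1 for ASC, 0 for DESC). The sketch below uses simplified stand-ins for the real `org.apache.hadoop.hive.metastore.api` classes (not Spark's actual code) to illustrate the failure mode: any alter path that rebuilds the descriptor without copying `sortCols` persists a table with no sort spec at all.

```java
import java.util.ArrayList;
import java.util.List;

public class SortColsSketch {
    // Simplified stand-in for org.apache.hadoop.hive.metastore.api.Order
    static class Order {
        final String col;
        final int order; // Hive's convention: 1 = ASC, 0 = DESC
        Order(String col, int order) { this.col = col; this.order = order; }
    }

    // Simplified stand-in for StorageDescriptor: only the fields we need here
    static class StorageDescriptor {
        List<String> bucketCols = new ArrayList<>();
        int numBuckets;
        List<Order> sortCols = new ArrayList<>();
    }

    // Hypothetical alter path that rebuilds the descriptor but never copies sortCols
    static StorageDescriptor rebuildWithoutSortCols(StorageDescriptor in) {
        StorageDescriptor out = new StorageDescriptor();
        out.bucketCols = new ArrayList<>(in.bucketCols); // bucketing survives
        out.numBuckets = in.numBuckets;
        // sortCols is never copied -> the SORTED BY spec is wiped on alter
        return out;
    }

    public static void main(String[] args) {
        StorageDescriptor sd = new StorageDescriptor();
        sd.bucketCols.add("c");
        sd.numBuckets = 10;
        sd.sortCols.add(new Order("c", 1)); // c ASC
        sd.sortCols.add(new Order("d", 0)); // d DESC

        StorageDescriptor altered = rebuildWithoutSortCols(sd);
        System.out.println("bucketCols=" + altered.bucketCols
                + " numBuckets=" + altered.numBuckets
                + " sortCols=" + altered.sortCols.size()); // sortCols=0 after the alter
    }
}
```

This matches what the transcripts show: CLUSTERED BY and INTO 10 BUCKETS survive the round trip, while SORTED BY vanishes.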

      Another example, this time using insert:

      0: jdbc:hive2://localhost:10000> -- in beeline
      0: jdbc:hive2://localhost:10000> create table bucketed (a int, b int, c int, d int) clustered by (c) sorted by (c asc, d desc) into 10 buckets;
      No rows affected (0.055 seconds)
      0: jdbc:hive2://localhost:10000> insert into table bucketed values (0, 1, 2, 3);
      WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      No rows affected (1.689 seconds)
      0: jdbc:hive2://localhost:10000> analyze table bucketed compute statistics;
      WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      No rows affected (1.516 seconds)
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | SORTED BY (                                        |
      |   c ASC,                                           |
      |   d DESC)                                          |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488191')            |
      +----------------------------------------------------+
      21 rows selected (0.078 seconds)
      0: jdbc:hive2://localhost:10000> 
      -
      -
      -
      
      scala> // in spark
      
      scala> sql("set hive.enforce.sorting=false")
      20/07/11 10:23:57 WARN SetCommand: 'SET hive.enforce.sorting=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.sorting) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
      res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> sql("set hive.enforce.bucketing=false")
      20/07/11 10:24:01 WARN SetCommand: 'SET hive.enforce.bucketing=false' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.enforce.bucketing) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
      res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
      
      scala> spark.range(0,1000).map { x => (x, x + 1, x + 2, x + 3) }.
        toDF("a", "b", "c", "d").createOrReplaceTempView("df")
      
           | 
      scala> 
      scala> sql("insert into bucketed select * from df")
      20/07/11 10:24:15 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:24:16 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
      20/07/11 10:24:16 WARN InsertIntoHiveTable: Output Hive table `default`.`bucketed` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive. Inserting data anyways since both hive.enforce.bucketing and hive.enforce.sorting are set to false.
      20/07/11 10:24:19 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      res3: org.apache.spark.sql.DataFrame = []
      
      scala> 
      -
      -
      -
      
      0: jdbc:hive2://localhost:10000> -- back in beeline
      0: jdbc:hive2://localhost:10000> show create table bucketed;
      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `bucketed`(                           |
      |   `a` int,                                         |
      |   `b` int,                                         |
      |   `c` int,                                         |
      |   `d` int)                                         |
      | CLUSTERED BY (                                     |
      |   c)                                               |
      | INTO 10 BUCKETS                                    |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'file:/Users/bruce/hadoop/apache-hive-2.3.7-bin/warehouse/bucketed' |
      | TBLPROPERTIES (                                    |
      |   'transient_lastDdlTime'='1594488259')            |
      +----------------------------------------------------+
      18 rows selected (0.041 seconds)
      0: jdbc:hive2://localhost:10000> 
      

      Again, the SORTED BY specification disappears.
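One plausible reason the spec cannot round-trip (an assumption on my part, not confirmed in this ticket): Spark's catalog-side bucket spec (`BucketSpec` in `org.apache.spark.sql.catalyst.catalog`) records only `sortColumnNames`, with no per-column direction, so a SORTED BY clause containing DESC has no lossless representation on the Spark side. A hypothetical converter with that shape is forced to either misreport the direction or drop the spec entirely:

```java
import java.util.*;

public class LossyBucketSpec {
    // Simplified stand-in for Spark's BucketSpec: sort columns are names only,
    // so there is nowhere to record ASC vs DESC
    static class BucketSpec {
        final int numBuckets;
        final List<String> bucketColumnNames;
        final List<String> sortColumnNames;
        BucketSpec(int n, List<String> b, List<String> s) {
            numBuckets = n; bucketColumnNames = b; sortColumnNames = s;
        }
    }

    // Hypothetical converter: an all-ascending Hive sort spec maps cleanly,
    // but any DESC column makes the spec unrepresentable, so it returns
    // empty (and a caller writing the table back persists no sort spec)
    static Optional<BucketSpec> fromHive(int numBuckets, List<String> bucketCols,
                                         Map<String, Integer> sortCols /* 1=ASC, 0=DESC */) {
        boolean allAscending = sortCols.values().stream().allMatch(o -> o == 1);
        if (!allAscending) return Optional.empty();
        return Optional.of(new BucketSpec(numBuckets, bucketCols,
                new ArrayList<>(sortCols.keySet())));
    }

    public static void main(String[] args) {
        Map<String, Integer> sortCols = new LinkedHashMap<>();
        sortCols.put("c", 1); // c ASC
        sortCols.put("d", 0); // d DESC
        System.out.println(fromHive(10, List.of("c"), sortCols).isPresent()); // false: spec dropped
    }
}
```

Under this model the title's qualifier makes sense: a table sorted only ASC converts losslessly, while the `(c asc, d desc)` table from the examples has no Spark-side representation to write back.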

          People

            Assignee: Unassigned
            Reporter: Bruce Robbins (bersprockets)
            Votes: 0
            Watchers: 4
