Description
Update (2020 by Cheng Su):
We use this JIRA to track progress on Hive bucketing write support in Spark. The goal is for Spark to write Hive bucketed tables in a way that is compatible with other compute engines (Hive and Presto).
Current status for Hive bucketed table in Spark:
Reading Hive bucketed tables is not supported: a bucketed table is read as a non-bucketed table.
Writing Hive ORC and Parquet bucketed tables behaves incorrectly: an ORC/Parquet bucketed table is written as a non-bucketed table (code path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter).
Writing Hive bucketed tables in formats other than ORC/Parquet is not allowed: an exception is thrown by default (code path: InsertIntoHiveTable). The exception can be disabled by setting `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, in which case the table is written as a non-bucketed table.
Current status for Hive bucketed table in Hive:
Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash (https://issues.apache.org/jira/browse/HIVE-18910).
Hive 1.x.y and 2.x.y: support writing bucketed tables with Hive hivehash.
Hive on Tez: supports zero and multiple files per bucket (https://issues.apache.org/jira/browse/HIVE-14014). A further code pointer for the read path: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212 .
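To illustrate why the hash function matters for compatibility, here is a minimal Python sketch of hivehash-style bucket assignment as used by Hive 1.x.y and 2.x.y. It is an illustration under stated assumptions, not Spark's or Hive's actual code: the function names are hypothetical, only INT, BIGINT and STRING columns are covered, and murmur3hash (Hive 3.x.y) is not shown. The assumed rules are that an INT hashes to itself, a BIGINT folds its high and low 32 bits, a STRING uses a 31-based rolling hash over its UTF-8 bytes, multiple columns combine as 31 * h + column hash, and the bucket is the non-negative hash modulo the bucket count.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, mimicking Java int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def hive_hash_int(value):
    # Assumption: the hivehash of an INT is the value itself.
    return to_int32(value)


def hive_hash_long(value):
    # Assumption: a BIGINT folds its high and low 32 bits together.
    v = value & 0xFFFFFFFFFFFFFFFF
    return to_int32(v ^ (v >> 32))


def hive_hash_string(value):
    # Assumption: 31-based rolling hash over UTF-8 bytes, bytes treated as signed.
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b > 127 else b
        h = to_int32(31 * h + signed)
    return h


def combine_hashes(column_hashes):
    # Assumption: multi-column bucket keys combine as 31 * h + column_hash.
    h = 0
    for column_hash in column_hashes:
        h = to_int32(31 * h + column_hash)
    return h


def bucket_id(hash_code, num_buckets):
    # Clear the sign bit so the modulo result is never negative.
    return (hash_code & 0x7FFFFFFF) % num_buckets
```

A writer that uses a different hash function (or a different sign handling) places the same row in a different bucket file, which is why the other engines cannot treat such output as bucketed.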
Current status for Hive bucketed table in Presto (taking presto-sql as the reference here):
Supports writing bucketed tables with Hive murmur3hash and hivehash (https://github.com/prestosql/presto/pull/1697).
Supports zero and multiple files per bucket (https://github.com/prestosql/presto/pull/822).
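As a rough illustration of what "zero and multiple files per bucket" means for a reader, the sketch below groups the files of one table directory by bucket. It assumes Hive-style file names whose leading digits encode the bucket number (for example `000001_0`, with an optional `_copy_N` suffix for extra files); the function name and the regular expression are hypothetical and are not taken from Hive, Presto or Spark.

```python
import re

# Assumption: Hive-style bucket file names such as "000001_0" or "000001_0_copy_1",
# where the leading digits are the bucket number.
_BUCKET_FILE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")


def files_by_bucket(file_names, num_buckets):
    """Map every bucket id to its files; a bucket may have zero, one or many files."""
    buckets = {b: [] for b in range(num_buckets)}
    for name in file_names:
        match = _BUCKET_FILE.match(name)
        if match is None:
            raise ValueError(f"not a bucket file name: {name!r}")
        bucket = int(match.group(1))
        if bucket >= num_buckets:
            raise ValueError(f"bucket {bucket} out of range for {num_buckets} buckets")
        buckets[bucket].append(name)
    return buckets
```

A reader that tolerates this layout treats an empty list as an empty bucket and reads all files in a non-empty list as one bucket, rather than requiring exactly one file per bucket.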
TL;DR: the aim is Hive bucketed table compatibility across Spark, Presto and Hive. In this JIRA, we need to add support for writing Hive bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y).
Allowing Spark to read Hive bucketed tables efficiently needs a more radical change, so we decided to wait until data source v2 supports bucketing and to implement the read path on data source v2. The read path is not covered by this JIRA.
Original description (2017 by Tejas Patil):
JIRA to track design discussions and tasks related to Hive bucketing support in Spark.
Proposal : https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
Issue Links
- is duplicated by SPARK-21649 Support writing data into hive bucket table. (Resolved)