[SPARK-19256] Hive bucketing write support - ASF JIRA

Details

Type: Umbrella
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
Fix Version/s: None
Component/s: SQL
Labels: None
Target Version/s: 3.2.0

Description

Update (2020 by Cheng Su):

We use this JIRA to track progress on Hive bucketing write support in Spark. The goal is for Spark to write Hive bucketed tables in a way that is compatible with other compute engines (Hive and Presto).

Current status of Hive bucketed tables in Spark:

Reading Hive bucketed tables is not supported: a bucketed table is read as a non-bucketed table.

Writing Hive ORC and Parquet bucketed tables behaves incorrectly: an ORC/Parquet bucketed table is written as a non-bucketed table (code path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter).

Writing Hive non-ORC/Parquet bucketed tables is disallowed: by default an exception is thrown when writing a non-ORC/Parquet bucketed table (code path: InsertIntoHiveTable). The exception can be disabled by setting the configs `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, in which case the table is written as a non-bucketed table.

Current status of Hive bucketed tables in Hive:

Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash (https://issues.apache.org/jira/browse/HIVE-18910).

Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash.

Hive on Tez: supports zero and multiple files per bucket (https://issues.apache.org/jira/browse/HIVE-14014). For more code pointers on the read path, see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212 .
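
To illustrate how a row maps to a bucket under hivehash, here is a minimal Python sketch. It rests on assumptions that are not stated in this JIRA: that hivehash of an int is the int itself, that hivehash of a string is a 31-based polynomial over its UTF-8 bytes (treated as signed bytes, with 32-bit overflow), and that the bucket id is `(hash & Integer.MAX_VALUE) % numBuckets`. The function names are hypothetical, only these two types are covered, and multi-column keys and the murmur3hash used by Hive 3 are not shown.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, mimicking Java int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def hive_hash_int(value: int) -> int:
    """Assumed hivehash for an int column: the value itself."""
    return to_int32(value)


def hive_hash_string(value: str) -> int:
    """Assumed hivehash for a string column: h = h * 31 + byte over UTF-8 bytes."""
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = to_int32(h * 31 + signed)
    return h


def bucket_id(hash_code: int, num_buckets: int) -> int:
    """Assumed bucket assignment: clear the sign bit, then take the modulus."""
    return (hash_code & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    print(bucket_id(hive_hash_string("abc"), 8))
    print(bucket_id(hive_hash_int(-1), 4))
```

Two engines place a row in the same bucket file only if they agree on both the hash function and the modulus step, which is why writing with a different hash breaks compatibility even when the bucket count matches.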

Current status of Hive bucketed tables in Presto (using presto-sql as the reference here):

Supports writing bucketed tables with Hive murmur3hash and hivehash (https://github.com/prestosql/presto/pull/1697).

Supports zero and multiple files per bucket (https://github.com/prestosql/presto/pull/822).
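
One consequence of allowing zero or multiple files per bucket is that a reader can no longer assume exactly one file per bucket id. The sketch below groups file names by bucket under the assumption that files follow a Hive-style `000001_0` / `000001_0_copy_1` naming, where the leading digits are the bucket id. The regex and function name are hypothetical, and other engines may name their files differently.

```python
import re

# Assumed Hive-style bucket file name: <bucket id>_<attempt>[_copy_<n>]
BUCKET_FILE = re.compile(r"^(\d+)_\d+(?:_copy_\d+)?$")


def group_files_by_bucket(file_names, num_buckets):
    """Return {bucket id: [file names]}, keeping empty buckets as empty lists."""
    groups = {b: [] for b in range(num_buckets)}
    for name in file_names:
        match = BUCKET_FILE.match(name)
        if match is None:
            raise ValueError(f"not a recognized bucket file name: {name}")
        bucket = int(match.group(1))
        if bucket >= num_buckets:
            raise ValueError(f"bucket id {bucket} out of range in: {name}")
        groups[bucket].append(name)
    return groups


if __name__ == "__main__":
    files = ["000000_0", "000002_0", "000002_0_copy_1"]
    for bucket, names in group_files_by_bucket(files, 4).items():
        print(bucket, names)
```

In this example buckets 1 and 3 have no files and bucket 2 has two, and both cases are treated as valid rather than as a corrupt table.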

TL;DR: the goal is to achieve Hive bucketed table compatibility across Spark, Presto and Hive. With this JIRA, we need to add support for writing Hive bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y).
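
The mapping from Hive version to bucketing hash function stated above can be written down as a small helper. This is only an illustrative sketch: the function name and the returned labels are hypothetical and do not correspond to an existing Spark config or API.

```python
def hive_bucketing_hash(hive_version: str) -> str:
    """Pick the bucketing hash for a Hive version string such as '2.3.9'."""
    major = int(hive_version.split(".")[0])
    if major >= 3:
        return "murmur3hash"  # Hive 3.0.0 and later
    if major in (1, 2):
        return "hivehash"  # Hive 1.x.y and 2.x.y
    raise ValueError(f"unsupported Hive version: {hive_version}")


if __name__ == "__main__":
    for version in ("1.2.1", "2.3.9", "3.1.2"):
        print(version, hive_bucketing_hash(version))
```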

Allowing Spark to read Hive bucketed tables efficiently needs a more radical change, so we decided to wait until data source v2 supports bucketing and to implement the read path on data source v2. The read path will not be covered by this JIRA.

Original description (2017 by Tejas Patil):

JIRA to track design discussions and tasks related to Hive bucketing support in Spark.

Proposal: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing

Issue Links

is duplicated by

SPARK-21649 Support writing data into hive bucket table. (Resolved)

links to

[Github] Pull Request #19001 (tejasapatil)

[Github] Pull Request #20206 (tejasapatil)

Sub-Tasks

1.	Hive hash implementation	Resolved	Tejas Patil
2.	Enable creating hive bucketed tables	Resolved	Tejas Patil
3.	Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables	Resolved	Unassigned
4.	Configurable bucketing info extraction	Resolved	Unassigned
5.	Propagate bucketing information for Hive tables to / from Catalog	Resolved	Unassigned
6.	Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing	Open	Unassigned
7.	[SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort	Resolved	Cheng Su
8.	Introduce new API to FileCommitProtocol allow flexible file naming	Resolved	Cheng Su
9.	Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)	Resolved	Cheng Su
10.	Support writing Hive non-ORC/Parquet bucketed table	Resolved	Cheng Su
11.	Add Hive Murmur3Hash expression	Open	Unassigned
12.	Write Hive ORC/Parquet bucketed table with hive murmur3hash (for Hive 3)	Open	Unassigned
13.	Mark legacy file naming functions as deprecated in FileCommitProtocol	Resolved	Cheng Su

People

Assignee: Unassigned

Reporter: Tejas Patil

Votes: 28

Watchers: 82

Dates

Created: 17/Jan/17 05:47

Updated: 06/Mar/24 03:57