  1. Spark
  2. SPARK-19256

Hive bucketing write support

Details

    • Type: Umbrella
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Update (2020 by Cheng Su):

      We use this JIRA to track progress on Hive bucketing write support in Spark. The goal is for Spark to write Hive bucketed tables that are compatible with other compute engines (Hive and Presto).

       

      Current status of Hive bucketed tables in Spark:

      Reading Hive bucketed tables is not supported: Spark reads a bucketed table as a non-bucketed table.

      Writing Hive ORC and Parquet bucketed tables behaves incorrectly: Spark writes an ORC/Parquet bucketed table as a non-bucketed table (code path: InsertIntoHadoopFsRelationCommand -> FileFormatWriter).

      Writing Hive bucketed tables in formats other than ORC/Parquet is not allowed: by default Spark throws an exception (code path: InsertIntoHiveTable). The exception can be disabled by setting `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, in which case the table is written as a non-bucketed table.
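
The three cases above can be summarized as a small decision function. This is only an illustrative model of the behavior described in this ticket, not Spark code: the names `WriteOutcome` and `spark_hive_bucketed_write` are hypothetical, and the two boolean parameters stand in for the `hive.enforce.bucketing` and `hive.enforce.sorting` settings.

```python
from enum import Enum


class WriteOutcome(Enum):
    NON_BUCKETED = "written as a non-bucketed table"
    ERROR = "exception thrown"


def spark_hive_bucketed_write(file_format: str,
                              enforce_bucketing: bool = True,
                              enforce_sorting: bool = True) -> WriteOutcome:
    """Model what happens when Spark writes to a Hive bucketed table.

    Hypothetical helper: it mirrors the status list in this ticket only.
    """
    if file_format.lower() in ("orc", "parquet"):
        # InsertIntoHadoopFsRelationCommand -> FileFormatWriter path:
        # the bucket spec is ignored and no error is raised.
        return WriteOutcome.NON_BUCKETED
    # InsertIntoHiveTable path: the write fails unless both
    # hive.enforce.bucketing and hive.enforce.sorting are set to false.
    if enforce_bucketing or enforce_sorting:
        return WriteOutcome.ERROR
    return WriteOutcome.NON_BUCKETED
```

In this model, a text-format bucketed table fails with the default settings and is written as non-bucketed only when both flags are false; in neither case is the output actually bucketed, which is the gap this JIRA targets.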

       

      Current status of Hive bucketed tables in Hive:

      Hive 3.0.0 and later: supports writing bucketed tables with Hive murmur3hash (https://issues.apache.org/jira/browse/HIVE-18910).

      Hive 1.x.y and 2.x.y: supports writing bucketed tables with Hive hivehash.

      Hive on Tez: supports zero and multiple files per bucket (https://issues.apache.org/jira/browse/HIVE-14014). For a further code pointer on the read path, see https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/metainfo/annotation/OpTraitsRulesProcFactory.java#L183-L212.
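
To make the compatibility requirement concrete, here is a rough sketch of how a row could be assigned to a bucket under hivehash. It rests on assumptions not stated in this ticket: that hivehash of a 32-bit integer is the integer itself, that hivehash of a string is a 31-based polynomial over its UTF-8 bytes read as signed bytes, and that the bucket ID is the hash with the sign bit cleared, modulo the bucket count. The function names are hypothetical, and murmur3hash (Hive 3.x.y) is not covered.

```python
INT_MAX = 0x7FFFFFFF


def _to_int32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value, like Java int overflow."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def hive_hash_int(value: int) -> int:
    # Assumption: hivehash of a 32-bit integer is the integer itself.
    return _to_int32(value)


def hive_hash_string(value: str) -> int:
    # Assumption: h = 31 * h + byte over the UTF-8 bytes, each byte signed.
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b >= 128 else b
        h = _to_int32(h * 31 + signed)
    return h


def bucket_id(hash_code: int, num_buckets: int) -> int:
    # Clear the sign bit first so the modulo result is never negative.
    return (hash_code & INT_MAX) % num_buckets
```

For example, `bucket_id(hive_hash_int(10), 8)` gives 2. Every engine that writes the table has to reproduce the same hash and the same bucket assignment, otherwise rows land in files that other engines do not expect.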

       

      Current status of Hive bucketed tables in Presto (using presto-sql as the reference):

      Supports writing bucketed tables with Hive murmur3hash and hivehash (https://github.com/prestosql/presto/pull/1697).

      Supports zero and multiple files per bucket (https://github.com/prestosql/presto/pull/822).
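
As an illustration of what "zero and multiple files per bucket" means for a reader, the sketch below groups data files by bucket ID and tolerates empty buckets as well as buckets with several files. It assumes Hive-style file names such as `000001_0`, where the leading digits are the bucket ID; this naming convention is an assumption made for the example, other engines may name files differently, and `group_files_by_bucket` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Assumption: Hive-style names like "000001_0" or "000001_0_copy_1",
# where the digits before the first underscore are the bucket ID.
_BUCKET_FILE = re.compile(r"^(\d+)_\d+")


def group_files_by_bucket(file_names: List[str],
                          num_buckets: int) -> Dict[int, List[str]]:
    """Map every bucket ID to its files; a bucket may have zero or many."""
    buckets: Dict[int, List[str]] = {b: [] for b in range(num_buckets)}
    for name in file_names:
        match = _BUCKET_FILE.match(name)
        if match is None:
            raise ValueError(f"not a bucket file name: {name!r}")
        bucket = int(match.group(1))
        if bucket >= num_buckets:
            raise ValueError(f"bucket {bucket} out of range for {name!r}")
        buckets[bucket].append(name)
    return buckets
```

With three buckets and the files `000000_0`, `000000_0_copy_1`, and `000002_0`, bucket 0 has two files, bucket 1 has none, and bucket 2 has one. A reader that insists on exactly one file per bucket would reject this layout.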

       

      TL;DR: the goal is Hive bucketed table compatibility across Spark, Presto, and Hive. Under this JIRA, we need to add support for writing Hive bucketed tables with Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y).

       

      Letting Spark read Hive bucketed tables efficiently requires a more radical change, so we decided to wait until data source v2 supports bucketing and to implement the read path there. The read path is not covered by this JIRA.

       

      Original description (2017 by Tejas Patil):

      JIRA to track design discussions and tasks related to Hive bucketing support in Spark.

      Proposal: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing

          People

            Assignee: Unassigned
            Reporter: Tejas Patil (tejasp)
            Votes: 28
            Watchers: 82
