[SPARK-18243] Sub-task of SPARK-15691: Refactor and improve Hive support

Converge the insert path of Hive tables with data source tables


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      Inserting data into Hive tables currently has its own implementation, distinct from the data source write path: InsertIntoHiveTable, SparkHiveWriterContainer, and SparkHiveDynamicPartitionWriterContainer.

      I think it should be possible to unify these with the data source implementation, InsertIntoHadoopFsRelationCommand. We can start by implementing an OutputWriterFactory/OutputWriter that uses Hive's SerDes to write data, as sketched below.
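
      As a hedged sketch of that idea (the class name, constructor parameters, and row type below are illustrative assumptions, not the final API), such a writer would serialize each row through the table's SerDe and hand the resulting Writable to Hive's record writer:

{code:scala}
import org.apache.hadoop.hive.ql.exec.FileSinkOperator
import org.apache.hadoop.hive.serde2.Serializer
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

// Hypothetical writer: all names here are assumptions for illustration.
// A real implementation would extend Spark's OutputWriter and convert each
// InternalRow field into the Java objects the ObjectInspector describes.
class HiveSerDeWriter(
    serializer: Serializer,                       // instantiated from the table's SerDe class
    recordWriter: FileSinkOperator.RecordWriter,  // Hive's writer for the table's output format
    inspector: ObjectInspector) {                 // describes the row layout to the SerDe

  // Serialize one row with the SerDe, then pass the Writable to Hive's writer.
  def write(row: AnyRef): Unit =
    recordWriter.write(serializer.serialize(row, inspector))

  // Hive's RecordWriter takes an "abort" flag on close; false means commit normally.
  def close(): Unit = recordWriter.close(false)
}
{code}

      An OutputWriterFactory counterpart would then construct one such writer per output file, wiring in the SerDe and output format recorded in the table's storage properties.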

      Note that one other major difference is the commit protocol: data source tables write directly to the final destination without a staging directory, and Spark itself then adds the partitions/tables to the catalog. Hive tables write to a staging directory first, and then call the Hive metastore's loadPartition/loadTable functions to load that data in.
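
      To make that difference concrete, here is a minimal self-contained model of the two commit flows (the paths, file names, and the move-based "load" are simplifications for illustration; the real Hive path goes through the metastore's loadTable/loadPartition):

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object InsertFlows {
  // Data source tables: tasks write directly under the final table location;
  // Spark itself then registers the new partition/table with the catalog
  // (the metastore call is elided here).
  def dataSourceStyle(tableDir: Path): Unit = {
    val partition = tableDir.resolve("ds=2016-11-02")
    Files.createDirectories(partition)
    Files.write(partition.resolve("part-00000"), "row\n".getBytes(StandardCharsets.UTF_8))
  }

  // Hive tables: tasks write under a staging directory first; loadTable/
  // loadPartition then moves the files into the final location and updates
  // the metastore. The move below stands in for that load call.
  def hiveStyle(tableDir: Path, stagingDir: Path): Unit = {
    Files.createDirectories(stagingDir)
    Files.write(stagingDir.resolve("part-00000"), "row\n".getBytes(StandardCharsets.UTF_8))
    Files.createDirectories(tableDir)
    Files.move(
      stagingDir.resolve("part-00000"),
      tableDir.resolve("part-00000"),
      StandardCopyOption.REPLACE_EXISTING)
  }
}
{code}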

          People

            Assignee: Wenchen Fan (cloud_fan)
            Reporter: Reynold Xin (rxin)
            Votes: 0
            Watchers: 6
