IMPALA / IMPALA-11339

Implement LOAD DATA INPATH for Iceberg tables

Details

    Description

      Currently Impala doesn't support LOAD DATA statements for Iceberg tables.

      Some user workflows still use this statement, so it would be nice to implement it in some way.

      The parameter to LOAD DATA can be a directory or a single file.

      A possible solution would be to

      1. Create an external table
        1. If the parameter is a single file, then we can use IMPALA-10934 to define an external table on this single file
        2. If the parameter is a directory, then we need to create an external table using the directory as table location. To get the table schema we could use CREATE TABLE LIKE PARQUET/ORC
      2. Run INSERT INTO iceberg_table SELECT * FROM tmp_table
      3. Drop the tmp table (it is still open whether the original files should be kept or removed)

      This approach copies the data, but it is probably the safest solution.
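The directory variant of the steps above could be sketched as follows. This is a minimal Python sketch that only builds the statement sequence; the function name, the `tmp_load_table` name and the `sample_file` parameter (a data file inside the directory, used for schema inference) are assumptions for illustration, not part of Impala. The single-file variant is left out because it depends on the syntax introduced by IMPALA-10934.

```python
def plan_load_from_directory(target_table, directory, sample_file,
                             file_format="PARQUET",
                             tmp_table="tmp_load_table"):
    """Return the SQL statements for the proposed workflow, in execution order.

    All names here are hypothetical; only the SQL statement forms
    (CREATE EXTERNAL TABLE ... LIKE PARQUET/ORC, INSERT ... SELECT, DROP TABLE)
    come from the proposal.
    """
    if file_format not in ("PARQUET", "ORC"):
        raise ValueError("schema inference is only sketched for PARQUET and ORC")
    return [
        # 1. External table over the directory, schema taken from a data file.
        f"CREATE EXTERNAL TABLE {tmp_table} LIKE {file_format} '{sample_file}' "
        f"STORED AS {file_format} LOCATION '{directory}'",
        # 2. Copy the rows into the Iceberg table.
        f"INSERT INTO {target_table} SELECT * FROM {tmp_table}",
        # 3. Remove the temporary table definition.
        f"DROP TABLE {tmp_table}",
    ]


if __name__ == "__main__":
    for stmt in plan_load_from_directory(
            "iceberg_table", "/staging/load", "/staging/load/part-0.parq"):
        print(stmt + ";")
```

Because the temporary table is external, dropping it normally leaves the original files in place, so removing them would need a separate, explicit step.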

      Users might specify the partition columns in the [PARTITION (partcol1=val1, partcol2=val2 ...)] clause. In this case the data files don't necessarily contain the partition values, so we need to create the tmp table with matching partitioning.
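One way the PARTITION case might look is sketched below in Python, again only building the statements. The function name, the table name and the shape of `partition_spec` (column name mapped to a SQL type and an already-quoted SQL literal) are assumptions; in a real implementation the column types would presumably come from the target table.

```python
def plan_partitioned_load(target_table, directory, sample_file, partition_spec,
                          file_format="PARQUET", tmp_table="tmp_load_table"):
    """Return SQL statements for a load whose files lack the partition columns.

    partition_spec: dict of column name -> (SQL type, SQL literal), e.g.
    {"year": ("INT", "2022")}. Hypothetical helper, not an Impala API.
    """
    part_cols = ", ".join(
        f"{name} {sql_type}" for name, (sql_type, _) in partition_spec.items())
    part_vals = ", ".join(
        f"{name}={literal}" for name, (_, literal) in partition_spec.items())
    return [
        # Schema from a data file, partition columns declared explicitly.
        f"CREATE EXTERNAL TABLE {tmp_table} LIKE {file_format} '{sample_file}' "
        f"PARTITIONED BY ({part_cols}) STORED AS {file_format}",
        # Attach the input directory as the one partition named in the clause.
        f"ALTER TABLE {tmp_table} ADD PARTITION ({part_vals}) "
        f"LOCATION '{directory}'",
        f"INSERT INTO {target_table} SELECT * FROM {tmp_table}",
        f"DROP TABLE {tmp_table}",
    ]
```

With this layout the partition values are supplied by the tmp table's partition metadata rather than by the files. Whether the column order produced by SELECT * matches the Iceberg table would still need to be checked.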

      It's possible to create child queries for a single statement, see https://github.com/apache/impala/blob/master/be/src/service/child-query.h
      Currently only COMPUTE STATS uses this. Its child queries are probably executed in parallel, but in this task the statements above need to be executed sequentially.
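The sequential requirement can be illustrated with a small Python sketch. It does not use or model the ChildQuery classes in child-query.h; `execute` is a hypothetical callable that runs one statement and raises on failure, and `cleanup` is an optional statement (for example a DROP TABLE IF EXISTS on the tmp table) issued if a step fails.

```python
def run_sequentially(statements, execute, cleanup=None):
    """Run statements one at a time, stopping at the first failure.

    Each statement starts only after the previous one has finished, since
    the INSERT depends on the CREATE and the DROP depends on the INSERT.
    Returns the list of statements that completed.
    """
    completed = []
    try:
        for stmt in statements:
            execute(stmt)
            completed.append(stmt)
    except Exception:
        # Best-effort cleanup so a failed load does not leave the tmp table.
        if cleanup is not None:
            execute(cleanup)
        raise
    return completed
```

Stopping at the first failure matters here: if the INSERT fails, the remaining steps are skipped and only the cleanup statement runs before the error is reported.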

          People

            tmate Tamas Mate
            boroknagyz Zoltán Borók-Nagy