Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: ghx-label-8
Description
Currently Impala doesn't support the LOAD DATA statement for Iceberg tables.
Some user workflows still rely on this statement, so it would be nice to implement it in some way.
The parameter to LOAD DATA can be a directory or a single file.
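For reference, the request is that the existing Impala LOAD DATA syntax also work when the target is an Iceberg table, e.g. (database, table, and path names below are only illustrative):

    LOAD DATA INPATH '/user/impala/staging/new_data'
    INTO TABLE iceberg_db.events;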
A possible solution would be to:
- Create an external table:
  - If the parameter is a single file, we can use IMPALA-10934 to define an external table on this single file.
  - If the parameter is a directory, we need to create an external table using the directory as the table location. To get the table schema we could use CREATE TABLE LIKE PARQUET/ORC.
- Run an INSERT INTO iceberg_table SELECT * FROM tmp_table.
- Drop the tmp table (not sure if we want to keep or remove the original files).
It does some copying, but this would probably be the safest solution.
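A minimal sketch of the proposed sequence for the directory case with Parquet files, using hypothetical staging-path and table names:

    -- Derive the schema from one of the staged data files and point a
    -- temporary external table at the staging directory.
    CREATE EXTERNAL TABLE tmp_load_events
    LIKE PARQUET '/user/impala/staging/new_data/part-00000.parquet'
    STORED AS PARQUET
    LOCATION '/user/impala/staging/new_data';

    -- Copy the rows into the Iceberg table.
    INSERT INTO iceberg_db.events SELECT * FROM tmp_load_events;

    -- Clean up the temporary table (whether the original data files should
    -- also be removed is an open question).
    DROP TABLE tmp_load_events;

For the single-file case the first statement would instead rely on IMPALA-10934 to define the external table on that one file.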
Users might specify the partition columns in the [PARTITION (partcol1=val1, partcol2=val2 ...)] clause. In this case the data files don't necessarily contain the partition values, so we need to create the tmp table with proper partitioning.
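A hedged sketch of the partitioned case, assuming a LOAD DATA ... PARTITION (event_day='2022-01-01') statement and the same hypothetical names as above:

    -- Create the tmp table with the partition column from the PARTITION
    -- clause, since the staged data files don't contain it.
    CREATE EXTERNAL TABLE tmp_load_events
    LIKE PARQUET '/user/impala/staging/new_data/part-00000.parquet'
    PARTITIONED BY (event_day STRING)
    STORED AS PARQUET;

    -- Register the staging directory as the requested partition, so the
    -- constant partition value becomes visible as a regular column when
    -- the tmp table is queried.
    ALTER TABLE tmp_load_events
    ADD PARTITION (event_day='2022-01-01')
    LOCATION '/user/impala/staging/new_data';

The subsequent INSERT ... SELECT into the Iceberg table then picks up the partition value like any other column.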
It's possible to create child queries for a single statement, see https://github.com/apache/impala/blob/master/be/src/service/child-query.h
Currently only COMPUTE STATS uses this mechanism. Its child queries are probably executed in parallel, but for this task the statements above need to be executed sequentially.