Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
Reduce the memory requirements of INSERTs into partitioned tables.
Impala inserts into partitioned Parquet tables suffer from high memory requirements because each Impala Daemon will keep ~256MB of buffer space per open partition in the table sink. This often leads to large insert jobs hitting "Memory limit exceeded" errors. The behavior can be improved by pre-clustering the data such that only one partition needs to be buffered at a time in the table sink.
Add a new "clustered" plan hint for insert statements. Example:
CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT); INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
The hint specifies that the data fed into the table sink should be clustered based on the partition columns. For now, we'll use a sort to achieve clustering, and the plan should look like this:
SCAN -> SORT (year,month) -> TABLE SINK
Give users additional control over the insertion order.
In order to improve compression and/or the effectiveness of min/max pruning, it is desirable to control the order in which rows are inserted into table (mostly for Parquet).
Introduce a "sortby" plan hint for insert statements: Example
CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT); INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ SELECT * FROM src
This would produce the following plan:
SCAN -> SORT(year,month,day,hour) -> TABLE SINK
Improve the sort efficiency
The additional sorting step introduced by both solutions above should be as efficient as possible.
Codegen TupleRowComparator and Tuple::MaterializeExprs.
With more predictable and resource-efficient ETL users will extract more value out of Impala and will need to rely less on slow legacy ETL tools like Hive.
- is duplicated by
IMPALA-1803 Avoid hitting OOM in HdfsTableSink when inserting to Parquet