[IMPALA-2522] Improve the reliability and effectiveness of ETL - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Epic
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
Fix Version/s: None
Component/s: Backend
Labels:
- ETL
- performance

Epic Name:
Improve the reliability and effectiveness of ETL
Target Version:

Product Backlog

Description

Reduce the memory requirements of INSERTs into partitioned tables.

Impala inserts into partitioned Parquet tables suffer from high memory requirements because each Impala Daemon will keep ~256MB of buffer space per open partition in the table sink. This often leads to large insert jobs hitting "Memory limit exceeded" errors. The behavior can be improved by pre-clustering the data such that only one partition needs to be buffered at a time in the table sink.

Add a new "clustered" plan hint for insert statements. Example:

CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;

The hint specifies that the data fed into the table sink should be clustered based on the partition columns. For now, we'll use a sort to achieve clustering, and the plan should look like this:
SCAN -> SORT (year,month) -> TABLE SINK

Give users additional control over the insertion order.

In order to improve compression and/or the effectiveness of min/max pruning, it is desirable to control the order in which rows are inserted into table (mostly for Parquet).

Introduce a "sortby" plan hint for insert statements: Example

CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ SELECT * FROM src

This would produce the following plan:
SCAN -> SORT(year,month,day,hour) -> TABLE SINK

Improve the sort efficiency

The additional sorting step introduced by both solutions above should be as efficient as possible.

Codegen TupleRowComparator and Tuple::MaterializeExprs.

Summary

With more predictable and resource-efficient ETL users will extract more value out of Impala and will need to rely less on slow legacy ETL tools like Hive.

Attachments

Issue Links

is duplicated by

IMPALA-1803 Avoid hitting OOM in HdfsTableSink when inserting to Parquet

Resolved

Activity

People

Assignee:: Lars Volker

Reporter:: Mostafa Mokhtar

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 09/Oct/15 19:01

Updated:: 13/Jun/20 00:03

Resolved:: 13/Jun/20 00:03