Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2522

Improve the reliability and effectiveness of ETL

    XMLWordPrintableJSON

Details

    • Epic
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
    • None
    • Backend
    • Improve the reliability and effectiveness of ETL

    Description

      Reduce the memory requirements of INSERTs into partitioned tables.

      Impala inserts into partitioned Parquet tables suffer from high memory requirements because each Impala Daemon will keep ~256MB of buffer space per open partition in the table sink. This often leads to large insert jobs hitting "Memory limit exceeded" errors. The behavior can be improved by pre-clustering the data such that only one partition needs to be buffered at a time in the table sink.

      Add a new "clustered" plan hint for insert statements. Example:

      CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
      INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
      

      The hint specifies that the data fed into the table sink should be clustered based on the partition columns. For now, we'll use a sort to achieve clustering, and the plan should look like this:
      SCAN -> SORT (year,month) -> TABLE SINK

      Give users additional control over the insertion order.

      In order to improve compression and/or the effectiveness of min/max pruning, it is desirable to control the order in which rows are inserted into table (mostly for Parquet).

      Introduce a "sortby" plan hint for insert statements: Example

      CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
      INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ SELECT * FROM src
      

      This would produce the following plan:
      SCAN -> SORT(year,month,day,hour) -> TABLE SINK

      Improve the sort efficiency

      The additional sorting step introduced by both solutions above should be as efficient as possible.

      Codegen TupleRowComparator and Tuple::MaterializeExprs.

      Summary

      With more predictable and resource-efficient ETL users will extract more value out of Impala and will need to rely less on slow legacy ETL tools like Hive.

      Attachments

        Issue Links

          Activity

            People

              lv Lars Volker
              mmokhtar Mostafa Mokhtar
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: