IMPALA-2521

Introduce CLUSTERED plan hint for insert statements

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2, Impala 2.3.0, Impala 2.4.0, Impala 2.5.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Frontend

      Description

      Add a new "clustered" plan hint for insert statements.

      Example:

      CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
      INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
      

      The hint specifies that the data fed into the table sink should be clustered based on the partition columns.
      For now, we'll use a local sort to achieve clustering, and the plan should look like this:
      SCAN -> SORT (year,month) -> TABLE SINK
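
      To illustrate why the local sort helps, here is a minimal Python sketch (not Impala code). It assumes a hypothetical row layout with year and month fields and counts how many contiguous runs of equal partition key the sink sees. A sink that keeps only one partition open at a time has to open a writer once per run, so after the sort the number of runs drops to the number of distinct partitions. The names partition_key, local_sort and count_partition_runs are made up for this example.

```python
from itertools import groupby


def partition_key(row):
    """Partition columns (year, month) of a row; the row layout is hypothetical."""
    return (row["year"], row["month"])


def local_sort(rows):
    """Stand-in for the SORT node: order rows by the partition columns."""
    return sorted(rows, key=partition_key)


def count_partition_runs(rows):
    """Count contiguous runs of rows that share the same partition key.

    A sink that writes one partition at a time opens a writer once per run.
    """
    return sum(1 for _ in groupby(rows, key=partition_key))


# Hypothetical input, as a scan might deliver it (partitions interleaved).
src = [
    {"year": 2016, "month": 2, "v": "a"},
    {"year": 2015, "month": 12, "v": "b"},
    {"year": 2016, "month": 2, "v": "c"},
    {"year": 2015, "month": 12, "v": "d"},
    {"year": 2016, "month": 1, "v": "e"},
]

unsorted_runs = count_partition_runs(src)
clustered_runs = count_partition_runs(local_sort(src))
distinct_partitions = len({partition_key(r) for r in src})
```

      In this toy input every row switches partition before the sort, while after the sort each partition appears as a single contiguous run.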

      Syntax and behavior

      INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
      
      • We will not support the legacy hint style with brackets: [clustered]
      • The hint should be obeyed if the target table is a partitioned HDFS or Kudu table. Otherwise, it should be ignored with a warning.
      • For Kudu tables, the sorting should be done on the primary keys.
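
      The rules above can be sketched as a small decision function. This is only an illustration in Python, not the frontend implementation: the Table class, the function name and the warning text are hypothetical, and it assumes that every Kudu table qualifies while an HDFS table qualifies only if it has partition columns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Table:
    """Hypothetical, simplified description of an insert target."""
    kind: str  # "hdfs", "kudu", or anything else
    partition_cols: List[str] = field(default_factory=list)
    primary_keys: List[str] = field(default_factory=list)


def clustered_sort_columns(table: Table) -> Tuple[List[str], Optional[str]]:
    """Return (sort columns, warning) for an insert carrying the clustered hint.

    An empty column list means no sort node is added.
    """
    if table.kind == "kudu":
        # Kudu targets: sort on the primary keys.
        return list(table.primary_keys), None
    if table.kind == "hdfs" and table.partition_cols:
        # Partitioned HDFS targets: sort on the partition columns.
        return list(table.partition_cols), None
    # Any other target: ignore the hint and warn (message text is made up).
    return [], "Ignoring clustered hint: target is not a partitioned HDFS or Kudu table"


hdfs_cols, hdfs_warn = clustered_sort_columns(
    Table(kind="hdfs", partition_cols=["year", "month"]))
kudu_cols, kudu_warn = clustered_sort_columns(
    Table(kind="kudu", primary_keys=["id"]))
plain_cols, plain_warn = clustered_sort_columns(Table(kind="hdfs"))
```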

      Making clustered the default plan

      Eventually, we want to make the "clustered" plan the default because it is more robust with large inserts into many partitions. With that in mind, we should also add a corresponding "noclustered" hint that removes the sort. Of course, that hint will not do anything until we change the default behavior, but we should add it nevertheless to have the hints complete.

          Activity

          szama_impala_6295 Marcell Szabo added a comment -

          To avoid confusion, it might make sense to make this clause part of the INSERT statement with the following order:

          INSERT INTO TABLE b SORT BY x
          SELECT x, y FROM a

          This would signal to users that the SORT is really about how the rows are written into the table, and not about the query itself. The grammar definition would also be slightly cleaner.

          tarmstrong Tim Armstrong added a comment -

          This should be more viable now that the sorter's buffer management works.

          dhecht Dan Hecht added a comment -

          I don't think there's anything needed in the backend, is there? Our operator is already a local sort operator (the merge is handled by the exchange).

          alex.behm Alexander Behm added a comment -

          Dan Hecht, we'll need to make changes to the table sinks to exploit the sorted input stream, but you are right that the bulk of the work is in the FE.

          mmokhtar Mostafa Mokhtar added a comment -

          Dimitris Tsirogiannis, this should help Kudu ETL, correct?

          dhecht Dan Hecht added a comment -

          Alexander Behm, that's IMPALA-2523, which is a subtask of the same parent.

          lv Lars Volker added a comment -

          IMPALA-2521: Add clustered hint to insert statements

          This change introduces a clustered/noclustered hint for insert
          statements. Specifying this hint adds an additional sort node to the
          plan, just before the table sink. This has the effect that data will be
          clustered by its partition prior to writing partitions, which therefore
          can be written sequentially.

          Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14
          Reviewed-on: http://gerrit.cloudera.org:8080/4745
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Internal Jenkins


            People

            • Assignee: Lars Volker (lv)
            • Reporter: Mostafa Mokhtar (mmokhtar)
            • Votes: 0
            • Watchers: 6
