Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4166

Introduce SORT BY clause in CREATE TABLE statement

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Catalog
    • Labels:

      Description

      This issue is intended as a usability improvement for IMPALA-4163 where the SORT BY columns can be specified directly in the table definition like this:

      CREATE TABLE t (day INT, hour INT)
      PARTITIONED BY (year INT, month INT)
      SORT BY (day, hour);
      

      The above table creation has the effect that all inserts into the table have an implicit "sortby(day,hour)" plan hint applied. See IMPALA-4163 for details on the hint.

      Just like with the "sortby" hint the SORT BY clause can only contain non-partition columns for HDFS tables and non-primary key columns for Kudu tables.

      This has the following benefits:

      • Users will not have to remember to put the sortby hint in all insert statements.
      • The SORT BY columns are a physical design choice, so it makes sense to store them as part of the table metadata.
      • This is a convenience feature. It has the same effect as the sortby() hint for INSERT statements, but doesn't require the user to remember to include the hint for every INSERT statement.

      Challenges:

      • The Hive Metastore has no SORT BY concept, so we'll need to store the information in the generic TBLPROPERTIES map.
      • No other engines (Hive, Spark) will understand this table property. That means that data written by those engines will require an explicit sorting hint (as far as that's available).

        Issue Links

          Activity

          Hide
          lv Lars Volker added a comment -

          HBase tables do not allow to use the SORT BY clause. S3 allows setting the option and we also allow inserts into tables on S3. Avro tables allow setting the option, but then we throw an error if you try to insert there.

          Show
          lv Lars Volker added a comment - HBase tables do not allow to use the SORT BY clause. S3 allows setting the option and we also allow inserts into tables on S3. Avro tables allow setting the option, but then we throw an error if you try to insert there.
          Hide
          jrussell John Russell added a comment -

          > Just like with the "sortby" hint the SORT BY clause can only contain non-partition columns for HDFS tables and non-primary key columns for Kudu tables.

          > ERROR: AnalysisException: SORT BY is not supported for Kudu tables.

          Judging by the above error, I presume I should say that SORT BY is not applicable for Kudu tables.

          What's the story for S3 tables, HBase tables, or file formats such as Avro that Impala can't write?

          Show
          jrussell John Russell added a comment - > Just like with the "sortby" hint the SORT BY clause can only contain non-partition columns for HDFS tables and non-primary key columns for Kudu tables. > ERROR: AnalysisException: SORT BY is not supported for Kudu tables. Judging by the above error, I presume I should say that SORT BY is not applicable for Kudu tables. What's the story for S3 tables, HBase tables, or file formats such as Avro that Impala can't write?
          Hide
          lv Lars Volker added a comment -

          IMPALA-4166: Add SORT BY sql clause

          This change adds support for adding SORT BY (...) clauses to CREATE
          TABLE and ALTER TABLE statements. Examples are:

          CREATE TABLE t (i INT, j INT, k INT) PARTITIONED BY (l INT) SORT BY (i, j);
          CREATE TABLE t SORT BY (int_col,id) LIKE u;
          CREATE TABLE t LIKE PARQUET '/foo' SORT BY (id,zip);

          ALTER TABLE t SORT BY (int_col,id);
          ALTER TABLE t SORT BY ();

          Sort columns can only be specified for Hdfs tables and effectiveness may
          vary based on storage type; for example TEXT tables will not see
          improved compression. The SORT BY clause must not contain clustering
          columns. The columns in the SORT BY clause are stored in the
          'sort.columns' table property and will result in an additional SORT node
          being added to the plan before the final table sink. Specifying sort
          columns also enables clustering during inserts, so the SORT node will
          contain all partitioning columns first, followed by the sort columns. We
          do this because sort columns add a SORT node to the plan and adding the
          clustering columns to the SORT node is cheap.

          Sort columns supersede the sortby() hint, which we will remove in a
          subsequent change (IMPALA-5144). Until then, it is possible to specify
          sort columns using both ways at the same time and the column lists
          will be concatenated.

          Change-Id: I08834f38a941786ab45a4381c2732d929a934f75
          Reviewed-on: http://gerrit.cloudera.org:8080/6495
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          lv Lars Volker added a comment - IMPALA-4166 : Add SORT BY sql clause This change adds support for adding SORT BY (...) clauses to CREATE TABLE and ALTER TABLE statements. Examples are: CREATE TABLE t (i INT, j INT, k INT) PARTITIONED BY (l INT) SORT BY (i, j); CREATE TABLE t SORT BY (int_col,id) LIKE u; CREATE TABLE t LIKE PARQUET '/foo' SORT BY (id,zip); ALTER TABLE t SORT BY (int_col,id); ALTER TABLE t SORT BY (); Sort columns can only be specified for Hdfs tables and effectiveness may vary based on storage type; for example TEXT tables will not see improved compression. The SORT BY clause must not contain clustering columns. The columns in the SORT BY clause are stored in the 'sort.columns' table property and will result in an additional SORT node being added to the plan before the final table sink. Specifying sort columns also enables clustering during inserts, so the SORT node will contain all partitioning columns first, followed by the sort columns. We do this because sort columns add a SORT node to the plan and adding the clustering columns to the SORT node is cheap. Sort columns supersede the sortby() hint, which we will remove in a subsequent change ( IMPALA-5144 ). Until then, it is possible to specify sort columns using both ways at the same time and the column lists will be concatenated. Change-Id: I08834f38a941786ab45a4381c2732d929a934f75 Reviewed-on: http://gerrit.cloudera.org:8080/6495 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          lv Lars Volker added a comment -

          Should we implement this as a follow-up to IMPALA-4163?

          Show
          lv Lars Volker added a comment - Should we implement this as a follow-up to IMPALA-4163 ?
          Hide
          alex.behm Alexander Behm added a comment -

          That's correct. Total ordering is not required. The goal is to improve min/max effectiveness as well as compression.

          Show
          alex.behm Alexander Behm added a comment - That's correct. Total ordering is not required. The goal is to improve min/max effectiveness as well as compression.
          Hide
          alan@cloudera.com Alan Choi added a comment -

          It appears that it's not so much about total order, but just to create enough order in each Parquet file to do min/max filtering.

          Show
          alan@cloudera.com Alan Choi added a comment - It appears that it's not so much about total order, but just to create enough order in each Parquet file to do min/max filtering.

            People

            • Assignee:
              lv Lars Volker
              Reporter:
              alex.behm Alexander Behm
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development