Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-1654

Impala needs to support all operators in drop partitions (<, >, <>, !=, <=, >=) like hive does

    Details

      Description

      In hive we can do this
      ALTER TABLE foo DROP PARTITION(ds < 'date')

      And also have drop partitions in a range with all these operators.
      <, >, <>, !=, <=, >=

      https://issues.apache.org/jira/browse/HIVE-2908

      We should have the same feature in impala too.

        Issue Links

          Activity

          Hide
          amosbird Amos Bird added a comment -

          , Working on it.

          Show
          amosbird Amos Bird added a comment - , Working on it.
          Hide
          amosbird Amos Bird added a comment -

          Currently I've made PartitionSpec contain a set of binary predicates instead of KVs so that all statements related to partition could be extended to operate on multi partitions. I need to figure out how to evaluate those string binary predicates in the frontend.

          table impala1654 partitioned by j, k

          name type
          i int
          j string
          k long

          We can do:

          1. show files in impala1654 partition (j < "2", k = 3);
          2. alter table impala1654 partition (j < "2", k > 3) set cached in 'pool';
          // j can appear more than once
          3. alter table impala1654 partition (j < "2", j > "0", k <> 4) set uncached;
          // missing k means all values of k
          4. alter table impala1654 partition (j < "2") set fileformat textfile;
          5. alter table impala1654 partition (j < "2") set serdeproperties ( "key" = "value");
          6. alter table impala1654 partition (j < "2") set tblproperties ( "key" = "value");
          7. alter table impala1654 drop partition (j < "2");
          7. compute incremental stats impala1654 partition(j < "2");

          We need to verify some statements which are one partition only on analyzing phase. Currently only support equations.

          1. load data inpath '/path/from' into table impala1654 partition (j = "2", k = 3);
          2. alter table impala1654 add partition (j = "2", k = 3);
          3. alter table impala1654 partition (j = "2", k = 3) set location '/path/to';
          4. insert into impala1654 partition (j = "2", k = 3) values (1), (2), (3);

          Show
          amosbird Amos Bird added a comment - Currently I've made PartitionSpec contain a set of binary predicates instead of KVs so that all statements related to partition could be extended to operate on multi partitions. I need to figure out how to evaluate those string binary predicates in the frontend. table impala1654 partitioned by j, k name type i int j string k long We can do: 1. show files in impala1654 partition (j < "2", k = 3); 2. alter table impala1654 partition (j < "2", k > 3) set cached in 'pool'; // j can appear more than once 3. alter table impala1654 partition (j < "2", j > "0", k <> 4) set uncached; // missing k means all values of k 4. alter table impala1654 partition (j < "2") set fileformat textfile; 5. alter table impala1654 partition (j < "2") set serdeproperties ( "key" = "value"); 6. alter table impala1654 partition (j < "2") set tblproperties ( "key" = "value"); 7. alter table impala1654 drop partition (j < "2"); 7. compute incremental stats impala1654 partition(j < "2"); We need to verify some statements which are one partition only on analyzing phase. Currently only support equations. 1. load data inpath '/path/from' into table impala1654 partition (j = "2", k = 3); 2. alter table impala1654 add partition (j = "2", k = 3); 3. alter table impala1654 partition (j = "2", k = 3) set location '/path/to'; 4. insert into impala1654 partition (j = "2", k = 3) values (1), (2), (3);
          Hide
          jbapple Jim Apple added a comment -

          Amos, when this is done, could you add some notes on how a user can use this feature (i.e., syntax), and pass the issue's ownership to John Russell, who writes Impala docs?

          Show
          jbapple Jim Apple added a comment - Amos, when this is done, could you add some notes on how a user can use this feature (i.e., syntax), and pass the issue's ownership to John Russell, who writes Impala docs?
          Hide
          amosbird Amos Bird added a comment -

          The affected syntaxes are:

          ALTER TABLE name { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (partition_spec) [PURGE]
          ALTER TABLE name [PARTITION (partition_spec)]
            SET { FILEFORMAT file_format
            | LOCATION 'hdfs_path_of_directory'
            | TBLPROPERTIES (table_properties)
            | SERDEPROPERTIES (serde_properties) }
          ALTER TABLE name [PARTITION (partition_spec)] SET { CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED }
          COMPUTE INCREMENTAL STATS [db_name.]table_name [PARTITION (partition_spec)]
          LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename  [PARTITION (partcol1=val1, partcol2=val2 ...)]
          
          where 
          partition_spec ::= 
            partition_col=constant_value (verified for statements accepting one partition at a time)
            | partitition_exprs
          
          Show
          amosbird Amos Bird added a comment - The affected syntaxes are: ALTER TABLE name { ADD [IF NOT EXISTS] | DROP [IF EXISTS] } PARTITION (partition_spec) [PURGE] ALTER TABLE name [PARTITION (partition_spec)] SET { FILEFORMAT file_format | LOCATION 'hdfs_path_of_directory' | TBLPROPERTIES (table_properties) | SERDEPROPERTIES (serde_properties) } ALTER TABLE name [PARTITION (partition_spec)] SET { CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED } COMPUTE INCREMENTAL STATS [db_name.]table_name [PARTITION (partition_spec)] LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] where partition_spec ::= partition_col=constant_value (verified for statements accepting one partition at a time) | partitition_exprs
          Hide
          jrussell John Russell added a comment -

          I'm not seeing this syntax in what I use inside Cloudera for Impala 2.7 doc verification. Is it still in process or is it intended to be finished in 2.7?

          Show
          jrussell John Russell added a comment - I'm not seeing this syntax in what I use inside Cloudera for Impala 2.7 doc verification. Is it still in process or is it intended to be finished in 2.7?
          Hide
          alex.behm Alexander Behm added a comment -

          commit 628685ae749daa133c9b7a6184c5523136979974
          Author: Amos Bird <amosbird@gmail.com>
          Date: Wed Nov 25 06:20:15 2015 -0800

          IMPALA-1654: General partition exprs in DDL operations.

          This commit handles partition related DDL in a more general way. We can
          now use compound predicates to specify a list of partitions in
          statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL
          STATS, etc. It will also make sure some statements only accept one
          partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER
          TABLE ADD PARTITION remains using the old PartitionKeyValue's logic.

          The changed partition related DDLs are as follows,

          Table: p (i int) partitioned by (j int, k string)
          Partitions:
          ----------------------------------------------------

          j k #Rows #Files Size Bytes Cached Cache Replication

          ----------------------------------------------------

          1 a -1 0 0B NOT CACHED NOT CACHED
          1 b -1 0 0B NOT CACHED NOT CACHED
          1 c -1 0 0B NOT CACHED NOT CACHED
          2 d -1 0 0B NOT CACHED NOT CACHED
          2 e -1 0 0B NOT CACHED NOT CACHED
          2 f -1 0 0B NOT CACHED NOT CACHED
          Total   -1 0 0B 0B  

          ----------------------------------------------------

          1. show files in p partition (j<2, k='a');
          2. alter table p partition (j<2, k in ("b","c") set cached in 'testPool';

          // j can appear more than once,
          3.1. alter table p partition (j<2, j>0, k<>"d") set uncached;
          // it is the same as
          3.2. alter table p partition (j<2 and j>0, not k="e") set uncached;
          // we can also do 'or'
          3.3. alter table p partition (j<2 or j>0, k like "%") set uncached;

          // missing 'k' matches all values of k
          4. alter table p partition (j<2) set fileformat textfile;
          5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v");
          6. alter table p partition (j is not null) set tblproperties ("k"="v");
          7. alter table p drop partition (j<2);
          8. compute incremental stats p partition(j<2);

          The remaining old partition related DDLs are as follows,

          1. load data inpath '/path/from' into table p partition (j=2, k="d");
          2. alter table p add partition (j=2, k="g");
          3. alter table p partition (j=2, k="g") set location '/path/to';
          4. insert into p partition (j=2, k="g") values (1), (2), (3);

          General partition expressions or partially specified partition specs
          allows partition predicates to return empty partition set no matter
          'IF EXISTS' is specified.

          Examples:

          [localhost.localdomain:21000] >
          alter table p drop partition (j=2, k="f");
          Query: alter table p drop partition (j=2, k="f")
          -------------------------

          summary

          -------------------------

          Dropped 1 partition(s).

          -------------------------
          Fetched 1 row(s) in 0.78s
          [localhost.localdomain:21000] >
          alter table p drop partition (j=2, k<"f");
          Query: alter table p drop partition (j=2, k<"f")
          -------------------------

          summary

          -------------------------

          Dropped 2 partition(s).

          -------------------------
          Fetched 1 row(s) in 0.41s
          [localhost.localdomain:21000] >
          alter table p drop partition (k="a");
          Query: alter table p drop partition (k="a")
          -------------------------

          summary

          -------------------------

          Dropped 1 partition(s).

          -------------------------
          Fetched 1 row(s) in 0.25s
          [localhost.localdomain:21000] > show partitions p;
          Query: show partitions p
          ----------------------------------------------------

          j k #Rows #Files Size Bytes Cached Cache Replication

          ----------------------------------------------------

          1 b -1 0 0B NOT CACHED NOT CACHED
          1 c -1 0 0B NOT CACHED NOT CACHED
          Total   -1 0 0B 0B  

          ----------------------------------------------------
          Fetched 3 row(s) in 0.01s

          Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f

          Reviewed-on: http://gerrit.cloudera.org:8080/3942
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Internal Jenkins

          Show
          alex.behm Alexander Behm added a comment - commit 628685ae749daa133c9b7a6184c5523136979974 Author: Amos Bird <amosbird@gmail.com> Date: Wed Nov 25 06:20:15 2015 -0800 IMPALA-1654 : General partition exprs in DDL operations. This commit handles partition related DDL in a more general way. We can now use compound predicates to specify a list of partitions in statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL STATS, etc. It will also make sure some statements only accept one partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER TABLE ADD PARTITION remains using the old PartitionKeyValue's logic. The changed partition related DDLs are as follows, Table: p (i int) partitioned by (j int, k string) Partitions: ------ - ----- ------ ---- ------------ ------------------ j k #Rows #Files Size Bytes Cached Cache Replication ------ - ----- ------ ---- ------------ ------------------ 1 a -1 0 0B NOT CACHED NOT CACHED 1 b -1 0 0B NOT CACHED NOT CACHED 1 c -1 0 0B NOT CACHED NOT CACHED 2 d -1 0 0B NOT CACHED NOT CACHED 2 e -1 0 0B NOT CACHED NOT CACHED 2 f -1 0 0B NOT CACHED NOT CACHED Total   -1 0 0B 0B   ------ - ----- ------ ---- ------------ ------------------ 1. show files in p partition (j<2, k='a'); 2. alter table p partition (j<2, k in ("b","c") set cached in 'testPool'; // j can appear more than once, 3.1. alter table p partition (j<2, j>0, k<>"d") set uncached; // it is the same as 3.2. alter table p partition (j<2 and j>0, not k="e") set uncached; // we can also do 'or' 3.3. alter table p partition (j<2 or j>0, k like "%") set uncached; // missing 'k' matches all values of k 4. alter table p partition (j<2) set fileformat textfile; 5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v"); 6. alter table p partition (j is not null) set tblproperties ("k"="v"); 7. alter table p drop partition (j<2); 8. compute incremental stats p partition(j<2); The remaining old partition related DDLs are as follows, 1. load data inpath '/path/from' into table p partition (j=2, k="d"); 2. alter table p add partition (j=2, k="g"); 3. alter table p partition (j=2, k="g") set location '/path/to'; 4. insert into p partition (j=2, k="g") values (1), (2), (3); General partition expressions or partially specified partition specs allows partition predicates to return empty partition set no matter 'IF EXISTS' is specified. Examples: [localhost.localdomain:21000] > alter table p drop partition (j=2, k="f"); Query: alter table p drop partition (j=2, k="f") ------------------------- summary ------------------------- Dropped 1 partition(s). ------------------------- Fetched 1 row(s) in 0.78s [localhost.localdomain:21000] > alter table p drop partition (j=2, k<"f"); Query: alter table p drop partition (j=2, k<"f") ------------------------- summary ------------------------- Dropped 2 partition(s). ------------------------- Fetched 1 row(s) in 0.41s [localhost.localdomain:21000] > alter table p drop partition (k="a"); Query: alter table p drop partition (k="a") ------------------------- summary ------------------------- Dropped 1 partition(s). ------------------------- Fetched 1 row(s) in 0.25s [localhost.localdomain:21000] > show partitions p; Query: show partitions p ------ - ----- ------ ---- ------------ ------------------ j k #Rows #Files Size Bytes Cached Cache Replication ------ - ----- ------ ---- ------------ ------------------ 1 b -1 0 0B NOT CACHED NOT CACHED 1 c -1 0 0B NOT CACHED NOT CACHED Total   -1 0 0B 0B   ------ - ----- ------ ---- ------------ ------------------ Fetched 3 row(s) in 0.01s Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f Reviewed-on: http://gerrit.cloudera.org:8080/3942 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins
          Hide
          jbapple Jim Apple added a comment -

          John Russell, this will need user documentation for Impala 2.8. As a reminder, that release does not yet have a schedule or a release manager.

          Show
          jbapple Jim Apple added a comment - John Russell , this will need user documentation for Impala 2.8. As a reminder, that release does not yet have a schedule or a release manager.
          Hide
          jrussell John Russell added a comment -

          While trying this feature, I was pleasantly surprised to see that function calls worked in the expression:

          alter table int_partitions partition (x > extract(day from now())) set fileformat parquet;
          alter table int_partitions add partition (x = extract(day from now()));
          

          That could be very convenient to call TIMESTAMP-manipulating functions for tables partitions on date & time-related fields. Just checking that the function resolution is known and expected behavior.

          I confirmed that IS NULL and IS NOT NULL operators work as expected. Makes sense, I just remember that there are sometimes special cases around partitions where the key value is NULL.

          In cases where the desire is to affect all partitions in a table, is there a shortest / most recommended idiom to touch all existing partitions? That is,

          alter table foo set fileformat parquet; -- Doesn't touch any existing partitions, must use PARTITION clause to change those.
          alter table foo partition (x = 0 or x != 0) set fileformat parquet; -- Changes existing partitions but misses case where X is NULL.
          alter table foo partition (x is null or x is not null) set fileformat parquet; -- Changes all existing partitions. But is there a simpler or more intuitive equivalent?
          
          Show
          jrussell John Russell added a comment - While trying this feature, I was pleasantly surprised to see that function calls worked in the expression: alter table int_partitions partition (x > extract(day from now())) set fileformat parquet; alter table int_partitions add partition (x = extract(day from now())); That could be very convenient to call TIMESTAMP-manipulating functions for tables partitions on date & time-related fields. Just checking that the function resolution is known and expected behavior. I confirmed that IS NULL and IS NOT NULL operators work as expected. Makes sense, I just remember that there are sometimes special cases around partitions where the key value is NULL. In cases where the desire is to affect all partitions in a table, is there a shortest / most recommended idiom to touch all existing partitions? That is, alter table foo set fileformat parquet; -- Doesn't touch any existing partitions, must use PARTITION clause to change those. alter table foo partition (x = 0 or x != 0) set fileformat parquet; -- Changes existing partitions but misses case where X is NULL. alter table foo partition (x is null or x is not null ) set fileformat parquet; -- Changes all existing partitions. But is there a simpler or more intuitive equivalent?
          Hide
          alex.behm Alexander Behm added a comment -

          John Russell good point. There is no simpler way to select all partitions today. Do you think we should allow constant predicates like PARTITION(true) or PARTITION(false)? We initially felt that there was no real use case for this and most likely a user error, but selecting all partitions seems like a reasonable goal.

          Show
          alex.behm Alexander Behm added a comment - John Russell good point. There is no simpler way to select all partitions today. Do you think we should allow constant predicates like PARTITION(true) or PARTITION(false)? We initially felt that there was no real use case for this and most likely a user error, but selecting all partitions seems like a reasonable goal.
          Hide
          jrussell John Russell added a comment -

          Probably best to wait and see if we get any user feedback along those lines.

          Show
          jrussell John Russell added a comment - Probably best to wait and see if we get any user feedback along those lines.
          Hide
          yosinv Yosi added a comment - - edited

          HI All,
          About the condition partition, I was trying to execute the below but keep getting an error, is there a way to overcome that :
          alter table TABLE_NAME drop partition ( partition_key < cast(concat(from_unixtime(
          cast(
          cast('2017-03-03 18:45:00' as timestamp) - interval 2 day as bigint
          ), 'yyyyMMdd')
          ,'-','13785') as string)
          );

          please advise,
          Thanks

          Show
          yosinv Yosi added a comment - - edited HI All, About the condition partition, I was trying to execute the below but keep getting an error, is there a way to overcome that : alter table TABLE_NAME drop partition ( partition_key < cast(concat(from_unixtime( cast( cast('2017-03-03 18:45:00' as timestamp) - interval 2 day as bigint ), 'yyyyMMdd') ,'-','13785') as string) ); please advise, Thanks
          Hide
          jbapple Jim Apple added a comment -

          Yosi, JIRA is not the optimal forum for Q & A. We use it for tracking bugs and feature requests.

          For Q & A, please use user@impala.apache.org.

          Show
          jbapple Jim Apple added a comment - Yosi , JIRA is not the optimal forum for Q & A. We use it for tracking bugs and feature requests. For Q & A, please use user@impala.apache.org.

            People

            • Assignee:
              jrussell John Russell
              Reporter:
              mala_ck Mala Chikka Kempanna
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development