Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We should allow queries like this into a partitioned table:

      CREATE TABLE x (a STRING, b STRING, c STRING)
      PARTITIONED BY (ds STRING, ts STRING);
      
      INSERT OVERWRITE TABLE x PARTITION (ds = '2009-12-12')
      SELECT a, b, c, ts FROM xxx;
      

      Basically, this allows users to overwrite multiple partitions at a time.
      The partition values specified in the PARTITION clause (if any) must be a prefix of the partition keys.
      The remaining partition keys go at the end of the SELECT expression list.
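
      As an editorial illustration of the prefix rule (the ts value below is made up), the first statement is allowed because the static value covers ds, a prefix of the partition keys (ds, ts), while the second is not, because ts alone is not a prefix:

      -- Allowed: ds is static, the remaining key ts is taken from the last SELECT expression.
      INSERT OVERWRITE TABLE x PARTITION (ds = '2009-12-12')
      SELECT a, b, c, ts FROM xxx;

      -- Not allowed: ts alone is not a prefix of (ds, ts).
      -- INSERT OVERWRITE TABLE x PARTITION (ts = '00:00:00')
      -- SELECT a, b, c, ds FROM xxx;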

      1. ASF.LICENSE.NOT.GRANTED--HIVE-1002.2.patch
        1.22 MB
        Ning Zhang
      2. ASF.LICENSE.NOT.GRANTED--HIVE-1002.4.patch
        1.26 MB
        Ning Zhang
      3. HIVE-1002.1.patch
        1.07 MB
        Ning Zhang
      4. HIVE-1002.patch
        1.06 MB
        Ning Zhang

        Issue Links

          Activity

          Ning Zhang added a comment -

          Zheng, is this a sub-task for HIVE-936?

          Namit Jain added a comment -

          I think so - we can mark it as a duplicate of HIVE-936.

          Raghotham Murthy added a comment -

          I think it's probably OK to specify which columns to partition on. We should not have an implied ordering/hierarchy of keys. It might be better if the user specifies the order.
          Something like,

          insert overwrite T partitioned on (col[=value], (col[=value])*)
          

          Then users can specify constant values, or, if no value is present, the column value in the data is used. In addition, this will allow users to specify the order of partitioning as well.
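
          As an editorial sketch only (this "partitioned on" syntax was proposed here but is not what was eventually committed), such a statement might look like the following for a table T partitioned by (ds, hr), as used later in this thread:

          -- hr is pinned to a constant; ds has no value, so it is taken from the data.
          -- The order listed here, not the table DDL, would drive the partitioning order.
          INSERT OVERWRITE T PARTITIONED ON (hr = 12, ds)
          SELECT key, value, ds FROM srcpart;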

          Ning Zhang added a comment -

          I think it's good to let the user specify the partition columns just as it is done currently. We will allow the user to leave some partition columns as dynamic partition columns, which means they don't need to give the value at compile time. Which partition a row is inserted into is determined at runtime.

          However, one issue is that if the order of the partition columns in the DML differs from their order in the DDL, we should throw an error when a static partition column follows a dynamic one. For example,

          insert overwrite table T partition (ds, hr=12) select ...
          

          should throw an error. The reason is that the order of the partition columns determines the directory hierarchy (hr is a subdirectory of ds), and this is fixed at CREATE TABLE time. If we allow the above DML, we have to have clear semantics: we should either change all ds partitions that have a subdirectory hr=12, or completely overwrite the table and use a different directory hierarchy (ds being a subdirectory of hr). The first option is potentially very expensive and rarely seen in practice. The second is potentially dangerous, since the user could accidentally enter the wrong order and the whole table would be overwritten rather than some partitions updated. The second case also has a workaround: the user can create another partitioned table with a different partition column ordering and use the above DML to load data.
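
          To make this concrete, an editorial sketch assuming T was created with PARTITIONED BY (ds STRING, hr STRING), so hr is a subdirectory of ds:

          -- Rejected: a static column (hr=12) follows a dynamic one (ds),
          -- contradicting the directory hierarchy fixed at CREATE TABLE time.
          -- INSERT OVERWRITE TABLE T PARTITION (ds, hr=12) SELECT ...;

          -- Accepted: static values form a prefix (ds), and the dynamic column hr
          -- comes last, in the same order as the DDL.
          INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-31', hr)
          SELECT key, value, hr FROM srcpart WHERE ds IS NOT NULL AND hr IS NOT NULL;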

          Ning Zhang added a comment -

          Uploading HIVE-1002.patch for inserting into dynamic partitions. Some examples:

          INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-31', hr) select key, value, hr from srcpart where ds is not null and hr is not null;

          FROM srcpart
          INSERT OVERWRITE TABLE T partition(ds='2010-03-31', hr) select key, value, hr where hr < 24
          INSERT OVERWRITE TABLE T partition(ds='2010-03-31', hr = 'invalid') select key, value, hr where hr >= 24;

          More examples can be found in the .q files in the patch.

          Ning Zhang added a comment -

          Uploading HIVE-1002.1.patch, addressing some issues raised in internal review.

          Namit Jain added a comment -

          Still looking, some minor comments:

          1. Add a parameter to limit the number of partitions that can be inserted into. Add a test for the same.
          2. Add the parameters to hive-default.xml.

          Ted Yu added a comment -

          In ObjectInspectorUtils.partialCopyToStandardObject(), the following line:
          if (j == numCols) {
          should be:
          if (j == startCol + numCols - 1) {

          Ning Zhang added a comment -

          Ted, thanks for taking the time to review. The original code should be correct: j is not incremented for every struct field, but only once i reaches startCol. So j is actually a counter of copied columns, not the column index.

          Ning Zhang added a comment -

          Uploading a new patch HIVE-1002.2.patch. Additional changes include:

          1) Introduce two Hive parameters: hive.exec.max.dynamic.partitions=1000 (the total number of dynamic partitions that can be created) and hive.exec.max.dynamic.partitions.pernode=100 (the maximum number of dynamic partitions that can be created per mapper/reducer). They are introduced primarily to reduce the load on the cluster. If hive.exec.max.dynamic.partitions.pernode is exceeded, a fatal error is raised and the job is killed. hive.exec.max.dynamic.partitions is checked after the job completes, before MoveTask moves the intermediate results to the final location.

          2) Introduce hive.exec.default.partition.name=__HIVE_DEFAULT_PARTITION__. This partition receives tuples whose dynamic partition column value is NULL or the empty string. For values containing special characters such as '%', ':', and '/', we escape them using '%' followed by two hex digits. Unicode characters above 127 are not escaped.
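
          A sketch of how these settings would be exercised from the Hive CLI (the values are just the defaults named above, and the INSERT reuses the example from the earlier comment):

          -- Caps on the number of dynamic partitions one statement may create.
          SET hive.exec.max.dynamic.partitions=1000;
          SET hive.exec.max.dynamic.partitions.pernode=100;

          -- Rows whose dynamic partition value is NULL or '' go to this partition;
          -- special characters in other values are escaped as '%' plus two hex digits.
          SET hive.exec.default.partition.name=__HIVE_DEFAULT_PARTITION__;

          INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-31', hr)
          SELECT key, value, hr FROM srcpart WHERE hr IS NOT NULL;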

          Namit Jain added a comment -

          I will take a look

          Namit Jain added a comment -

          Some minor comments:

          1. What happens if hive.merge.mapfiles or hive.merge.mapredfiles is set to true? Do we throw an error, or just ignore it?
          Can we add a test?
          2. Can you drop the tables that you are creating in the positive tests?

          Namit Jain added a comment -

          2) Introduce hive.exec.default.partition.name=__HIVE_DEFAULT_PARTITION__. This partition receives tuples whose dynamic partition column value is NULL or the empty string. For values containing special characters such as '%', ':', and '/', we escape them using '%' followed by two hex digits. Unicode characters above 127 are not escaped.

          Can you add a test for the above?

          Namit Jain added a comment -

          Otherwise, it looks good - please make the minor changes above and I will commit once the tests pass.

          Ning Zhang added a comment -

          1) hive.merge.mapfiles/mapredfiles will be set to false internally if dynamic partitioning is enabled. I have modified some of the tests to remove the explicit setting of hive.merge.* to false. I'll file a follow-up JIRA to address merging.

          2) DROP TABLE statements were added to the positive tests.

          3) load_dyn_part14.q was added to the positive tests for __HIVE_DEFAULT_PARTITION__. I didn't figure out how to enter Unicode in the Hive CLI ('\u8888' doesn't seem to work). I'll file a follow-up JIRA for this, and another for a negative test case where the exception is thrown after the job finishes (TestNegativeCliDriver doesn't work for that case either).

          Will upload a new patch addressing the above comments.

          Ning Zhang added a comment -

          Uploading a new patch.

          Namit Jain added a comment -

          will take a look

          Namit Jain added a comment -

          Committed. Thanks Ning

          Ning Zhang added a comment -

          The wiki has been updated with this feature's syntax and semantics:

          http://wiki.apache.org/hadoop/Hive/Tutorial#Dynamic-partition_Insert


            People

            • Assignee:
              Ning Zhang
            • Reporter:
              Zheng Shao
            • Votes:
              1
            • Watchers:
              5

              Dates

              • Created:
              • Updated:
              • Resolved:

                Development