Hive
  1. Hive
  2. HIVE-1941

support explicit view partitioning

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.8.0
    • Component/s: Query Processor, Views
    • Labels:
      None

      Description

      Allow creation of a view with an explicit partitioning definition, and support ALTER VIEW ADD/DROP PARTITION for instantiating partitions.

      For more information, see

      http://wiki.apache.org/hadoop/Hive/PartitionedViews

      1. HIVE-1941.1.patch
        57 kB
        John Sichi
      2. HIVE-1941.2.patch
        59 kB
        John Sichi
      3. HIVE-1941.3.patch
        64 kB
        John Sichi
      4. HIVE-1941.4.patch
        67 kB
        John Sichi
      5. HIVE-1941.5.patch
        68 kB
        John Sichi

        Issue Links

          Activity

          Hide
          Paul Yang added a comment -

          Committed. Thanks John!

          Show
          Paul Yang added a comment - Committed. Thanks John!
          Hide
          Paul Yang added a comment -

          +1 tests passed

          Show
          Paul Yang added a comment - +1 tests passed
          Hide
          John Sichi added a comment -

          Ready for commit.

          Show
          John Sichi added a comment - Ready for commit.
          Hide
          John Sichi added a comment -

          HIVE-1941.5 adds metastore-level checks and fixes conflicts with latest trunk.

          I'm rerunning tests now.

          Show
          John Sichi added a comment - HIVE-1941 .5 adds metastore-level checks and fixes conflicts with latest trunk. I'm rerunning tests now.
          Hide
          Paul Yang added a comment -

          Patch looks good once we have the aforementioned changes.

          Show
          Paul Yang added a comment - Patch looks good once we have the aforementioned changes.
          Hide
          Paul Yang added a comment -

          @John - Yes, that's what I meant. I'll take a look at the whole patch as well.

          Show
          Paul Yang added a comment - @John - Yes, that's what I meant. I'll take a look at the whole patch as well.
          Hide
          He Yongqiang added a comment -

          Paul, can you help review the whole patch? I am still in middle of sth else...

          Show
          He Yongqiang added a comment - Paul, can you help review the whole patch? I am still in middle of sth else...
          Hide
          John Sichi added a comment -

          Paul, just to confirm: alter_view_failure4.q already verifies that we can't set a location from the QL level, but you're saying we should add a metastore-level check too, right?

          Show
          John Sichi added a comment - Paul, just to confirm: alter_view_failure4.q already verifies that we can't set a location from the QL level, but you're saying we should add a metastore-level check too, right?
          Hide
          Paul Yang added a comment -

          Similarly, we should handle the case when calling append_partition() on a view.

          Show
          Paul Yang added a comment - Similarly, we should handle the case when calling append_partition() on a view.
          Hide
          Paul Yang added a comment -

          It looks like it's possible with the current thrift add_partition() method to create a partition for a view with a non-null SD/location. Can we put in a check to guard against this case?

          Other than that, it looks good from the metastore/replication side.

          Show
          Paul Yang added a comment - It looks like it's possible with the current thrift add_partition() method to create a partition for a view with a non-null SD/location. Can we put in a check to guard against this case? Other than that, it looks good from the metastore/replication side.
          Hide
          He Yongqiang added a comment -

          sorry, i was in the middle of other things. will go through the code again.

          Show
          He Yongqiang added a comment - sorry, i was in the middle of other things. will go through the code again.
          Hide
          John Sichi added a comment -

          Yongqiang, does that address your comments?

          Show
          John Sichi added a comment - Yongqiang, does that address your comments?
          Hide
          John Sichi added a comment -

          Opened followup HIVE-1989 for transitivity on predicates over join keys.

          Show
          John Sichi added a comment - Opened followup HIVE-1989 for transitivity on predicates over join keys.
          Hide
          John Sichi added a comment -

          A use case for a partitioned view on top of an unpartitioned table would be when the table is an external table (being loaded by some other mechanism) or a non-native table (e.g. HBase). I think both of these are likely to come up eventually, so we might as well allow them from day one. The view partition would be used to indicate that some particular portion of the dataset has been fully loaded.

          I verified that no join case currently works in strict mode because the predicate does not get pushed down both ways. I will open a followup for that (it's really something we should be optimizing regardless of views instead of forcing the user to duplicate the predicate).

          If you look at my negative test cases, notice that in strict mode we prevent the case where the underlying table partitioning is not satisfied by the view. However, the exception is raised during ADD PARTITION (rather than during CREATE VIEW) because that's when we compile the dummy query.

          Regarding prefixes: as Raghu mentioned for indexing, a prefix would only be relevant if we move to a hierarchical model for partitioning, which we don't currently have.

          Regarding dependencies, we should work on that after HIVE-1073 is in place to track them (we need them in the metastore).

          Show
          John Sichi added a comment - A use case for a partitioned view on top of an unpartitioned table would be when the table is an external table (being loaded by some other mechanism) or a non-native table (e.g. HBase). I think both of these are likely to come up eventually, so we might as well allow them from day one. The view partition would be used to indicate that some particular portion of the dataset has been fully loaded. I verified that no join case currently works in strict mode because the predicate does not get pushed down both ways. I will open a followup for that (it's really something we should be optimizing regardless of views instead of forcing the user to duplicate the predicate). If you look at my negative test cases, notice that in strict mode we prevent the case where the underlying table partitioning is not satisfied by the view. However, the exception is raised during ADD PARTITION (rather than during CREATE VIEW) because that's when we compile the dummy query. Regarding prefixes: as Raghu mentioned for indexing, a prefix would only be relevant if we move to a hierarchical model for partitioning, which we don't currently have. Regarding dependencies, we should work on that after HIVE-1073 is in place to track them (we need them in the metastore).
          Hide
          He Yongqiang added a comment -

          How about the question raised in our internal review:
          the relation between the view's partition key and the underlying tables' partition keys?

          For example:

          Is there a use case to support a view def like:

          CREATE VIEW vp1
          PARTITIONED ON (value)
          AS
          SELECT key, value
          FROM src
          WHERE key=86;

          Because the view is partitioned on value, which is not the partition key of the underlying table. So the Partition on the view seems not very useful.

          But if in an example:

          CREATE VIEW vp2
          PARTITIONED ON (ds)
          AS
          SELECT src.key, srcpart.value, srcpart.ds
          FROM src
          join srcpart
          on src.key = srcpart.key
          ;
          The vp2's ds originate from srcpart. So if there is a partition ds='2011' existing in srcpart, it makes sense to add a partition ds='2011' in vp2.

          Another example:

          create table srcpart_1 (key int, value string) partitioned by (ds string);
          create table srcpart_2 (key int, value string) partitioned by (ds string, hr int);

          CREATE VIEW vp2
          PARTITIONED ON (ds, hr)
          AS
          SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr
          FROM srcpart_1
          join srcpart_2
          on srcpart_1.key = srcpart_2.key
          and srcpart_1.ds=srcpart_2.ds
          ;

          From this case the ds originate from srcpart_1, but the join condition put the same ds on srcpart_2. So a query like "select * from vp2 where ds='2000' and hr=11" is very smooth to run.

          Some negative examples:

          create table srcpart_1 (key int, value string) partitioned by (ds string);
          create table srcpart_2 (key int, value string) partitioned by (ds string, hr int);

          CREATE VIEW vp2
          PARTITIONED ON (ds, hr)
          AS
          SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr
          FROM srcpart_1
          join srcpart_2
          on srcpart_1.key = srcpart_2.key
          ;
          We should not support this because it is hard to a partition pruning.

          Another one:
          CREATE VIEW vp2
          PARTITIONED ON (ds, hr)
          AS
          SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr
          FROM srcpart_1
          join srcpart_2
          on srcpart_1.key = srcpart_2.key
          and srcpart_1.ds=srcpart_2.hr
          ;

          we should probably also throw an error this definition.

          I am not sure. But i think it makes sense to include the view's partition keys are from the underlying tables. Even not, we should make sure a prefix of the partition keys are from underlying tables.
          And for partition keys from base tables, we should also make sure they cover a prefix of the base tables' partition keys (not randomly chosen).

          A big problem of this implicit partition relationship between a partitioned view and the base tables is how to manage the relationships to keep consistent. For example, what if the base table's partition got dropped and the view's partition is still there? I think we should be fine here.

          Show
          He Yongqiang added a comment - How about the question raised in our internal review: the relation between the view's partition key and the underlying tables' partition keys? For example: Is there a use case to support a view def like: CREATE VIEW vp1 PARTITIONED ON (value) AS SELECT key, value FROM src WHERE key=86; Because the view is partitioned on value, which is not the partition key of the underlying table. So the Partition on the view seems not very useful. But if in an example: CREATE VIEW vp2 PARTITIONED ON (ds) AS SELECT src.key, srcpart.value, srcpart.ds FROM src join srcpart on src.key = srcpart.key ; The vp2's ds originate from srcpart. So if there is a partition ds='2011' existing in srcpart, it makes sense to add a partition ds='2011' in vp2. Another example: create table srcpart_1 (key int, value string) partitioned by (ds string); create table srcpart_2 (key int, value string) partitioned by (ds string, hr int); CREATE VIEW vp2 PARTITIONED ON (ds, hr) AS SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr FROM srcpart_1 join srcpart_2 on srcpart_1.key = srcpart_2.key and srcpart_1.ds=srcpart_2.ds ; From this case the ds originate from srcpart_1, but the join condition put the same ds on srcpart_2. So a query like "select * from vp2 where ds='2000' and hr=11" is very smooth to run. Some negative examples: create table srcpart_1 (key int, value string) partitioned by (ds string); create table srcpart_2 (key int, value string) partitioned by (ds string, hr int); CREATE VIEW vp2 PARTITIONED ON (ds, hr) AS SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr FROM srcpart_1 join srcpart_2 on srcpart_1.key = srcpart_2.key ; We should not support this because it is hard to a partition pruning. Another one: CREATE VIEW vp2 PARTITIONED ON (ds, hr) AS SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds, srcpart_2.hr FROM srcpart_1 join srcpart_2 on srcpart_1.key = srcpart_2.key and srcpart_1.ds=srcpart_2.hr ; we should probably also throw an error this definition. I am not sure. But i think it makes sense to include the view's partition keys are from the underlying tables. Even not, we should make sure a prefix of the partition keys are from underlying tables. And for partition keys from base tables, we should also make sure they cover a prefix of the base tables' partition keys (not randomly chosen). A big problem of this implicit partition relationship between a partitioned view and the base tables is how to manage the relationships to keep consistent. For example, what if the base table's partition got dropped and the view's partition is still there? I think we should be fine here.
          Hide
          He Yongqiang added a comment -

          will take a look.

          Show
          He Yongqiang added a comment - will take a look.
          Hide
          John Sichi added a comment -

          HIVE-1941.4.patch passes tests. Review board request updated here:

          https://reviews.apache.org/r/390/

          Show
          John Sichi added a comment - HIVE-1941 .4.patch passes tests. Review board request updated here: https://reviews.apache.org/r/390/
          Hide
          John Sichi added a comment -

          HIVE-1941.4.patch fixes a few test failures in the previous patch; rerunning tests now.

          Show
          John Sichi added a comment - HIVE-1941 .4.patch fixes a few test failures in the previous patch; rerunning tests now.
          Hide
          John Sichi added a comment -

          Running HIVE-1941.3.patch through tests now.

          Show
          John Sichi added a comment - Running HIVE-1941 .3.patch through tests now.
          Hide
          John Sichi added a comment -

          Chatted with Paul: hook should get view+view partitions as output, underlying table+table partitions as input.

          Show
          John Sichi added a comment - Chatted with Paul: hook should get view+view partitions as output, underlying table+table partitions as input.
          Hide
          John Sichi added a comment -

          Improvement needed from code review session at Facebook:

          • ALTER VIEW ADD/DROP PARTITION needs to derive dependencies on underlying table partitions in order for replication to work (report them via hook input/outputs)

          Followups needed:

          • Store partition-level dependencies in metastore; use these later for stuff like retrieving stats from underlying table partitions
          • CREATE OR REPLACE VIEW to avoid having to drop all partitions during schema evolution
          • For a join view with join on partition column, optimizer should be smart enough to push predicate down both ways in order to avoid errors in strict mode (investigate whether this is already happening)
          Show
          John Sichi added a comment - Improvement needed from code review session at Facebook: ALTER VIEW ADD/DROP PARTITION needs to derive dependencies on underlying table partitions in order for replication to work (report them via hook input/outputs) Followups needed: Store partition-level dependencies in metastore; use these later for stuff like retrieving stats from underlying table partitions CREATE OR REPLACE VIEW to avoid having to drop all partitions during schema evolution For a join view with join on partition column, optimizer should be smart enough to push predicate down both ways in order to avoid errors in strict mode (investigate whether this is already happening)
          Hide
          John Sichi added a comment -
          Show
          John Sichi added a comment - https://reviews.apache.org/r/390/
          Hide
          John Sichi added a comment -

          HIVE-1941.2.patch should pass tests.

          Show
          John Sichi added a comment - HIVE-1941 .2.patch should pass tests.
          Hide
          John Sichi added a comment -

          HIVE-1941.1.patch is preliminary; I haven't run through tests yet.

          Show
          John Sichi added a comment - HIVE-1941 .1.patch is preliminary; I haven't run through tests yet.
          Hide
          John Sichi added a comment -

          This is the manual approach mentioned in HIVE-1079 comments.

          Show
          John Sichi added a comment - This is the manual approach mentioned in HIVE-1079 comments.

            People

            • Assignee:
              John Sichi
              Reporter:
              John Sichi
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development