Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Hive currently does not support views.
      It would be a very nice feature to have.

      1. HIVE-972.3.patch
        207 kB
        John Sichi
      2. HIVE-972.2.patch
        148 kB
        John Sichi
      3. HIVE-972.1.patch
        120 kB
        John Sichi

        Issue Links

          Activity

          Hide
          Jeff Hammerbacher added a comment -

          What sort of views? HIVE-29 discusses the "virtual" views used in SCOPE that I would find quite valuable.

          In previous discussions with Ashish about priorities, he mentioned that statistics collection (e.g. HIVE-33) was more feasible and important than materialized views. Has there been a change in priority, or is this ticket just a place to gather discussions?

          Thanks,
          Jeff

          Show
          Jeff Hammerbacher added a comment - What sort of views? HIVE-29 discusses the "virtual" views used in SCOPE that I would find quite valuable. In previous discussions with Ashish about priorities, he mentioned that statistics collection (e.g. HIVE-33 ) was more feasible and important than materialized views. Has there been a change in priority, or is this ticket just a place to gather discussions? Thanks, Jeff
          Hide
          Namit Jain added a comment -

          These are standard sql views, which is similar to https://issues.apache.org/jira/browse/HIVE-29.
          Most probably, both of them will end up sharing the same underlying implementation.

          The view will not be materialized. That might be added later.

          Show
          Namit Jain added a comment - These are standard sql views, which is similar to https://issues.apache.org/jira/browse/HIVE-29 . Most probably, both of them will end up sharing the same underlying implementation. The view will not be materialized. That might be added later.
          Hide
          John Sichi added a comment -

          Wiki page for planning out this feature is here; comments welcome:

          http://wiki.apache.org/hadoop/Hive/ViewDev

          Show
          John Sichi added a comment - Wiki page for planning out this feature is here; comments welcome: http://wiki.apache.org/hadoop/Hive/ViewDev
          Hide
          Ashish Thusoo added a comment -

          Pretty comprehensive writeup Here are my comments:

          1. It may be better to just go with a flat model to keep things simple. Also whenever we do materialized views in future you do have an object that is part table and part view and you may just need the flat model anyway at that point. The primary reason though to go with the flat model would be simplicity and less severe schema migration of the metastore schema.

          2. For dependency tracking there is already code in hive that uses pre execution hooks to track lineage. That could easily be used to extract view dependencies (table level dependencies) when you create the view metadata. Raghu also had done some work on column lineage and perhaps that can be used to capture column lineage.

          I think for the first cut we should just go with table dependencies and leave column stuff for later. We should have the lenient dependency invalidation scheme (perhaps for both drops and alters) because at least that way users can inspect view definitions and then fix them later. Accordingly then we would need a flag to mark an invalidated view and maybe some way of looking at that list. I think we can punt the cascade option for now as that seems to be an optimization in the user workflow and could be added later. Thoughts? The restrict though is probably more useful. We could have that be the default in the strict mode (Hive has a strict mode which disallows queries on partitioned tables in case a where clause on the partition column was not specified),

          Not sure on what we should do about temporary functions but if we use views to transform our internal logs to another schema (nectar imps -> context) then we may need it.

          3. I am not sure if supporting limit is important but I can see good use of order by when we do materialized views. The sorted property could be helpful there and would be good to capture. We already capture those for tables.

          4. I think fast path should work seemlessly, once the fast path with filters is done, no?

          5. I think we can punt view modification for now if we support ways of inspecting the view sql for folks.

          Show
          Ashish Thusoo added a comment - Pretty comprehensive writeup Here are my comments: 1. It may be better to just go with a flat model to keep things simple. Also whenever we do materialized views in future you do have an object that is part table and part view and you may just need the flat model anyway at that point. The primary reason though to go with the flat model would be simplicity and less severe schema migration of the metastore schema. 2. For dependency tracking there is already code in hive that uses pre execution hooks to track lineage. That could easily be used to extract view dependencies (table level dependencies) when you create the view metadata. Raghu also had done some work on column lineage and perhaps that can be used to capture column lineage. I think for the first cut we should just go with table dependencies and leave column stuff for later. We should have the lenient dependency invalidation scheme (perhaps for both drops and alters) because at least that way users can inspect view definitions and then fix them later. Accordingly then we would need a flag to mark an invalidated view and maybe some way of looking at that list. I think we can punt the cascade option for now as that seems to be an optimization in the user workflow and could be added later. Thoughts? The restrict though is probably more useful. We could have that be the default in the strict mode (Hive has a strict mode which disallows queries on partitioned tables in case a where clause on the partition column was not specified), Not sure on what we should do about temporary functions but if we use views to transform our internal logs to another schema (nectar imps -> context) then we may need it. 3. I am not sure if supporting limit is important but I can see good use of order by when we do materialized views. The sorted property could be helpful there and would be good to capture. We already capture those for tables. 4. I think fast path should work seemlessly, once the fast path with filters is done, no? 5. I think we can punt view modification for now if we support ways of inspecting the view sql for folks.
          Hide
          Prasad Chakka added a comment -

          .. continuing discussion from hive-users@

          1.
          regarding metadata:
          Are you saying that If a view is created on a partitioned table, it would inherit the base table's partition columns as regular columns? (this will be a common usecase when inserts into multiple partitions is supported) Even if we are not supporting partitions right now, we need to think of the way we would model metadata for partitioned views (not necessarily materialized views) so that the changes made right now will not conflict too much with the future changes.

          I like the flat model as the inheritance model is too much of a overkill and not really suited in this case. But I would rather put the view-def into 'StorageDescriptor' class. This would work well when partitioned views are supported since 'Partition' class will inherit the view definition and the future view schema can evolve while freezing the schema of partitions created earlier.

          The type of the 'view-def' column has to be a CLOB but I am not sure of performance implications of having a CLOB column in a very large table. If it is not advisable then it may be useful to put this CLOB (or view metadata in a separate class and link to StorageDescriptor)

          2.
          > In SQL:200n, a view definition is supposed to be frozen at the time it is created, so that if the view is defined as select * from t, where t is a table with two columns a and b, then later requests to select * from the view should return just columns a and b, even if a new column c is later added to the table. This is implemented correctly by most DBMS products.

          Do you know the reasoning behind this? This would make changing the base table schema very hard and if I am not mistaken Facebook's base table schemas change (mostly addition of new columns) more than ocasionally and it will be an administrative nightmare to change all the dependent views. IMO, '' should represent all columns and if a view creator does not want to inherit changes to base table schema then she can specify the exact columns instead of ''.

          3.
          I like DependencyParticipant idea. We could use simple inheritance strategy where the dependent table contains (table/view id, used_name_of_dependent_obj, obj_id, obj_type)

          4. +1 to Leniant Dependency Invalidation but there should be a command to preview the invalid views.

          Show
          Prasad Chakka added a comment - .. continuing discussion from hive-users@ 1. regarding metadata: Are you saying that If a view is created on a partitioned table, it would inherit the base table's partition columns as regular columns? (this will be a common usecase when inserts into multiple partitions is supported) Even if we are not supporting partitions right now, we need to think of the way we would model metadata for partitioned views (not necessarily materialized views) so that the changes made right now will not conflict too much with the future changes. I like the flat model as the inheritance model is too much of a overkill and not really suited in this case. But I would rather put the view-def into 'StorageDescriptor' class. This would work well when partitioned views are supported since 'Partition' class will inherit the view definition and the future view schema can evolve while freezing the schema of partitions created earlier. The type of the 'view-def' column has to be a CLOB but I am not sure of performance implications of having a CLOB column in a very large table. If it is not advisable then it may be useful to put this CLOB (or view metadata in a separate class and link to StorageDescriptor) 2. > In SQL:200n, a view definition is supposed to be frozen at the time it is created, so that if the view is defined as select * from t, where t is a table with two columns a and b, then later requests to select * from the view should return just columns a and b, even if a new column c is later added to the table. This is implemented correctly by most DBMS products. Do you know the reasoning behind this? This would make changing the base table schema very hard and if I am not mistaken Facebook's base table schemas change (mostly addition of new columns) more than ocasionally and it will be an administrative nightmare to change all the dependent views. IMO, ' ' should represent all columns and if a view creator does not want to inherit changes to base table schema then she can specify the exact columns instead of ' '. 3. I like DependencyParticipant idea. We could use simple inheritance strategy where the dependent table contains (table/view id, used_name_of_dependent_obj, obj_id, obj_type) 4. +1 to Leniant Dependency Invalidation but there should be a command to preview the invalid views.
          Hide
          Prasad Chakka added a comment -

          Sorry, the bold formatting was unintentional and a side effect of saying '*'. It should have read this way

          IMO, '*' should represent all columns and if a view creator does not want to inherit changes to base table schema then she can specify the exact columns instead of '*'.

          Show
          Prasad Chakka added a comment - Sorry, the bold formatting was unintentional and a side effect of saying '*'. It should have read this way IMO, '*' should represent all columns and if a view creator does not want to inherit changes to base table schema then she can specify the exact columns instead of '*'.
          Hide
          John Sichi added a comment -

          Prasad, regarding select *: the static expansion semantics are standard and work the same way in all of the major DBMS's (Oracle, MySQL, SQL Server, etc). The reasoning is that views are often used as insulation layers, so that changes to the underlying schema are only exposed in a controlled fashion (i.e. an application may start tracking some new internal piece of info about a person which is not appropriate to expose via the views accessed by a UI. In systems that support GRANT, privileges are often granted on only a view, not the underlying table, so dynamically expanding a view could leak information. The counterarguments you make are certainly valid, but since the rules have already been set and agreed on by the big guys for a long time in this area, probably best to follow them to minimize surprises for users accustomed to the standard behavior.

          Show
          John Sichi added a comment - Prasad, regarding select *: the static expansion semantics are standard and work the same way in all of the major DBMS's (Oracle, MySQL, SQL Server, etc). The reasoning is that views are often used as insulation layers, so that changes to the underlying schema are only exposed in a controlled fashion (i.e. an application may start tracking some new internal piece of info about a person which is not appropriate to expose via the views accessed by a UI. In systems that support GRANT, privileges are often granted on only a view, not the underlying table, so dynamically expanding a view could leak information. The counterarguments you make are certainly valid, but since the rules have already been set and agreed on by the big guys for a long time in this area, probably best to follow them to minimize surprises for users accustomed to the standard behavior.
          Hide
          Namit Jain added a comment -

          I think we can go with Lenient Dependency Invalidation - which means that the view gets invalid if any DDL happens on the underlying table.
          Column level dependencies can be ignored as of now. The view depends on persistent objects (other tables and views).

          We should support:

          alter view <viewname> compile;

          The views can be automatically compiled when referenced in a query, but may lead to unexpected results for select '*'. So, in the first cut,
          we can ask the user to explicitly validate the view.

          Other issue is regarding the error messages. If the query is equivalent to: select ... from (view-def-select);
          will the error positions be correct ?

          Show
          Namit Jain added a comment - I think we can go with Lenient Dependency Invalidation - which means that the view gets invalid if any DDL happens on the underlying table. Column level dependencies can be ignored as of now. The view depends on persistent objects (other tables and views). We should support: alter view <viewname> compile; The views can be automatically compiled when referenced in a query, but may lead to unexpected results for select '*'. So, in the first cut, we can ask the user to explicitly validate the view. Other issue is regarding the error messages. If the query is equivalent to: select ... from (view-def-select); will the error positions be correct ?
          Hide
          John Sichi added a comment -

          Prasad, regarding partitioned views, I'll talk to you about it when I'm in the office to make sure I understand the cases you're thinking of.

          Show
          John Sichi added a comment - Prasad, regarding partitioned views, I'll talk to you about it when I'm in the office to make sure I understand the cases you're thinking of.
          Hide
          Prasad Chakka added a comment -

          @john, i will be in office on all work days next week and tomorrow as well.

          Show
          Prasad Chakka added a comment - @john, i will be in office on all work days next week and tomorrow as well.
          Hide
          Prasad Chakka added a comment -

          mainly i don't understand how 'show partitions' will work on views, or whether it should work at all. if it is not supported then following issues will make it difficult to use views

          • 'show partitions' is a very useful function for downstream processes that are waiting on availability on a particular partition.
          • Partition keys are also useful metadata information for users browsing views
          • In a 'strict' mode where a predicate on a partition key is required for any partitioned table, a user has to know what column the predicate should be applied which then will be translated to base tables' partition columns.
          Show
          Prasad Chakka added a comment - mainly i don't understand how 'show partitions' will work on views, or whether it should work at all. if it is not supported then following issues will make it difficult to use views 'show partitions' is a very useful function for downstream processes that are waiting on availability on a particular partition. Partition keys are also useful metadata information for users browsing views In a 'strict' mode where a predicate on a partition key is required for any partitioned table, a user has to know what column the predicate should be applied which then will be translated to base tables' partition columns.
          Hide
          Namit Jain added a comment -

          If we are allowing partitioned views, show partitions should work.

          I am not sure but we need some syntax where the view partition columns can be specified at the time the view is being created.

          Show
          Namit Jain added a comment - If we are allowing partitioned views, show partitions should work. I am not sure but we need some syntax where the view partition columns can be specified at the time the view is being created.
          Hide
          John Sichi added a comment -

          Namit, Prasad, I'm starting to understand. I suppose we could try to derive partition columns from the underlying tables (it's a similar problem to deriving view updatability), but that could be difficult to get right in all cases, so allowing the user to specify it explicitly makes sense. Let's discuss on Monday.

          Show
          John Sichi added a comment - Namit, Prasad, I'm starting to understand. I suppose we could try to derive partition columns from the underlying tables (it's a similar problem to deriving view updatability), but that could be difficult to get right in all cases, so allowing the user to specify it explicitly makes sense. Let's discuss on Monday.
          Hide
          John Sichi added a comment -

          Namit, alter view <viewname> compile is a good solution to the select * problem, since assuming we save the original view definition untouched (in addition to the expanded one), we can re-expand the *'s at that time. So a user will only have to recompile a view by name to see the new columns (without having to know its definition).

          Regarding error messages, I'll think about what can be done. In general, AST rewrites make reporting parse positions difficult, but maybe we can do it with respect to the view subtree and provide that SQL in the error message as context.

          Show
          John Sichi added a comment - Namit, alter view <viewname> compile is a good solution to the select * problem, since assuming we save the original view definition untouched (in addition to the expanded one), we can re-expand the *'s at that time. So a user will only have to recompile a view by name to see the new columns (without having to know its definition). Regarding error messages, I'll think about what can be done. In general, AST rewrites make reporting parse positions difficult, but maybe we can do it with respect to the view subtree and provide that SQL in the error message as context.
          Hide
          John Sichi added a comment -

          I have updated the wiki page based on feedback from a design review meeting. Look for "Update 30-Dec-2009" in affected sections.

          Show
          John Sichi added a comment - I have updated the wiki page based on feedback from a design review meeting. Look for "Update 30-Dec-2009" in affected sections.
          Hide
          John Sichi added a comment -

          For view expansion, I've found a parser-based shortcut which I've described in an update to the wiki page under the "Stored View Definition" section. This allows me to avoid adding full-blown AST unparse support.

          AST unparse support is useful, but it adds a large maintenance burden to the parser, so we can defer that until there's a burning need for it.

          Show
          John Sichi added a comment - For view expansion, I've found a parser-based shortcut which I've described in an update to the wiki page under the "Stored View Definition" section. This allows me to avoid adding full-blown AST unparse support. AST unparse support is useful, but it adds a large maintenance burden to the parser, so we can defer that until there's a burning need for it.
          Hide
          John Sichi added a comment -

          For error message debuggability, I'm introducting an ASTNodeOrigin concept; when we expand a view's definition and splice it into a referencing query, we tag each node with information about its origin, so that if we encounter an error during semantic analysis, we can provide full context (similar to the .h inclusion chain in a C++ compiler error). It's a bit verbose, but without it, debugging could be difficult. Here's an example:

          CREATE TABLE table1 (key int, value int);
          CREATE VIEW view1 AS SELECT * FROM table1;
          CREATE VIEW view2 AS SELECT * FROM view1 xxx;
          ALTER TABLE table1 REPLACE COLUMNS (key int);
          SELECT * FROM view2 yyy;
          Error in semantic analysis: line 1:32 Invalid Column Reference `value` in definition of VIEW view1 [
          SELECT `table1`.`key`, `table1`.`value` FROM table1
          ] used as xxx at line 1:39 in definition of VIEW view2 [
          SELECT `xxx`.`key`, `xxx`.`value` FROM view1 xxx
          ] used as yyy at line 2:14

          Show
          John Sichi added a comment - For error message debuggability, I'm introducting an ASTNodeOrigin concept; when we expand a view's definition and splice it into a referencing query, we tag each node with information about its origin, so that if we encounter an error during semantic analysis, we can provide full context (similar to the .h inclusion chain in a C++ compiler error). It's a bit verbose, but without it, debugging could be difficult. Here's an example: CREATE TABLE table1 (key int, value int); CREATE VIEW view1 AS SELECT * FROM table1; CREATE VIEW view2 AS SELECT * FROM view1 xxx; ALTER TABLE table1 REPLACE COLUMNS (key int); SELECT * FROM view2 yyy; Error in semantic analysis: line 1:32 Invalid Column Reference `value` in definition of VIEW view1 [ SELECT `table1`.`key`, `table1`.`value` FROM table1 ] used as xxx at line 1:39 in definition of VIEW view2 [ SELECT `xxx`.`key`, `xxx`.`value` FROM view1 xxx ] used as yyy at line 2:14
          Hide
          John Sichi added a comment -

          Regarding referencing a temporary function from a view: I guess we could just allow it, and leave it up to the user to reregister the function in a new Hive instance before referencing the view. If they forget, they'll get an exception like the one above, at which point they can register the function and try again. Kind of clunky, but the ability to make use of UDF's in views may be valuable.

          Opinions?

          Show
          John Sichi added a comment - Regarding referencing a temporary function from a view: I guess we could just allow it, and leave it up to the user to reregister the function in a new Hive instance before referencing the view. If they forget, they'll get an exception like the one above, at which point they can register the function and try again. Kind of clunky, but the ability to make use of UDF's in views may be valuable. Opinions?
          Hide
          John Sichi added a comment -

          This is a work-in-progress patch to allow for preliminary review; it's not intended for commit.

          Show
          John Sichi added a comment - This is a work-in-progress patch to allow for preliminary review; it's not intended for commit.
          Hide
          Zheng Shao added a comment -

          Regarding referencing a temporary function from a view: I guess we could just allow it, and leave it up to the user to reregister the function in a new Hive instance before referencing the view. If they forget, they'll get an exception like the one above, at which point they can register the function and try again. Kind of clunky, but the ability to make use of UDF's in views may be valuable.

          I am OK with that. We can reiterate that later without paying any development overhead.

          Show
          Zheng Shao added a comment - Regarding referencing a temporary function from a view: I guess we could just allow it, and leave it up to the user to reregister the function in a new Hive instance before referencing the view. If they forget, they'll get an exception like the one above, at which point they can register the function and try again. Kind of clunky, but the ability to make use of UDF's in views may be valuable. I am OK with that. We can reiterate that later without paying any development overhead.
          Hide
          John Sichi added a comment -

          Some notes for reviewers:

          (1) The reason I'm adding quoting for all identifiers (not just * expansion) during unparse has to do with persistence. Suppose today you create a table named units, and you create a view select x from units. Then tomorrow, you get a new version of hive in which units has become a reserved word. When we reparse the stored view definition, we'll get a parse error; to avoid this, we store it as select `x` from `units` instead (regardless of whether you quoted the identifiers in the initial CREATE VIEW statement). The same goes for references to other kinds of objects, so I'm attempting to future-proof it by doing it for functions as well (even though today those aren't persistent); there seem to be some existing inconsistencies with function name unescaping and lowercasing which I'm working through for the real patch submission.

          (2) For regenerating thrift code, I used the archived r760184 version of thrift linked at http://developers.facebook.com/thrift. This gave the minimum number of diffs from the existing generated code when compared with various other thrift versions I could find (but still a few spurious ones). After regeneration, I reverted files unrelated to my change. At some point, we should probably create a separate patch which brings our thrift generated code up to date with more recent versions.

          (3) I ended up putting the view text attributes on the Table class in the metastore, rather than StorageDescriptor as proposed by Prasad. In the future, if we need to track view definitions associated with partitions, adding one more attribute to Partition should do the job.

          (4) I'm not entirely sure what kind of metadata we want to store for view serde, inputFormat, outputFormat, location, since this physical information isn't really relevant until we do view materialization. Take a look at the create_view.q.out DESCRIBE EXTENDED output to see what's currently getting stored, and give me your feedback if you think changes are needed.

          Show
          John Sichi added a comment - Some notes for reviewers: (1) The reason I'm adding quoting for all identifiers (not just * expansion) during unparse has to do with persistence. Suppose today you create a table named units, and you create a view select x from units. Then tomorrow, you get a new version of hive in which units has become a reserved word. When we reparse the stored view definition, we'll get a parse error; to avoid this, we store it as select `x` from `units` instead (regardless of whether you quoted the identifiers in the initial CREATE VIEW statement). The same goes for references to other kinds of objects, so I'm attempting to future-proof it by doing it for functions as well (even though today those aren't persistent); there seem to be some existing inconsistencies with function name unescaping and lowercasing which I'm working through for the real patch submission. (2) For regenerating thrift code, I used the archived r760184 version of thrift linked at http://developers.facebook.com/thrift . This gave the minimum number of diffs from the existing generated code when compared with various other thrift versions I could find (but still a few spurious ones). After regeneration, I reverted files unrelated to my change. At some point, we should probably create a separate patch which brings our thrift generated code up to date with more recent versions. (3) I ended up putting the view text attributes on the Table class in the metastore, rather than StorageDescriptor as proposed by Prasad. In the future, if we need to track view definitions associated with partitions, adding one more attribute to Partition should do the job. (4) I'm not entirely sure what kind of metadata we want to store for view serde, inputFormat, outputFormat, location, since this physical information isn't really relevant until we do view materialization. Take a look at the create_view.q.out DESCRIBE EXTENDED output to see what's currently getting stored, and give me your feedback if you think changes are needed.
          Hide
          Prasad Chakka added a comment -

          regd (3)
          is storagedescritor needed at all for a view?

          Show
          Prasad Chakka added a comment - regd (3) is storagedescritor needed at all for a view?
          Hide
          John Sichi added a comment -

          @Prasad: Since we went with the flat catalog model, a view is just an MTable instance, so it inherits the SD attribute. If it's possible to set this attribute to null, I could do that.

          Show
          John Sichi added a comment - @Prasad: Since we went with the flat catalog model, a view is just an MTable instance, so it inherits the SD attribute. If it's possible to set this attribute to null, I could do that.
          Hide
          Prasad Chakka added a comment -

          I think I set a property in JDO ORM file to make it check for non-null SD. It shouldn't be that difficult to change that. How does a 'describe' command get the columns of a view?

          Show
          Prasad Chakka added a comment - I think I set a property in JDO ORM file to make it check for non-null SD. It shouldn't be that difficult to change that. How does a 'describe' command get the columns of a view?
          Hide
          John Sichi added a comment -

          @Prasad: DESCRIBE gets the columns of a view the same way as for a table (I store them as part of CREATE VIEW), so actually we do need to keep the StorageDescriptor there for holding the column descriptors. It's the SerDeInfo attribute of the StorageDescriptor, along with others mentioned above, that are not relevant for a view.

          Show
          John Sichi added a comment - @Prasad: DESCRIBE gets the columns of a view the same way as for a table (I store them as part of CREATE VIEW), so actually we do need to keep the StorageDescriptor there for holding the column descriptors. It's the SerDeInfo attribute of the StorageDescriptor, along with others mentioned above, that are not relevant for a view.
          Hide
          Prasad Chakka added a comment -

          Then looks like we do have to keep SD so what is the advantage of putting additional columns in Table instead of SD? I just feel that code and schema needs to be duplicated down the road. It is not a big deal but still why?

          I suppose the unnecessary attributes of SD can just be default values for views. Not sure much can be done there.

          Show
          Prasad Chakka added a comment - Then looks like we do have to keep SD so what is the advantage of putting additional columns in Table instead of SD? I just feel that code and schema needs to be duplicated down the road. It is not a big deal but still why? I suppose the unnecessary attributes of SD can just be default values for views. Not sure much can be done there.
          Hide
          John Sichi added a comment -

          Here's my rationale:

          (1) A view is a logical object, so by default, its definition fits naturally at the logical level (table) rather than the physical level (storage descriptor).

          (2) For a view object, we need both original text and expanded text. However, if we end up needing it per partition later for "very smart" materialized views, we'll only need the expanded text (the original text is only needed for user friendly DESCRIBE and for recompiling the view). So we would just add one new attribute on partition (not storage descriptor).

          (3) Most storage descriptors (describing partitions of tables) have nothing to do with views at all. So adding a new attribute there is just deadweight unless we actually get to "very smart" materialized views.

          By "very smart", I mean that we'll be able to avoid rematerializing old partitions when the view definition changes. This is the only reason I can think of why we would ever end up needing a view definition per partition (so that we can compare the old and new view definitions and see if one is derivable from the other). It seems like that is unlikely to happen very soon. Is there some other case where we'd need it?

          Show
          John Sichi added a comment - Here's my rationale: (1) A view is a logical object, so by default, its definition fits naturally at the logical level (table) rather than the physical level (storage descriptor). (2) For a view object, we need both original text and expanded text. However, if we end up needing it per partition later for "very smart" materialized views, we'll only need the expanded text (the original text is only needed for user friendly DESCRIBE and for recompiling the view). So we would just add one new attribute on partition (not storage descriptor). (3) Most storage descriptors (describing partitions of tables) have nothing to do with views at all. So adding a new attribute there is just deadweight unless we actually get to "very smart" materialized views. By "very smart", I mean that we'll be able to avoid rematerializing old partitions when the view definition changes. This is the only reason I can think of why we would ever end up needing a view definition per partition (so that we can compare the old and new view definitions and see if one is derivable from the other). It seems like that is unlikely to happen very soon. Is there some other case where we'd need it?
          Hide
          Prasad Chakka added a comment -

          I thought view definitions are supposed to be frozen when a view is created. Isn't that true of partitions as well? so why would an older partition need to be regenerated automatically when a view definition changes? or is the 'frozen definition' part doesn't apply to materialized views?

          Also the "dead-weight" doesn't matter since null valued columns add very little overhead in mysql.

          But I agree with 1 and 2 so let's keep the way you have them in the patch.

          Show
          Prasad Chakka added a comment - I thought view definitions are supposed to be frozen when a view is created. Isn't that true of partitions as well? so why would an older partition need to be regenerated automatically when a view definition changes? or is the 'frozen definition' part doesn't apply to materialized views? Also the "dead-weight" doesn't matter since null valued columns add very little overhead in mysql. But I agree with 1 and 2 so let's keep the way you have them in the patch.
          Hide
          John Sichi added a comment -

          OK, good.

          To clarify: the column descriptor metadata is what gets frozen when a view is defined, not any data associated with it. When the view (whether virtual or materialized) is redefined, then its column descriptors may change as well. For a materialized view, we would have a choice between discarding its existing partitions (implying they would need to be rematerialized according to the new view definition) versus attempting to preserve them, transforming their contents on the fly in order to match the updated view definition. For example, if the effect of the view redefinition was to delete one of the columns, then we could just project this column away when reading from the old partitions. This is purely theoretical until we have a design for materialized views (including how they support partitioning).

          Show
          John Sichi added a comment - OK, good. To clarify: the column descriptor metadata is what gets frozen when a view is defined, not any data associated with it. When the view (whether virtual or materialized) is redefined, then its column descriptors may change as well. For a materialized view, we would have a choice between discarding its existing partitions (implying they would need to be rematerialized according to the new view definition) versus attempting to preserve them, transforming their contents on the fly in order to match the updated view definition. For example, if the effect of the view redefinition was to delete one of the columns, then we could just project this column away when reading from the old partitions. This is purely theoretical until we have a design for materialized views (including how they support partitioning).
          Hide
          John Sichi added a comment -

          One other point for reviewers:

          If you do CREATE VIEW v(x,y) AS SELECT a+1,b+2 FROM t

          then we'll store the expanded view def as SELECT `_c0` AS `x`,`_c1` AS `y` FROM (SELECT `t`.`a`+1,`t`.`b`+2 FROM `t`) `v`

          As a result, if hive ever changes its convention for generating anonymous expression names (_c0, _c1, ...), such stored view definitions will break.

          The only way I can think of to avoid this is to implement SQL:200n's column-alias-by-position construct , e.g.

          (SELECT `t`.`a`+1,`t`.`b`+2 FROM `t`) AS `v`(`x`,`y`)

          (similar to what was done for LATERAL VIEW).

          Show
          John Sichi added a comment - One other point for reviewers: If you do CREATE VIEW v(x,y) AS SELECT a+1,b+2 FROM t then we'll store the expanded view def as SELECT `_c0` AS `x`,`_c1` AS `y` FROM (SELECT `t`.`a`+1,`t`.`b`+2 FROM `t`) `v` As a result, if hive ever changes its convention for generating anonymous expression names (_c0, _c1, ...), such stored view definitions will break. The only way I can think of to avoid this is to implement SQL:200n's column-alias-by-position construct , e.g. (SELECT `t`.`a`+1,`t`.`b`+2 FROM `t`) AS `v`(`x`,`y`) (similar to what was done for LATERAL VIEW).
          Hide
          John Sichi added a comment -

          Here is the real patch intended for submission.

          Show
          John Sichi added a comment - Here is the real patch intended for submission.
          Hide
          Zheng Shao added a comment -

          We had a group code-review with John and here are the notes.

          1. TODO: File a follow-up JIRA for the behavior for underlying table column type change.
          Currently we do store the column type of the view in the metastore but that's not always true - we use the underlying table's column type when the view is accessed.

          2. DDL unparse: we use antlr TokenRewriteStream to rewrite the original stream for "*" expansion.

          3. Error message. Recursively expanded as the comment above mentioned.

          4. TODO: We are using BLOB type in jdo. We need to test it out with mysql.

          5. TODO: Metadata change: We store "view" definitions in the same metadata table that we store "table" definitions.
          Shall we add a field "table type" so we know whether it's a table, external table, view, or materialized view in the future.
          We should clean up the additional useless fields in "view" - the test output shows that we are storing some garbage information for views.

          6. TODO: Clean up thrift and define the version of thrift that we should use. Check with Raghu and Ning.

          7. We add extra "select" with internal names like "_col0" etc in the view definition.

          8. TODO: We should add a test for using views with UDTF

          9. TODO: Add a test for a large view definition (in terms of the length of the view definition).

          10. isViews() should assert the other field is null

          11. createViewDesc: In DDLTask, there is some code populating derivedSchema from originalSchema. Shall we move that logic to compile time, so that we only need to pass derivedSchema.

          Show
          Zheng Shao added a comment - We had a group code-review with John and here are the notes. 1. TODO: File a follow-up JIRA for the behavior for underlying table column type change. Currently we do store the column type of the view in the metastore but that's not always true - we use the underlying table's column type when the view is accessed. 2. DDL unparse: we use antlr TokenRewriteStream to rewrite the original stream for "*" expansion. 3. Error message. Recursively expanded as the comment above mentioned. 4. TODO: We are using BLOB type in jdo. We need to test it out with mysql. 5. TODO: Metadata change: We store "view" definitions in the same metadata table that we store "table" definitions. Shall we add a field "table type" so we know whether it's a table, external table, view, or materialized view in the future. We should clean up the additional useless fields in "view" - the test output shows that we are storing some garbage information for views. 6. TODO: Clean up thrift and define the version of thrift that we should use. Check with Raghu and Ning. 7. We add extra "select" with internal names like "_col0" etc in the view definition. 8. TODO: We should add a test for using views with UDTF 9. TODO: Add a test for a large view definition (in terms of the length of the view definition). 10. isViews() should assert the other field is null 11. createViewDesc: In DDLTask, there is some code populating derivedSchema from originalSchema. Shall we move that logic to compile time, so that we only need to pass derivedSchema.
          Hide
          Zheng Shao added a comment -

          Another question is metastore database upgrade and backward compatibility.

          How do the user upgrade the metastore database (since we added columns to existing tables)?

          Will the new metastore database be compatible with old metastore code?

          Show
          Zheng Shao added a comment - Another question is metastore database upgrade and backward compatibility. How do the user upgrade the metastore database (since we added columns to existing tables)? Will the new metastore database be compatible with old metastore code?
          Hide
          Prasad Chakka added a comment -

          12. Block alter_table, alter_partition queries in metastore server side code rather than the ql or client side code.

          Zheng, AFAIK backward compatibility (i.e. older code acting against latest schema) should not be a problem. schema upgrade should happen automatically if users haven't changed or modified datanuclues/jpox parameters in hive-default.xml.

          Show
          Prasad Chakka added a comment - 12. Block alter_table, alter_partition queries in metastore server side code rather than the ql or client side code. Zheng, AFAIK backward compatibility (i.e. older code acting against latest schema) should not be a problem. schema upgrade should happen automatically if users haven't changed or modified datanuclues/jpox parameters in hive-default.xml.
          Hide
          Zheng Shao added a comment -

          Good to know we have backward compatibility!

          BTW, John, all these items can be done in a follow-up transaction if needed. Let's first resolve the conflicts with trunk and see if we can get this in before code clean-up and refactoring.

          Show
          Zheng Shao added a comment - Good to know we have backward compatibility! BTW, John, all these items can be done in a follow-up transaction if needed. Let's first resolve the conflicts with trunk and see if we can get this in before code clean-up and refactoring.
          Hide
          John Sichi added a comment -

          @Zheng: I just did a dry run of re-applying the patch on svn head, and there were no code conflicts. However, I wasn't able to re-run tests for the same reason as the current Hudson failures on head, so I guess we need to wait for that to get fixed.

          Show
          John Sichi added a comment - @Zheng: I just did a dry run of re-applying the patch on svn head, and there were no code conflicts. However, I wasn't able to re-run tests for the same reason as the current Hudson failures on head, so I guess we need to wait for that to get fixed.
          Hide
          Zheng Shao added a comment -

          @John, what are the test errors that you are getting? We should fix that as soon as possible.

          Show
          Zheng Shao added a comment - @John, what are the test errors that you are getting? We should fix that as soon as possible.
          Hide
          Zheng Shao added a comment -

          Namit just committed HIVE-1066 which should fix the test errors.
          Can you re-run the tests?

          Show
          Zheng Shao added a comment - Namit just committed HIVE-1066 which should fix the test errors. Can you re-run the tests?
          Hide
          John Sichi added a comment -

          Re-running tests now.

          Show
          John Sichi added a comment - Re-running tests now.
          Hide
          John Sichi added a comment -

          I have a patch which addresses items 8/9/10/11/12. Is it OK if I run that through tests and submit it, then address 1/5/6 in followup JIRA issues? They are easier to work on independent of the main patch.

          Show
          John Sichi added a comment - I have a patch which addresses items 8/9/10/11/12. Is it OK if I run that through tests and submit it, then address 1/5/6 in followup JIRA issues? They are easier to work on independent of the main patch.
          Hide
          Zheng Shao added a comment -

          Good to me. Please upload the new patch.

          Show
          Zheng Shao added a comment - Good to me. Please upload the new patch.
          Hide
          John Sichi added a comment -

          Here's the new patch (.3). Also adds a couple of testcases for exercising EXPLAIN with CREATE VIEW and with SELECT from view.

          Show
          John Sichi added a comment - Here's the new patch (.3). Also adds a couple of testcases for exercising EXPLAIN with CREATE VIEW and with SELECT from view.
          Hide
          Zheng Shao added a comment -

          John, the last point, can you do "4"? We definitely need to test it out with mysql.

          Show
          Zheng Shao added a comment - John, the last point, can you do "4"? We definitely need to test it out with mysql.
          Hide
          John Sichi added a comment -

          Yes, will do.

          Show
          John Sichi added a comment - Yes, will do.
          Hide
          John Sichi added a comment -

          I set up a MySQL metastore configuration, then created a table and view successfully. Browsing the resulting MySQL database, it looks like JDO chose MEDIUMTEXT for the type of the lob attributes. MEDIUMTEXT allows up to 2^24 bytes (16MB), which should be good enough for even very large views.

          Also, all tests are passing when I apply the .3 patch to svn head, so it's ready for commit.

          Show
          John Sichi added a comment - I set up a MySQL metastore configuration, then created a table and view successfully. Browsing the resulting MySQL database, it looks like JDO chose MEDIUMTEXT for the type of the lob attributes. MEDIUMTEXT allows up to 2^24 bytes (16MB), which should be good enough for even very large views. Also, all tests are passing when I apply the .3 patch to svn head, so it's ready for commit.
          Hide
          Zheng Shao added a comment -

          Thanks John! I will test and commit it now.
          Please open an follow-up issue and let's discuss the rest of things there.

          Show
          Zheng Shao added a comment - Thanks John! I will test and commit it now. Please open an follow-up issue and let's discuss the rest of things there.
          Hide
          Zheng Shao added a comment -

          Committed. Thanks John! Please open a follow-up issue and let's get the rest of the TODO items done!

          Show
          Zheng Shao added a comment - Committed. Thanks John! Please open a follow-up issue and let's get the rest of the TODO items done!
          Hide
          Jeff Hammerbacher added a comment -

          Was there a documentation update for this feature? I didn't notice a wiki change go by, but I may have missed it.

          Show
          Jeff Hammerbacher added a comment - Was there a documentation update for this feature? I didn't notice a wiki change go by, but I may have missed it.
          Hide
          Zheng Shao added a comment -

          I guess not yet. A good place to put it is http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL. We should mention it's available from 0.6.

          Show
          Zheng Shao added a comment - I guess not yet. A good place to put it is http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL . We should mention it's available from 0.6.
          Hide
          John Sichi added a comment -

          The wiki page describing the development so far is here:

          http://wiki.apache.org/hadoop/Hive/ViewDev

          I'll go ahead and update the LanguageManual and other relevant pages to describe the support that will be available starting with 0.6.

          Show
          John Sichi added a comment - The wiki page describing the development so far is here: http://wiki.apache.org/hadoop/Hive/ViewDev I'll go ahead and update the LanguageManual and other relevant pages to describe the support that will be available starting with 0.6.
          Hide
          John Sichi added a comment -

          I've created these followup issues and marked them as related:

          Show
          John Sichi added a comment - I've created these followup issues and marked them as related: HIVE-1067 HIVE-1068 HIVE-1069 HIVE-1070 HIVE-1073 HIVE-1074 HIVE-1076 HIVE-1077 HIVE-1078 HIVE-1079
          Hide
          John Sichi added a comment -

          For the wiki, I've updated the LanguageManual pages, and will follow on with updates in the various tutorials.

          Show
          John Sichi added a comment - For the wiki, I've updated the LanguageManual pages, and will follow on with updates in the various tutorials.
          Hide
          John Sichi added a comment -

          Metastore upgrade instructions are here (they assume HIVE-1068 will be accepted):

          http://wiki.apache.org/hadoop/Hive/ViewDev#Metastore_Upgrades

          There may be a more appropriate location for them in the wiki; once we have figured that out, we should link it from the release notes.

          Show
          John Sichi added a comment - Metastore upgrade instructions are here (they assume HIVE-1068 will be accepted): http://wiki.apache.org/hadoop/Hive/ViewDev#Metastore_Upgrades There may be a more appropriate location for them in the wiki; once we have figured that out, we should link it from the release notes.

            People

            • Assignee:
              John Sichi
              Reporter:
              Namit Jain
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development