Hive
  1. Hive
  2. HIVE-1180

Support Common Table Expressions (CTEs) in Hive

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      I've seen some presentations from the PostgreSQL recently expounding the utility of CTEs (http://en.wikipedia.org/wiki/Common_table_expressions). Should we try to support these in Hive? I've never used them in practice, so curious to hear if the community would find them useful.

      1. HIVE-1180.6.patch
        31 kB
        Harish Butani
      2. HIVE-1180.3.patch
        28 kB
        Harish Butani
      3. HIVE-1180.1.patch
        18 kB
        Harish Butani

        Issue Links

          Activity

          Hide
          Lefty Leverenz added a comment -

          For good measure, I mentioned CTEs in these sections, with links to the SELECT syntax and CTE doc:

          I also added a version note to the SELECT syntax:

          Show
          Lefty Leverenz added a comment - For good measure, I mentioned CTEs in these sections, with links to the SELECT syntax and CTE doc: DML: Inserting data into Hive Tables from queries DDL: CTAS DDL: Create View I also added a version note to the SELECT syntax: Select Syntax
          Hide
          Harish Butani added a comment -

          Looks good, thanks for editing.
          Changed COMMA to ","

          Show
          Harish Butani added a comment - Looks good, thanks for editing. Changed COMMA to ","
          Hide
          Lefty Leverenz added a comment -

          Apologies for the delay, Harish Butani. I made these changes (please revert anything you disagree with):

          I also recommend changing COMMA to "," in the syntax – just because I recently got confused by DOT for "." in the DDL wiki – but since the examples show actual commas there shouldn't be any confusion here.

          Show
          Lefty Leverenz added a comment - Apologies for the delay, Harish Butani . I made these changes (please revert anything you disagree with): changes to Common Table Expression current CTE doc I also recommend changing COMMA to "," in the syntax – just because I recently got confused by DOT for "." in the DDL wiki – but since the examples show actual commas there shouldn't be any confusion here.
          Show
          Harish Butani added a comment - Can you review: https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression Also added CTE to Select Page: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
          Hide
          Lefty Leverenz added a comment -

          Ordinarily I'd say you can write something up in a text file and I'll format it for the wiki, but the more the merrier so welcome to wiki-land.

          Show
          Lefty Leverenz added a comment - Ordinarily I'd say you can write something up in a text file and I'll format it for the wiki, but the more the merrier so welcome to wiki-land.
          Hide
          Harish Butani added a comment -

          Yes only for Select.
          Yes a section like Group By.
          This is embarrassing, I don't have wiki access; never gotten around to getting edit privileges.
          Once I get it, I will add a section, that you can review.

          Show
          Harish Butani added a comment - Yes only for Select. Yes a section like Group By. This is embarrassing, I don't have wiki access; never gotten around to getting edit privileges. Once I get it, I will add a section, that you can review.
          Hide
          Lefty Leverenz added a comment -

          This needs to be documented in the wiki. Is it only for SELECT statements?

          Wikipedia says it's for SELECT, INSERT, UPDATE (not an issue here), or DELETE. The PostgreSQL doc just says SELECT.

          If it's just SELECT, it can be a new section in the Select doc or a separate section linked from Select like the Group By, Sort/Order/Cluster/Distribute, and Join docs.

          Quick ref:

          Show
          Lefty Leverenz added a comment - This needs to be documented in the wiki. Is it only for SELECT statements? Wikipedia says it's for SELECT, INSERT, UPDATE (not an issue here), or DELETE. The PostgreSQL doc just says SELECT. If it's just SELECT, it can be a new section in the Select doc or a separate section linked from Select like the Group By, Sort/Order/Cluster/Distribute, and Join docs. Quick ref: Language Manual: Select Wikipedia: Common table expression PostgreSQL CTE readme
          Hide
          Harish Butani added a comment -

          thanks for the review Gunther

          Show
          Harish Butani added a comment - thanks for the review Gunther
          Hide
          Harish Butani added a comment -

          I checked that the failing Minimr tests pass locally.

          Show
          Harish Butani added a comment - I checked that the failing Minimr tests pass locally.
          Hide
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12627055/HIVE-1180.6.patch

          ERROR: -1 due to 2 failed/errored test(s), 5014 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_merge
          

          Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1197/testReport
          Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1197/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12627055

          Show
          Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12627055/HIVE-1180.6.patch ERROR: -1 due to 2 failed/errored test(s), 5014 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_merge Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1197/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1197/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 2 tests failed This message is automatically generated. ATTACHMENT ID: 12627055
          Hide
          Gunther Hagleitner added a comment -

          +1 (assuming tests on .6 pass)

          Show
          Gunther Hagleitner added a comment - +1 (assuming tests on .6 pass)
          Hide
          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12626819/HIVE-1180.3.patch

          ERROR: -1 due to 2 failed/errored test(s), 5001 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_parallel_orderby
          org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_invalid_priv_v1
          

          Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1178/testReport
          Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1178/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests exited with: TestsFailedException: 2 tests failed
          

          This message is automatically generated.

          ATTACHMENT ID: 12626819

          Show
          Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12626819/HIVE-1180.3.patch ERROR: -1 due to 2 failed/errored test(s), 5001 tests executed Failed tests: org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_parallel_orderby org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_invalid_priv_v1 Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1178/testReport Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1178/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 2 tests failed This message is automatically generated. ATTACHMENT ID: 12626819
          Hide
          Gunther Hagleitner added a comment -

          Looks good! Some comments on rb.

          Show
          Gunther Hagleitner added a comment - Looks good! Some comments on rb.
          Hide
          Harish Butani added a comment -
          Show
          Harish Butani added a comment - Review at https://reviews.apache.org/r/17691/
          Hide
          Harish Butani added a comment -

          Thanks Gunther.

          • I will add the tests you suggest
          • yes multiple ctes are allowed, can refer to each other as long there are no cycles. There is a test for this.
          • yes i agree CTE should be allowed with CTAS/create table, views. I thought by adding CTE support to queryStatExpr this should happen. Will look into this.
          • Will file a follow up jira to support sharing a plan across multiple references to CTEs/views.
          Show
          Harish Butani added a comment - Thanks Gunther. I will add the tests you suggest yes multiple ctes are allowed, can refer to each other as long there are no cycles. There is a test for this. yes i agree CTE should be allowed with CTAS/create table, views. I thought by adding CTE support to queryStatExpr this should happen. Will look into this. Will file a follow up jira to support sharing a plan across multiple references to CTEs/views.
          Hide
          Gunther Hagleitner added a comment -

          Also, I think 'recursive' would be really neat: http://explainextended.com/2013/12/31/happy-new-year-5/

          Show
          Gunther Hagleitner added a comment - Also, I think 'recursive' would be really neat: http://explainextended.com/2013/12/31/happy-new-year-5/
          Hide
          Gunther Hagleitner added a comment -

          I think it's reasonable to do non-recursive, non-materialized CTEs in a first step. It still gives you some of the benefits of CTEs (temporary "view" without metastore, multiple refs to the same subquery, simpler to write queries.) It will, however, still be more efficient in some cases to create/drop table to avoid re-computation. We should probably document and create follow up jira right away.

          I've gone through the initial patch and it looks good so far. Some questions/suggestions for test cases to add:

          • CTAS/create table like: Should also be supported, right? Currently you won't be able to specify that.
          • Negative test: Multiple with statements in the same query
          • What about multiple CTEs, can they referencing one another? If not, is this caught?
          • Insert statement: From the grammar it looks like you might be able to specify the CTE as the target table also (negative test), otherwise we should probably test that you can use it as a source.
          • Order by, sort by: These don't really make sense in a CTE - should they be disallowed?
          • Top level union with CTE is probably worth a test.
          • CTE in view definition: Should be allowed too, right? Currently I don't think it is.
          Show
          Gunther Hagleitner added a comment - I think it's reasonable to do non-recursive, non-materialized CTEs in a first step. It still gives you some of the benefits of CTEs (temporary "view" without metastore, multiple refs to the same subquery, simpler to write queries.) It will, however, still be more efficient in some cases to create/drop table to avoid re-computation. We should probably document and create follow up jira right away. I've gone through the initial patch and it looks good so far. Some questions/suggestions for test cases to add: CTAS/create table like: Should also be supported, right? Currently you won't be able to specify that. Negative test: Multiple with statements in the same query What about multiple CTEs, can they referencing one another? If not, is this caught? Insert statement: From the grammar it looks like you might be able to specify the CTE as the target table also (negative test), otherwise we should probably test that you can use it as a source. Order by, sort by: These don't really make sense in a CTE - should they be disallowed? Top level union with CTE is probably worth a test. CTE in view definition: Should be allowed too, right? Currently I don't think it is.
          Hide
          Harish Butani added a comment -

          As a first step have converted references to a CTE in the Query into a SubQuery invocation.
          This doesn't address the issue of materializing a CTE, which can then be used in multiple parts of the Query.
          But this issue exists for views today. For e.g. the following query:

          create view v1 as
          select key, min(value) as value 
          from src
          group by key
          ;
          explain
          select a.key, b.key
          from v1 a join v1 b on a.key = b.key;
          

          Does the Group by on src twice.

          Can we address the issue of materialization in a subsequent jira?

          Have attached a preliminary patch.

          Show
          Harish Butani added a comment - As a first step have converted references to a CTE in the Query into a SubQuery invocation. This doesn't address the issue of materializing a CTE, which can then be used in multiple parts of the Query. But this issue exists for views today. For e.g. the following query: create view v1 as select key, min(value) as value from src group by key ; explain select a.key, b.key from v1 a join v1 b on a.key = b.key; Does the Group by on src twice. Can we address the issue of materialization in a subsequent jira? Have attached a preliminary patch.
          Hide
          Ning Zhang added a comment -

          The session-level temp table has not been implemented yet or in the near-future roadmap, but with the metastore's extension to keep the type of an object (whether it is a table or a view), it is relatively straightforward to add a new "temp table" type. The only thing is to change the DDL to support 'create temporary table ...), and drop the temp table once a session ended.

          Agreed with your point about the recursive queries. It is easy for the user to write recursive queries, but hard for execution and bad for resource management.

          Show
          Ning Zhang added a comment - The session-level temp table has not been implemented yet or in the near-future roadmap, but with the metastore's extension to keep the type of an object (whether it is a table or a view), it is relatively straightforward to add a new "temp table" type. The only thing is to change the DDL to support 'create temporary table ...), and drop the temp table once a session ended. Agreed with your point about the recursive queries. It is easy for the user to write recursive queries, but hard for execution and bad for resource management.
          Hide
          Edward Capriolo added a comment -

          (2) is a good addition to Hive, given that we have views now and need to add the support of session-level temp tables.
          Question, how are we going to support session level temp tables Files/tables in HDFS are going to be tied into SessionState or a session ID in someway?

          (1) From my experience map/reduce does not lend itself well to recursive work. Currently if each recursion was a map/reduce job that would be a major strain on hadoop. (JobHistory would have explosive growth from a few queries) Maybe there is a simple way around this.

          However it is possible/practical recursive processing does open up interesting queries on self joined tables.
          id name parent

          1 adam null
          2 caan 1
          3 able 1
          

          Select "ALL DESCENDANTS OF ADAM"

          Show
          Edward Capriolo added a comment - (2) is a good addition to Hive, given that we have views now and need to add the support of session-level temp tables. Question, how are we going to support session level temp tables Files/tables in HDFS are going to be tied into SessionState or a session ID in someway? (1) From my experience map/reduce does not lend itself well to recursive work. Currently if each recursion was a map/reduce job that would be a major strain on hadoop. (JobHistory would have explosive growth from a few queries) Maybe there is a simple way around this. However it is possible/practical recursive processing does open up interesting queries on self joined tables. id name parent 1 adam null 2 caan 1 3 able 1 Select "ALL DESCENDANTS OF ADAM"
          Hide
          Ning Zhang added a comment -

          It seems that he benefits of CTE are 1) its support of recursion and 2) query-level temp table materialization (if the optimizer chooses so).

          I think (2) is a good addition to Hive, given that we have views now and need to add the support of session-level temp tables.

          (1) is a major extension and it may be unsafe – the query may never terminate (well UDFs are undeciable, but HiveQL itself is currently). I was wondering how many customers really need recursions given that we have UDF support.

          Show
          Ning Zhang added a comment - It seems that he benefits of CTE are 1) its support of recursion and 2) query-level temp table materialization (if the optimizer chooses so). I think (2) is a good addition to Hive, given that we have views now and need to add the support of session-level temp tables. (1) is a major extension and it may be unsafe – the query may never terminate (well UDFs are undeciable, but HiveQL itself is currently). I was wondering how many customers really need recursions given that we have UDF support.
          Hide
          Carl Steinbach added a comment -
          Show
          Carl Steinbach added a comment - PostgreSQL's CTE Readme

            People

            • Assignee:
              Harish Butani
              Reporter:
              Jeff Hammerbacher
            • Votes:
              6 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development