Phoenix / PHOENIX-1772

Add CUBE/ROLLUP/GROUPING SET operators for advanced aggregations

    Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:

      Description

      I noticed from the Phoenix language documentation (http://phoenix.apache.org/language/index.html) that Phoenix is missing the CUBE, ROLLUP and GROUPING SETS operators, which are already supported by similar projects such as Apache Pig and Apache Hive. Here is a brief overview of my proposal (the syntax proposed below is the same as PostgreSQL's: https://wiki.postgresql.org/wiki/Grouping_Sets).

      Proposed syntax for CUBE:
      SELECT name, place, SUM(count) FROM cars GROUP BY CUBE(name, place);

      For every row that we process we need to emit 2^n combinations, where n is the number of grouping columns listed in the CUBE clause. For the above example query, every row yields 4 rows, one for each level of aggregation:

      {(name, place), (name, *), (*, place), (*, *)}
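
      For reference, GROUP BY CUBE(name, place) can be thought of as shorthand for a UNION ALL of plain GROUP BY queries, one per grouping combination. A minimal sketch of that expansion (NULL stands in for the * positions above):

      -- Hypothetical expansion of GROUP BY CUBE(name, place)
      SELECT name, place, SUM(count) FROM cars GROUP BY name, place
      UNION ALL
      SELECT name, NULL, SUM(count) FROM cars GROUP BY name
      UNION ALL
      SELECT NULL, place, SUM(count) FROM cars GROUP BY place
      UNION ALL
      SELECT NULL, NULL, SUM(count) FROM cars;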

      Proposed syntax for ROLLUP:
      SELECT name, place, SUM(count) FROM cars GROUP BY ROLLUP(name, place);

      For every row that we process we need to emit n+1 combinations, where n is the number of grouping columns listed in the ROLLUP clause. For the above example query, every row yields 3 rows, one for each hierarchical level of aggregation:

      {(name, place), (name, *), (*, *)}
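
      Similarly, GROUP BY ROLLUP(name, place) corresponds to the prefixes of the column list; a minimal sketch of the equivalent expansion:

      -- Hypothetical expansion of GROUP BY ROLLUP(name, place)
      SELECT name, place, SUM(count) FROM cars GROUP BY name, place
      UNION ALL
      SELECT name, NULL, SUM(count) FROM cars GROUP BY name
      UNION ALL
      SELECT NULL, NULL, SUM(count) FROM cars;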

      Proposed syntax for GROUPING SETS:
      SELECT name, place, SUM(count) FROM cars GROUP BY GROUPING SETS(name, ());

      For every row that we process we need to emit n combinations, where n is the number of grouping sets specified. For the above example query, every row yields 2 rows, one for each specified level of aggregation:

      {(name, *), (*, *)}
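
      Each listed grouping set becomes one GROUP BY branch; a minimal sketch of the equivalent expansion for the query above (place is aggregated away in both sets, so it surfaces as NULL):

      -- Hypothetical expansion of GROUP BY GROUPING SETS(name, ())
      SELECT name, NULL AS place, SUM(count) FROM cars GROUP BY name
      UNION ALL
      SELECT NULL, NULL, SUM(count) FROM cars;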

      Attachments

      1. GSoCProposal.pdf
        177 kB
        Jayapriya Surendran

        Activity

        jayapriya90 Jayapriya Surendran added a comment - edited

        Hello everyone,
        I'm Jayapriya Surendran, currently pursuing an MS in Computer Engineering at San Jose State University. I would like to implement this proposed idea in Phoenix as part of Google Summer of Code 2015. I'm really interested in distributed systems and I've learnt the basics of Hadoop and MapReduce from this Udacity course (https://www.udacity.com/course/ud617) offered by Cloudera. I've familiarized myself with Java, JUnit, Maven, IntelliJ and Git while implementing algorithms (https://github.com/jayapriya90/algorithms). I have used these operators as part of a Data Mining and Data Warehousing course and found them really useful for many advanced aggregation use cases. I am familiar with the concepts of these operations, although I am new to the Phoenix codebase. With mentorship from Phoenix committers, I believe I'll be able to complete these features in the GSoC timeframe.

        Thanks and Regards
        Jayapriya Surendran

        jamestaylor James Taylor added a comment -

        Welcome, Jayapriya Surendran. Looking forward to your contributions. Just curious - what's the most common use case for CUBE and ROLLUP?

        jayapriya90 Jayapriya Surendran added a comment -

        The most common use case for CUBE and ROLLUP would be to perform online and offline aggregations. The multi-dimensional pre-aggregated values can be stored in a database/HBase so that business analysts can perform lookup queries. I have illustrated the use cases with examples below.

        Offline aggregation example:
        1) A company's sales information is collected in HDFS/DBMS every day.
        2) At the end of the day, a cube job is scheduled which computes aggregations for the given dimensions and measures. Let's say the dimensions are Country, State, City and the measure is SUM(profit).
        3) Computing the cube is time and space intensive. As the number of dimensions increases, the time and space needed to store the cube increase. So the cube is computed offline using Hive/Pig and the aggregated results are stored in a DBMS/HBase for serving.
        4) Analysts can query the pre-aggregated results using lookup queries like "What was the profit yesterday for USA, CA, San Jose?", "What was the overall profit yesterday for USA, CA?" or "What was the global overall profit yesterday?". All such queries can be answered without any expensive computation; a sketch of such a lookup follows below.
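
        A hypothetical sketch of such a lookup, assuming the pre-aggregated cube is stored in a table sales_cube(country, state, city, total_profit) with NULL marking the aggregated-away dimensions (the table and column names here are illustrative only, not part of the proposal):

        -- "What was the overall profit yesterday for USA, CA?"
        SELECT total_profit
        FROM sales_cube
        WHERE country = 'USA' AND state = 'CA' AND city IS NULL;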

        Online aggregation example:
        1) Let's say we store all the tweets in HBase.
        2) Compute cube/rollup and store the aggregated results back to HBase. The stored cube (aggregated result) can be refreshed or discarded regularly.
        3) This will help in answering some queries like "How many tweets were posted in the last hour in USA, CA, San Jose?", "How many tweets were posted in USA in the past 3 hours about elections?" etc.

        I have seen similar use cases here http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/ and here
        https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics

        prasad Prasad Shivanna added a comment -

        Hi Jayapriya Surendran,

        I am very much interested in working on this project. I went through the proposal posted and through the Jira issue history, but I was not able to find any specific commits related to this. I wanted to know more about the current status of this task.

        Regards,
        Prasad

        jamestaylor James Taylor added a comment -

        Hello Prasad Shivanna - Jayapriya Surendran had proposed this work for the GSoC program last year, but unfortunately missed the deadline for submitting it (so no work has been done on it). I'm not sure what your situation is, but perhaps you could revive this idea by doing the same this year for GSoC (the deadline is fast approaching), or alternatively you could do it on your own. I'm happy to provide technical assistance as needed.

        prasad Prasad Shivanna added a comment -

        Hi James,

        Thanks for the quick response. I have a rough estimate depicting how things can be done, based on Kylin and a few papers published by Google, and I have submitted it as a proposal. I am interested in working on this feature, with or without GSoC. I will start going through the documentation to get hold of the current status, and start working towards setting up the dev environment and a design document.

        Regards,
        Prasad

        prasad Prasad Shivanna added a comment -

        Hi James,
        Can you provide me with some details on this issue so that I can do some reading? Maybe a design document, discussion mails, use cases, or anything else that can help me get started.

        Regards,
        Prasad

        jamestaylor James Taylor added a comment -

        Hi Prasad,
        I think an interesting place to start would be to see how we can integrate with Kylin. Phoenix is aggressively moving to be on top of Calcite, which is also the underpinning of Kylin (see our calcite branch, where almost all querying is already working). Assuming we do the work in the calcite branch, what would an integration look like? I think figuring out our options would be your main effort, with technical guidance from all three projects: Calcite, Kylin, and Phoenix. You'd need to drive it, though. Are you up for it, Prasad Shivanna?
        Regards,
        James

        prasad Prasad Shivanna added a comment -

        Hi James,

        I was also thinking along the same lines. I have some doubts about the ideologies of Kylin and Phoenix, so please correct me if I'm wrong: Phoenix achieves its performance and does most of the work on the servers by converting SQL queries to HBase scans and distributing them to region servers, while Kylin relies on MR jobs to get the job done. Won't this be a conflict at some point if we integrate the two? Also, I'm pretty new to all of these projects, so I need some time to go through the documentation and get hold of them. It might take me some time to get started. Yes, I'm up for it and very much excited about it.

        Regards,
        Prasad


          People

          • Assignee:
            Unassigned
          • Reporter:
            jayapriya90 Jayapriya Surendran
          • Votes:
            0
          • Watchers:
            5
