Pig / PIG-2784

Framework for dynamic query optimization

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: GSOC2014

      Description

      We need a framework for dynamic query optimization, i.e. changing the query plan at runtime. Currently we support estimating the number of reducers dynamically, which works well as a first step but is not perfectly implemented. In the near future, we'll support more dynamic optimizations, such as removing the sample job for order-by (PIG-483), removing the limit job (PIG-2675), and dynamically detecting skew and switching to skew-join.
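For readers new to the topic, size-based reducer estimation can be sketched roughly as below. This is an illustration only, not Pig's actual implementation: the class and method names are hypothetical, and the two constants are assumptions modeled on the `pig.exec.reducers.bytes.per.reducer` and `pig.exec.reducers.max` settings.

```java
/** Illustrative only: a size-based reducer estimate, not Pig's actual implementation. */
final class ReducerEstimateSketch {
    // Assumed defaults, modeled on pig.exec.reducers.bytes.per.reducer
    // and pig.exec.reducers.max; adjust to your cluster.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    /** Returns ceil(totalInputBytes / bytesPerReducer), clamped to [1, maxReducers]. */
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        long bytes = Math.max(0L, totalInputBytes);
        // Ceiling division written so that it cannot overflow.
        long needed = bytes / bytesPerReducer + (bytes % bytesPerReducer == 0 ? 0 : 1);
        needed = Math.max(1L, needed);
        return (int) Math.min(needed, (long) maxReducers);
    }

    public static void main(String[] args) {
        // 2.5 GB of input at 1 GB per reducer needs 3 reducers.
        System.out.println(estimateReducers(2_500_000_000L, BYTES_PER_REDUCER, MAX_REDUCERS)); // prints 3
    }
}
```

The estimate only needs the total input size, which is why it can be computed before any job runs.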

      Currently, reducer estimation is implemented in JobControlCompiler, after MRCompiler has compiled all the MapReduceOperators and generated the complete MRPlan. One place (discussed with Thejas) to implement the framework is in MRCompiler, where the MRPlan would be generated in batches and adjusted dynamically.
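To make the "generate in batches and adjust" idea concrete, here is a minimal sketch under strong assumptions: the plan is a linear chain, each batch is a single job, and the only adjustment is resizing the next job's reducer count from the bytes the previous job actually produced. `JobSketch` and `BatchedPlanRunner` are hypothetical names, not Pig classes, and the `execute` callback stands in for launching a real job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

/** Hypothetical stand-in for one MapReduce job in the plan; not a Pig class. */
final class JobSketch {
    final String name;
    int reducers; // may be adjusted until the job is launched

    JobSketch(String name, int reducers) {
        this.name = name;
        this.reducers = reducers;
    }
}

final class BatchedPlanRunner {
    /**
     * Runs the jobs in order. After each job finishes, the observed output size
     * is used to re-estimate the reducer count of the job that follows it.
     * The execute callback returns the number of bytes the job wrote.
     */
    static List<String> run(List<JobSketch> plan,
                            ToLongFunction<JobSketch> execute,
                            long bytesPerReducer) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < plan.size(); i++) {
            JobSketch job = plan.get(i);
            long outputBytes = Math.max(0L, execute.applyAsLong(job));
            log.add(job.name + " ran with " + job.reducers + " reducer(s)");
            if (i + 1 < plan.size()) {
                // Ceiling division, with a floor of one reducer.
                long needed = outputBytes / bytesPerReducer
                        + (outputBytes % bytesPerReducer == 0 ? 0 : 1);
                plan.get(i + 1).reducers = (int) Math.min(Math.max(1L, needed), Integer.MAX_VALUE);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<JobSketch> plan = new ArrayList<>();
        plan.add(new JobSketch("load", 1));
        plan.add(new JobSketch("order_by", 1));
        // Pretend the first job writes 2.5 GB and the second writes nothing.
        List<String> log = run(plan, j -> j.name.equals("load") ? 2_500_000_000L : 0L, 1_000_000_000L);
        log.forEach(System.out::println);
        // prints:
        // load ran with 1 reducer(s)
        // order_by ran with 3 reducer(s)
    }
}
```

The point of the sketch is the ordering: a downstream job is finalized only after its upstream job has produced real statistics, which is not possible when the complete plan is compiled up front.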

      Any comments?

      This is a candidate project for Google Summer of Code 2014. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2014

        Issue Links

          Activity

          Aniket Mokashi made changes -
          Assignee Aniket Mokashi [ aniket486 ]
          camelia_c added a comment -

          Hello,

          My name is Camelia. I'm a PhD student and I'm interested in working on this project. I have just uploaded my project proposal to Google Melange, and I hope it is a good starting point for my future work on this contribution.

          Best regards,
          Camelia

          Aniket Mokashi added a comment (edited) -

          Rajitha, Zhiwei Cai, thanks for your interest in this project.

          To proceed, you need to submit your proposal with details on your approach, plan, etc. If you would like to clarify something, please use this JIRA issue as the place for discussion.

          > Do we know the size of the data to process before Pig compiles the job?

          Yes. This is what lets Pig do reducer estimation.

          > What's the difference between implementing this framework inside JobControlCompiler and inside MRCompiler? Which one do you think is better?

          MRCompiler compiles the physical plan into MapReduce operators. JobControlCompiler takes these compiled jobs and submits them to run on Hadoop via Hadoop's JobControl API; it is also responsible for maintaining the progress report, stats, etc. As part of this JIRA issue, you need to work out how we can support any (or all) of these optimizations and find the best place to plug them in. I look forward to seeing your thoughts on how it should work.

          > Do I need to consider more kinds of optimization than those mentioned in the description? Is it possible to categorize the optimizations into several types to make them easier to extend in the future?

          It would be nice if we could allow new optimizations to be added in the future.
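One way to picture "categorized, extensible optimizations" is a small rule registry, sketched below. Everything here is hypothetical: the category names, the `RuntimeStats` fields, and the thresholds are illustrative assumptions, and the rules only report whether they would apply rather than rewriting a plan.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical grouping of the optimizations named in the description. */
enum RuleCategory { PARALLELISM, JOB_ELIMINATION, JOIN_STRATEGY }

/** Hypothetical runtime facts a rule may inspect. */
final class RuntimeStats {
    final long inputBytes;
    final double largestKeyFraction; // share of records held by the most frequent key

    RuntimeStats(long inputBytes, double largestKeyFraction) {
        this.inputBytes = inputBytes;
        this.largestKeyFraction = largestKeyFraction;
    }
}

interface DynamicRule {
    RuleCategory category();
    String name();
    boolean applies(RuntimeStats stats);
}

/** A rule defined by a name, a category, and a predicate over the stats. */
final class SimpleRule implements DynamicRule {
    private final RuleCategory category;
    private final String name;
    private final Predicate<RuntimeStats> condition;

    SimpleRule(RuleCategory category, String name, Predicate<RuntimeStats> condition) {
        this.category = category;
        this.name = name;
        this.condition = condition;
    }

    public RuleCategory category() { return category; }
    public String name() { return name; }
    public boolean applies(RuntimeStats stats) { return condition.test(stats); }
}

final class RuleRegistry {
    // EnumMap iterates in enum declaration order, so categories are evaluated deterministically.
    private final Map<RuleCategory, List<DynamicRule>> rules = new EnumMap<>(RuleCategory.class);

    void register(DynamicRule rule) {
        rules.computeIfAbsent(rule.category(), c -> new ArrayList<>()).add(rule);
    }

    /** Returns "CATEGORY:name" for every registered rule whose condition holds. */
    List<String> applicable(RuntimeStats stats) {
        List<String> names = new ArrayList<>();
        for (List<DynamicRule> group : rules.values()) {
            for (DynamicRule rule : group) {
                if (rule.applies(stats)) {
                    names.add(rule.category() + ":" + rule.name());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        RuleRegistry registry = new RuleRegistry();
        // Thresholds below are made up for the example.
        registry.register(new SimpleRule(RuleCategory.JOIN_STRATEGY, "use-skew-join",
                s -> s.largestKeyFraction > 0.2));
        registry.register(new SimpleRule(RuleCategory.JOB_ELIMINATION, "skip-order-by-sample-job",
                s -> s.inputBytes <= 1_000_000_000L));
        System.out.println(registry.applicable(new RuntimeStats(500_000_000L, 0.35)));
        // prints [JOB_ELIMINATION:skip-order-by-sample-job, JOIN_STRATEGY:use-skew-join]
    }
}
```

With this shape, adding a new optimization means registering one more rule, and a new category means adding one enum constant.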

          Zhiwei Cai added a comment -

          Hi,

          My name is Zhiwei Cai and I'm writing a proposal for this project for GSoC 2014. I have a few questions about this idea and would be grateful for any guidance.
          1. Do we know the size of the data to process before Pig compiles the job?
          2. What's the difference between implementing this framework inside JobControlCompiler and inside MRCompiler? Which one do you think is better?
          3. Do I need to consider more kinds of optimization than those mentioned in the description? Is it possible to categorize the optimizations into several types to make them easier to extend in the future?

          Best,
          Zhiwei

          Rajitha added a comment -

          I am very interested in contributing to this project in GSoC 2014.

          I'm Rajitha Ranasinghe, a third-year undergraduate at the University of Moratuwa, Sri Lanka. I'm currently working as a software engineering trainee at an open source company, where I am involved in several open source projects, so I have good practical knowledge of and experience in open source project development.

          I'm currently going through the learning materials you have provided for the project to gain background knowledge. I would like to discuss this project. How can I proceed?

          Aniket Mokashi made changes -
          Assignee Aniket Mokashi [ aniket486 ]
          Daniel Dai made changes -
          Description: appended the Google Summer of Code 2014 candidate note. The rest of the text is unchanged, with "removing sample job for order-by" linked to https://issues.apache.org/jira/browse/PIG-483 and "removing limit job" linked to https://issues.apache.org/jira/browse/PIG-2675.
          Aniket Mokashi made changes -
          Labels GSOC2014
          Aniket Mokashi made changes -
          Assignee Aniket Mokashi [ aniket486 ]
          Jie Li made changes -
          Link This issue relates to PIG-2779 [ PIG-2779 ]
          Jie Li made changes -
          Link This issue is related to PIG-2772 [ PIG-2772 ]
          Jie Li made changes -
          Link This issue is related to PIG-2675 [ PIG-2675 ]
          Jie Li made changes -
          Link This issue is related to PIG-483 [ PIG-483 ]
          Jie Li created issue -

            People

            • Assignee: Aniket Mokashi
            • Reporter: Jie Li
            • Votes: 0
            • Watchers: 11
