Pig

Add types and rework execution pipeline

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: impl
    • Labels: None

      Description

      This is the tracking bug for the work to add types to pig and rework the execution pipeline. Individual components of this work are covered in subtasks.

      Functional and design specs for this work are:
      http://wiki.apache.org/pig/PigTypesFunctionalSpec
      http://wiki.apache.org/pig/PigTypesDesign
      http://wiki.apache.org/pig/PigExecutionModel

      This work is being done on the types branch, since it is large and disruptive, and we want to be able to do incremental checkins without causing issues for the trunk.

      1. Core.patch.zip
        33 kB
        Shravan Matthur Narayanamurthy
      2. incr1.zip
        13 kB
        Shravan Matthur Narayanamurthy
      3. exceptions.patch
        70 kB
        Alan Gates

          Activity

          Alan Gates added a comment -

          The types branch has been moved to trunk.

          Alan Gates added a comment -

          With my latest checkin (revision 658485) we are getting close to tying all the pieces together. I know of the following outstanding issues:

          1. The HExecutionEngine and LocalExecutionEngine compile() methods need to plug into Shubham's logical-to-physical translator. Shubham has submitted a patch with the translator but was still seeing some unit test failures, so I have not tied it in yet.
          2. explain needs to be fleshed out in HExecutionEngine and LocalExecutionEngine. To do this, POPrinter will have to be converted to the new physical operators and MRPrinter will have to be written (it's just a shell right now).
          3. FileLocalizer.openDFSFile() needs access to a JobConf in order to open the file (see the sketch after this list). Previously it went through PigRecordReader, but this was a hack. Shravan is looking at how this should be done.
          4. Currently logical plans returned by registerQuery aren't being stitched together. Santhosh is looking at how to resolve this. Once that is resolved, the resulting plan needs to be passed to the validators and then on to the appropriate execution engine.
          5. PigServer.totalHadoopTimeSpent doesn't work yet. Shravan changed Launcher so that it tracks time spent, but PigServer does not have a reference to the launcher. We need to figure out how to get it a reference to the launcher object, or maybe that value should be static; not sure.
          6. The two final tests in TestBuiltin are failing. They are commented out for now, but it looks to me like they should pass. They fail in the parser. Santhosh, could you take a look at this?
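
          For point 3, a minimal sketch of the shape this could take, assuming openDFSFile() is handed the JobConf directly (the signature below is hypothetical, not the committed FileLocalizer API):

            import java.io.IOException;
            import java.io.InputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.mapred.JobConf;

            public class DfsOpenSketch {
                // Hypothetical replacement for routing through PigRecordReader:
                // the JobConf tells FileSystem which filesystem (HDFS or local)
                // the path resolves to, which is why openDFSFile() needs it.
                public static InputStream openDFSFile(String fileName, JobConf conf)
                        throws IOException {
                    Path path = new Path(fileName);
                    FileSystem fs = path.getFileSystem(conf); // resolves the scheme
                    return fs.open(path);                     // FSDataInputStream
                }
            }
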
          Olga Natkovich added a comment -

          Sounds good, then. It would be nice to add links to the page about the LocalJobRunner or a summary of how it works.

          Alan Gates added a comment -

          In response to Olga's comment on LocalJobRunner:

          Using the LocalJobRunner to execute our jobs locally would not require the user to set up hadoop, or anything else. We won't be connecting to an outside hadoop instance. All we're proposing to do is make use of a class that hadoop already has to manage our execution. This is no different from making use of log4j to do logging. We just have to embed the appropriate jar in our jar.
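
          A minimal sketch of how the LocalJobRunner is engaged (a self-contained hadoop job, not Pig code; the class name and paths are illustrative):

            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.mapred.FileInputFormat;
            import org.apache.hadoop.mapred.FileOutputFormat;
            import org.apache.hadoop.mapred.JobClient;
            import org.apache.hadoop.mapred.JobConf;

            public class LocalRunnerSketch {
                public static void main(String[] args) throws Exception {
                    JobConf conf = new JobConf(LocalRunnerSketch.class);
                    conf.set("mapred.job.tracker", "local"); // engage LocalJobRunner
                    conf.set("fs.default.name", "file:///"); // local fs, no HDFS
                    conf.setJobName("local-sketch");
                    FileInputFormat.setInputPaths(conf, new Path(args[0]));
                    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
                    // With no mapper/reducer set, hadoop's identity defaults apply.
                    JobClient.runJob(conf); // runs entirely inside this JVM
                }
            }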

          Olga Natkovich added a comment -

          This comment is for LocalJobRunner proposal.

          One good thing about an independent local mode is that running pig in this mode does not require hadoop at all, which makes it ideal for trying pig out and for building tutorials on top of it. It is also good because it allows us to maintain and test the backend independence of Pig.

          Shravan Matthur Narayanamurthy added a comment -

          Thanks for the comments, Pi.

          1) First concern is that using Hadoop Local will tie us to Hadoop too much.
          There was an initiative quite a while ago to start looking at different backends other than Hadoop (e.g. we might be running a backend like SETI@home. Who knows?).

          However, this whole thing seems to have been built solely for Hadoop anyway. Not sure about the current direction.
          [shrav] I don't think this ties us down to Hadoop in the sense that we can't have other backends. We just reuse some hadoop code, that's all. The only tie to hadoop I see is that at most we would need to supply the hadoop jar with pig, which we already do.

          2) Have you tried to measure LocalHadoop startup time compared to the local engine? If the LocalHadoop takes much more time to startup, we might suffer when processing nested queries.
          [shrav] LocalHadoop has a startup time of about 6 seconds. But if we are processing even 10 MB of data, LocalHadoop mysteriously beats the local engine hands down. For the local engine I presumed that it would just take the leaf operator, which will be a POStore, and call its store() method.
          For about 12 MB of data, LocalHadoop took about 11 seconds whereas the local engine took about 15 seconds.

          As far as the nested plan in foreach goes, at least currently we won't be creating an instance of a local engine to run the nested plan. All operators that can currently be used inside the nested plan have been implemented such that the generic plan execution model, with attachInputs called on the inner plan, works fine. However, if we decide to allow all operators inside the nested plan, then we will have to change the MRCompiler so that the nested foreach becomes a blocking operator and is handled separately by spawning new MR jobs to process the inner plan. In that case, invoking LocalHadoop would probably not make sense; the executable operator plan is a better option here, as it would also mean no changes to the MRCompiler.
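
          To make that model concrete, here is an illustrative sketch of attach-and-pull execution for an inner plan (class and method names are simplified stand-ins, not the actual physical operator interfaces):

            import java.util.Arrays;
            import java.util.List;

            abstract class InnerOp {
                protected List<Object> input; // tuple attached by the outer foreach

                void attachInput(List<Object> tuple) { input = tuple; }

                // Pull model: the outer operator runs the inner plan simply by
                // calling getNext() on the plan's leaf operator.
                abstract Object getNext();
            }

            class InnerProject extends InnerOp {
                private final int col;
                InnerProject(int col) { this.col = col; }
                Object getNext() { return input.get(col); }
            }

            class NestedPlanDemo {
                public static void main(String[] args) {
                    InnerProject leaf = new InnerProject(1);
                    leaf.attachInput(Arrays.<Object>asList("a", 42)); // outer tuple pushed in
                    System.out.println(leaf.getNext());               // prints 42
                }
            }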

          So, at least now, LocalJobRunner will not be invoked inside the MapReduce execution for executing nested plans. The LocalJobRunner will be strictly used only when the user is in local execution mode.

          I will update the wiki with these comments.
          Thanks for the inputs, Pi. I had not thought about the nested foreach when it grows full blown.

          Pi Song added a comment -

          Good to see people start writing ideas to the Wiki!!

          1) First concern is that using Hadoop Local will tie us to Hadoop too much.
          There was an initiative quite a while ago to start looking at different backends other than Hadoop (e.g. we might be running a backend like SETI@home. Who knows?).

          However, this whole thing seems to have been built solely for Hadoop anyway. Not sure about the current direction.

          2) Have you tried to measure LocalHadoop startup time compared to the local engine? If the LocalHadoop takes much more time to startup, we might suffer when processing nested queries.

          Shravan Matthur Narayanamurthy added a comment -

          Created a wiki page on using Hadoop's LocalJobRunner for the local execution mode:
          http://wiki.apache.org/pig/LocalJobRunner
          Please take a look.

          Pi Song added a comment -

          +1
          Those IOException things have been irritating my eyes for quite a while.

          Alan Gates added a comment -

          I have attached a patch that contains a rework of the exceptions. I'm trying to rework the exception handling according to the pig developer cookbook (http://wiki.apache.org/pig/PigDeveloperCookbook). In this patch I've only reworked areas that we've rewritten; I have not reworked areas we haven't touched (yet). We can adapt exceptions in other areas as we rework them. I have also not altered any interfaces, so there are a number of places where I end up converting exception types (e.g. in the builtin functions' exec methods).

          Major changes I made (a sketch of the resulting hierarchy follows the list):

          • Changed LogicalOperator.getSchema to throw FrontendException instead of IOException
          • Added PlanException to org.apache.pig.impl.plan package and changed OperatorPlan.connect to throw PlanException instead of IOException. PlanException extends FrontendException. Also made VisitorException extend PlanException.
          • Changed DataType conversion routines to throw ExecException instead of IOException
          • Changed Tuple.get, Tuple.set, and a few other Tuple methods to throw ExecException instead of IOException.
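
          A rough sketch of the hierarchy and signatures this produces; only the extends relationships and the changed throws clauses come from the list above, while the root superclasses are assumed for illustration:

            class FrontendException extends Exception {}      // root assumed
            class PlanException extends FrontendException {}  // org.apache.pig.impl.plan
            class VisitorException extends PlanException {}
            class ExecException extends Exception {}          // root assumed

            // Changed signatures, schematically:
            //   LogicalOperator:  Schema getSchema() throws FrontendException
            //   OperatorPlan:     void connect(...) throws PlanException
            //   DataType:         conversion routines throw ExecException
            //   Tuple:            Object get(int i) throws ExecException
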
          Alan Gates added a comment -

          I'll attach comments on Shravan's incr1.zip patch to issue 161, as his patch deals primarily with reworking the physical plan.

          Shravan Matthur Narayanamurthy added a comment - - edited

          incr1.zip:
          Incorporates Alan's comments but is a smaller patch. Contains the abstract classes for the physical rework and some physical operators. Also includes test cases for these.

          Shravan Matthur Narayanamurthy added a comment -

          Thanks for reviewing the patch Alan.

          1) I will modify the tests and create unit tests. However, I'll only be able to submit a smaller patch then.

          2) I will change that.

          3) Sure I will put the TODO FIX.

          So please go ahead and take a look at PlanVisitor.java. That is the only class in the plan pkg that has changed. But do not commit this. I will submit a new patch with the suggested changes. If you are ok with PlanVisitor, then you can commit this patch instead.

          Alan Gates added a comment -

          This patch is big, with several changes in it. When possible, let's try to break up patches so that they're easier to review and apply.

          I've broken out the changes in the data directory and the builtin directory and applied those, along with a few changes of my own. So at this point the whole data directory builds, as does the whole pig/builtin directory (but not pig/impl/builtin; why do we have two directories named builtin?). I've added those directories to the build.xml file. I've also changed build.xml to only run the unit tests we expect to pass. At the moment that's TestOperatorPlan and TestBuiltin.

          You've also included changes to Operator plan in this patch. I'll take a look at that tomorrow and see if that can be committed.

          As far as the changes for physical operator, I have a few questions/comments:

          1) The places where you put tests directly in the code should be converted to junit tests before we commit the patch. This will make it easy for other developers to run the tests and ensure that they haven't broken your code as part of their changes. When you add these tests, please be sure to add them to build.xml as part of the tests we expect to pass. (A minimal skeleton is sketched after these comments.)

          2) Why did you move physical operators from the directory "physicalLayer" to "physical"? Unless there's a compelling reason, we should keep the same directory structure.

          3) I have no problem with commenting out parts of the code so that we can compile now. Whenever we do that, let's put a tag there that is easy to find with a global search. I've been using TODO FIX. If you can change the code in your patch to include this that will help us find all those locations later.
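
          A minimal skeleton of the kind of conversion point 1 asks for (JUnit 3 style; the class name and assertion are hypothetical, not actual Pig code):

            import java.util.Arrays;
            import junit.framework.TestCase;

            public class TestPhysicalOperatorSketch extends TestCase {
                // Inline main()-style checks become test methods so that
                // "ant test" can run them and catch regressions.
                public void testProjectSecondColumn() {
                    Object projected = Arrays.asList("a", 42).get(1); // stand-in
                    assertEquals(42, projected);
                }
            }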

          Shravan Matthur Narayanamurthy added a comment - - edited

          This is a core sort of patch. I have fixed compile errors so that all the code I have written to date compiles under the current patch. Obviously, there were a lot of dependencies, and I had to resolve them by crude means, commenting out irrelevant but non-compliant functions. I am hoping that as soon as the logical stuff is done, we will have access to all the data classes at the least. The things that are missing/commented out are SortedBag and DistinctBag, as they still used evalSpecs. The dependencies on PigContext and the backend portion have mostly been taken care of. I would like the types branch to be updated with this patch asap.

          Also attached is a small shell script that I wrote to help me create patches. It's not in perfect shape, as it doesn't take all corner cases into consideration, but if used as suggested I feel it should be helpful. Please take a look and feel free to comment.


            People

            • Assignee: Pradeep Kamath
            • Reporter: Alan Gates
            • Votes: 0
            • Watchers: 1
