Thanks for the comments, Pi.
1) First concern is that using Hadoop Local will tie us to Hadoop too much.
There was an initiative quite a while ago to start looking at different backends other than Hadoop (e.g. we might be running a backend like SETI@home. Who knows?).
However, this whole thing seems to have been built solely for Hadoop anyway. Not sure about the current direction.
[shrav] I don't think this ties us down to Hadoop in the sense that we can't have other backends. We just reuse some Hadoop code, that's all. The only tie to Hadoop I see is that, at most, we would need to ship the Hadoop jar with Pig, which we already do.
2) Have you tried to measure LocalHadoop startup time compared to the local engine? If LocalHadoop takes much more time to start up, we might suffer when processing nested queries.
[shrav] LocalHadoop has a startup time of about 6 seconds. But even when we are processing as little as 10 MB of data, LocalHadoop mysteriously beats the local engine hands down. For the local engine, I presumed that it would just take the leaf operator, which will be a POStore, and call its store() method.
For about 12 MB of data, LocalHadoop took about 11 seconds whereas the local engine took about 15 seconds.
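To put those two timings in perspective, here is a rough back-of-the-envelope sketch. It assumes both engines scale linearly with input size and that the local engine has no fixed startup cost; neither assumption was measured. StartupBreakEven and its methods are hypothetical helpers, not Pig code.

```java
import java.util.Locale;

// Hypothetical helper, not part of Pig: a linear cost model fitted to the
// timings quoted in this comment (6 s startup, 11 s vs 15 s for 12 MB).
final class StartupBreakEven {

    /** Seconds spent per MB once the fixed startup cost is subtracted. */
    static double secondsPerMb(double totalSeconds, double startupSeconds, double megabytes) {
        return (totalSeconds - startupSeconds) / megabytes;
    }

    /**
     * Input size (MB) at which engine A and engine B take the same time.
     * Assumes A has the higher startup cost and the lower per-MB cost.
     */
    static double breakEvenMb(double startupA, double perMbA, double startupB, double perMbB) {
        return (startupA - startupB) / (perMbB - perMbA);
    }

    public static void main(String[] args) {
        // LocalHadoop: 11 s total for 12 MB, of which about 6 s is startup.
        double hadoopPerMb = secondsPerMb(11.0, 6.0, 12.0);
        // Local engine: 15 s total for 12 MB, startup assumed to be zero.
        double localPerMb = secondsPerMb(15.0, 0.0, 12.0);
        double mb = breakEvenMb(6.0, hadoopPerMb, 0.0, localPerMb);
        System.out.println(String.format(Locale.ROOT, "break-even at about %.1f MB", mb));
    }
}
```

Under these assumptions the break-even point is roughly 7 MB, which is at least consistent with LocalHadoop already being ahead at around 10 MB.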
As far as the nested plan in foreach goes, at least currently, we won't be creating an instance of a local engine to run the nested plan. Currently, all operators that can be used inside the nested plan have been implemented such that the generic plan execution model, with attachInputs called on the inner plan, will work fine. However, if we decide to allow all the operators inside the nested plan, then we will have to make changes to the MRCompiler so that the nested foreach becomes a blocking operator, handled separately by spawning new MR jobs to process the plan inside. In this case, invoking LocalHadoop would probably not make sense. The executable operator plan is a better option here, as it would also mean that there would not be any changes to the MRCompiler.
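To illustrate what I mean by the generic plan execution model, here is a toy sketch of attaching each outer tuple to the inner plan and pulling the result from its leaf. The Op class, its attachInput/getNext methods and the forEach helper are simplified stand-ins I made up for this comment, not Pig's actual operator classes or signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model only: names and signatures are assumptions, not Pig's real API.
final class NestedPlanSketch {

    /** A non-blocking operator: one tuple in, one tuple out. */
    static final class Op {
        private final Op input;                                 // upstream operator, null for the plan root
        private final Function<List<Object>, List<Object>> fn;  // per-tuple transformation
        private List<Object> attached;                          // tuple attached by the enclosing foreach

        Op(Op input, Function<List<Object>, List<Object>> fn) {
            this.input = input;
            this.fn = fn;
        }

        void attachInput(List<Object> tuple) {
            attached = tuple;
        }

        /** The root consumes the attached tuple; every other operator pulls from upstream. */
        List<Object> getNext() {
            List<Object> in;
            if (input == null) {
                in = attached;
                attached = null;
            } else {
                in = input.getNext();
            }
            return in == null ? null : fn.apply(in);
        }
    }

    /** Runs the inner plan once per outer tuple: attach at the root, pull from the leaf. */
    static List<List<Object>> forEach(List<List<Object>> outer, Op root, Op leaf) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : outer) {
            root.attachInput(tuple);
            List<Object> result = leaf.getNext();
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }

    static List<List<Object>> demo() {
        // Inner plan: project column 1, then double it.
        Op project = new Op(null, t -> Arrays.<Object>asList(t.get(1)));
        Op doubled = new Op(project, t -> Arrays.<Object>asList((Integer) t.get(0) * 2));
        List<List<Object>> outer = Arrays.asList(
                Arrays.<Object>asList("a", 1),
                Arrays.<Object>asList("b", 2));
        return forEach(outer, project, doubled);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[2], [4]]
    }
}
```

This only works because every operator in the toy plan produces its output from a single attached tuple. A blocking operator inside the inner plan would need all of its input before emitting anything, which is why that case would have to be handled separately.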
So, at least for now, LocalJobRunner will not be invoked inside the MapReduce execution for executing nested plans. The LocalJobRunner will be used only when the user is in local execution mode.
I will update the wiki with these comments.
Thanks for the input, Pi. I had not thought about the nested foreach once it becomes full-blown.