Pig
  1. Pig
  2. PIG-3199

Provide a method to retriever name of loader/storer in PigServer

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.12.0
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      api

      Description

      LogicalPlan could be exposed to user in order for one to make validations based on it. For eg, one could get Load/Store paths or other operators and be able to perform checks such as whether I/O paths are valid etc.

      1. PIG-3199.patch
        3 kB
        Prashant Kommireddi
      2. PIG-3199_2.patch
        9 kB
        Prashant Kommireddi

        Activity

        Daniel Dai made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Daniel Dai made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Resolution Fixed [ 1 ]
        Hide
        Daniel Dai added a comment -

        Patch committed to trunk.

        Show
        Daniel Dai added a comment - Patch committed to trunk.
        Daniel Dai made changes -
        Summary Expose LogicalPlan via PigServer API Provide a method to retriever name of loader/storer in PigServer
        Hide
        Daniel Dai added a comment -

        Looks fine for me. We are not exposing LogicalPlan, but only loader/storer name in the new patch. I will commit the patch with Cheolsoo's suggested change shortly.

        Show
        Daniel Dai added a comment - Looks fine for me. We are not exposing LogicalPlan, but only loader/storer name in the new patch. I will commit the patch with Cheolsoo's suggested change shortly.
        Hide
        Cheolsoo Park added a comment -

        The patch seems fine to me except the following line:

        +        assertTrue("Source must end with input.txt", lData.getSinks().get(0).endsWith("output"));
        

        It should say "Sink must end with output" instead.

        Nevertheless, I will wait for more feedback from other committers given this is quite controversial.

        Show
        Cheolsoo Park added a comment - The patch seems fine to me except the following line: + assertTrue( "Source must end with input.txt" , lData.getSinks().get(0).endsWith( "output" )); It should say "Sink must end with output" instead. Nevertheless, I will wait for more feedback from other committers given this is quite controversial.
        Hide
        Prashant Kommireddi added a comment -

        Ping!

        Show
        Prashant Kommireddi added a comment - Ping!
        Prashant Kommireddi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Tags minor, api api
        Prashant Kommireddi made changes -
        Attachment PIG-3199_2.patch [ 12577727 ]
        Hide
        Prashant Kommireddi added a comment -

        Uploading a new patch that returns a LogicalPlanData object that contains information regarding the LogicalPlan (sources, sinks, load/store func names). None of the pig internals are exposed, and any additional LP data can be added to this new class in the future. The idea behind this is that the information can be extracted before Pig moves on to generating Physical and MR plans.

        Show
        Prashant Kommireddi added a comment - Uploading a new patch that returns a LogicalPlanData object that contains information regarding the LogicalPlan (sources, sinks, load/store func names). None of the pig internals are exposed, and any additional LP data can be added to this new class in the future. The idea behind this is that the information can be extracted before Pig moves on to generating Physical and MR plans.
        Prashant Kommireddi made changes -
        Patch Info Patch Available [ 10042 ]
        Hide
        Prashant Kommireddi added a comment -

        Dmitriy V. Ryaboy please let me know if you think otherwise?

        Show
        Prashant Kommireddi added a comment - Dmitriy V. Ryaboy please let me know if you think otherwise?
        Hide
        Prashant Kommireddi added a comment -

        Mainly because we don't need the extra steps of running the optimizer, generating Physical plan, generating MR plan to get to this information. It just feels querying LP for source/sink or load/store funcs is more efficient. Would be happy to get your thoughts?

        Show
        Prashant Kommireddi added a comment - Mainly because we don't need the extra steps of running the optimizer, generating Physical plan, generating MR plan to get to this information. It just feels querying LP for source/sink or load/store funcs is more efficient. Would be happy to get your thoughts?
        Hide
        Dmitriy V. Ryaboy added a comment -

        Seriously though, why not PPNL?

        Show
        Dmitriy V. Ryaboy added a comment - Seriously though, why not PPNL?
        Prashant Kommireddi made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Alan Gates added a comment -

        Keep this one, that way the history of the discussion is all together.

        Show
        Alan Gates added a comment - Keep this one, that way the history of the discussion is all together.
        Hide
        Prashant Kommireddi added a comment -

        That makes sense, I will work on implementing the above proposal. Thanks Alan.

        Should we change this JIRA title to reflect that or open a new one?

        Show
        Prashant Kommireddi added a comment - That makes sense, I will work on implementing the above proposal. Thanks Alan. Should we change this JIRA title to reflect that or open a new one?
        Hide
        Alan Gates added a comment -

        If the PigNotificationListener doesn't work for you I think getLoadPaths()/getStorePaths() is fine. I was thinking of proposing that when I remembered the initialPlanNotification stuff. You might want to return a class so it can contain the names of the load/store funcs as well and so you can include more info later.

        Show
        Alan Gates added a comment - If the PigNotificationListener doesn't work for you I think getLoadPaths()/getStorePaths() is fine. I was thinking of proposing that when I remembered the initialPlanNotification stuff. You might want to return a class so it can contain the names of the load/store funcs as well and so you can include more info later.
        Hide
        Prashant Kommireddi added a comment -

        What are your thoughts on exposing the Load/Store paths as strings instead of exposing any of the internal interfaces? That way we have a very neat way of exposing these to users

        public List<String> getLoadPaths();
        public List<String> getStorePaths();
        
        Show
        Prashant Kommireddi added a comment - What are your thoughts on exposing the Load/Store paths as strings instead of exposing any of the internal interfaces? That way we have a very neat way of exposing these to users public List< String > getLoadPaths(); public List< String > getStorePaths();
        Hide
        Alan Gates added a comment -

        Take a look at PigNotificationListener.initialPlanNotification. It seems like this will give you what you want, since each of the sources and sinks for the MR jobs are in here. To my chagrin this exposes the MR plan. Someone snuck that past me.

        Show
        Alan Gates added a comment - Take a look at PigNotificationListener.initialPlanNotification. It seems like this will give you what you want, since each of the sources and sinks for the MR jobs are in here. To my chagrin this exposes the MR plan. Someone snuck that past me.
        Hide
        Prashant Kommireddi added a comment -

        Ok, I can understand the concerns and agree with you on not exposing LP.

        What I stated earlier is exactly the info I need - using PigServer register script/query and be able to determine the source/sinks. For eg, in the following snippet I read a script "script.pig". I don't control the contents of the script (written by other users) and as an admin would like to make sure I can read the I/O paths being used here.

        pigServer = new PigServer(ExecType.MAPREDUCE, conf);
        InputStream in = fs.open("/foo/script.pig");
        pigServer.registerScript(in);
        

        script.pig

        A = load '/apache/pig/*';
        B = store A into '/google';
        

        The idea with this patch was to be able to determine source (/apache/pig) and sink (/google) and perform certain operations on it. Would be great if you think it can be done in a way better than exposing any of LP/Operator.

        Show
        Prashant Kommireddi added a comment - Ok, I can understand the concerns and agree with you on not exposing LP. What I stated earlier is exactly the info I need - using PigServer register script/query and be able to determine the source/sinks. For eg, in the following snippet I read a script "script.pig". I don't control the contents of the script (written by other users) and as an admin would like to make sure I can read the I/O paths being used here. pigServer = new PigServer(ExecType.MAPREDUCE, conf); InputStream in = fs.open( "/foo/script.pig" ); pigServer.registerScript(in); script.pig A = load '/apache/pig/*'; B = store A into '/google'; The idea with this patch was to be able to determine source (/apache/pig) and sink (/google) and perform certain operations on it. Would be great if you think it can be done in a way better than exposing any of LP/Operator.
        Hide
        Alan Gates added a comment -

        Even making Operator public is dangerous. These are internal structures.

        It would help to understand who you want to expose these to and why. Then we can see if there's a way to get you the information you need. I don't want to stand in the way of innovation but I also don't want Pig's internals exposed to the point that the next time we make a change to our Operator class someone complains because we broke his tool.

        Show
        Alan Gates added a comment - Even making Operator public is dangerous. These are internal structures. It would help to understand who you want to expose these to and why. Then we can see if there's a way to get you the information you need. I don't want to stand in the way of innovation but I also don't want Pig's internals exposed to the point that the next time we make a change to our Operator class someone complains because we broke his tool.
        Hide
        Prashant Kommireddi added a comment -

        It's not exposed at the moment but a part of this patch. I was not aware of plans to make changes to LogicalPlan at the interface level. Just noticed this for OperatorPlan and it makes sense.

        @InterfaceAudience.Private
        @InterfaceStability.Unstable
        

        If that's the case, can we expose only a couple methods to get a handle on source/sinks?

        public List<Operator> getSources();
        
        public List<Operator> getSinks();
        
        Show
        Prashant Kommireddi added a comment - It's not exposed at the moment but a part of this patch. I was not aware of plans to make changes to LogicalPlan at the interface level. Just noticed this for OperatorPlan and it makes sense. @InterfaceAudience.Private @InterfaceStability.Unstable If that's the case, can we expose only a couple methods to get a handle on source/sinks? public List<Operator> getSources(); public List<Operator> getSinks();
        Hide
        Jonathan Coveney added a comment -

        Alan,

        This patch almost exclusively exists to make the LogicalPlan publicly available.

        Prashant,

        Want to discuss further how you want to handle this? I agree that we don't want to put ourselves into a difficult position when it comes to changing things in the future, but maybe there's a compromise.

        Show
        Jonathan Coveney added a comment - Alan, This patch almost exclusively exists to make the LogicalPlan publicly available. Prashant, Want to discuss further how you want to handle this? I agree that we don't want to put ourselves into a difficult position when it comes to changing things in the future, but maybe there's a compromise.
        Hide
        Alan Gates added a comment -

        When you say "now that [the logical plan] is public" do you mean that's already true or it would be true with this patch? If it's already true, where are we exposing it? If it's not true yet, I'm -1 at this point in exposing it. Making that a public interface will severely restrict our ability to make changes at that layer, which we'd like to be to do.

        Show
        Alan Gates added a comment - When you say "now that [the logical plan] is public" do you mean that's already true or it would be true with this patch? If it's already true, where are we exposing it? If it's not true yet, I'm -1 at this point in exposing it. Making that a public interface will severely restrict our ability to make changes at that layer, which we'd like to be to do.
        Hide
        Jonathan Coveney added a comment -

        Given that a logical plan is only available under certain conditions, now that it is public it's probably worth testing for those conditions and throwing an error if they are not met.

        Aside from that this is a pretty small but potentially useful change.

        Show
        Jonathan Coveney added a comment - Given that a logical plan is only available under certain conditions, now that it is public it's probably worth testing for those conditions and throwing an error if they are not met. Aside from that this is a pretty small but potentially useful change.
        Prashant Kommireddi made changes -
        Patch Info Patch Available [ 10042 ]
        Prashant Kommireddi made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Prashant Kommireddi made changes -
        Field Original Value New Value
        Attachment PIG-3199.patch [ 12570008 ]
        Hide
        Prashant Kommireddi added a comment -

        Adding a new method getLogicalPlan() to PigServer that retrieves the lp associated with current DAG.

        Marking Graph.getLogicalPlan() as private since there is no calls being made to it within Pig, and Graph itself is protected. Please let me know if there are any objections to that.

        Also, removing an unnecessary annotation

        @SuppressWarnings("unchecked")

        Added a test testGetLogicalPlan() to TestPigServer.

        Show
        Prashant Kommireddi added a comment - Adding a new method getLogicalPlan() to PigServer that retrieves the lp associated with current DAG. Marking Graph.getLogicalPlan() as private since there is no calls being made to it within Pig, and Graph itself is protected. Please let me know if there are any objections to that. Also, removing an unnecessary annotation @SuppressWarnings( "unchecked" ) Added a test testGetLogicalPlan() to TestPigServer.
        Prashant Kommireddi created issue -

          People

          • Assignee:
            Prashant Kommireddi
            Reporter:
            Prashant Kommireddi
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development