Pig / PIG-1337

Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The Zebra storage layer needs to use distributed cache to reduce name node load during job runs.

      To do this, Zebra needs to set up distributed-cache-related configuration information in TableLoader (which extends Pig's LoadFunc).
      It is doing this within getSchema(conf). The problem is that the conf object passed there is not the one that is serialized to the map/reduce backend, so the distributed cache is not set up properly.

      To work around this problem, Pig's LoadFunc needs to provide a way to set distributed cache information in a conf object that is actually the one used by the map/reduce backend.
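
      The failing pattern looks roughly like the following. This is only a sketch: the class name and file paths are placeholders and the real TableLoader signature may differ, but the DistributedCache call and the Configuration it receives are the crux of the problem.

          // Sketch only: placeholder class and paths; the real TableLoader
          // extends LoadFunc and computes an actual schema here.
          import java.net.URI;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.filecache.DistributedCache;

          public class CacheAwareLoader {
              public void getSchema(Configuration conf) throws Exception {
                  // Intended effect: ship Zebra's meta files via the distributed
                  // cache so tasks do not hit the name node at start-up.
                  DistributedCache.addCacheFile(
                          new URI("/zebra/meta/schema#schema"), conf);
                  // Problem: this conf is a front-end copy. It is not the
                  // Configuration that Pig serializes and submits to the
                  // map/reduce backend, so the cache entry never reaches tasks.
              }
          }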

          Activity

          Chao Wang added a comment -

          This may also relate to https://issues.apache.org/jira/browse/MAPREDUCE-1620
          "Hadoop should serialize the Configration after the call to getSplits() to the backend such that any changes to the Configuration in getSplits() is serialized to the backend"

          But a cleaner solution on Pig's side is still worthwhile, so we can rely on Pig's front-end-only calls, such as getSchema(), to do the setup.

          Pradeep Kamath added a comment -

          My worry with doing these kinds of job-related updates to the Job in getSchema() is that getSchema() is currently designed to be a pure getter without any indirect "set" side effects; this is noted in the javadoc:

              /**
               * Get a schema for the data to be loaded.  
               * @param location Location as returned by 
               * {@link LoadFunc#relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)}
               * @param job The {@link Job} object - this should be used only to obtain 
               * cluster properties through {@link Job#getConfiguration()} and not to set/query
               * any runtime job information.  
          ...
          

          We should be careful in opening this up to allow set capability - something to consider before designing a fix for this issue.

          Chao Wang added a comment -

          It's OK for us not to use getSchema() for this purpose, since it's a pure getter method.

          What we need is simply a setter method in LoadFunc through which we can set up the distributed cache. Pig needs to ensure that this information actually ends up in the job configuration that is passed to the Hadoop backend.
          Also, this setter method should be invoked only at Pig's frontend. In the case of one map/reduce job containing multiple LoadFunc instances, Pig may need to combine the distributed cache configuration information from all instances.

          Also, note that using the UDFContext to convey information from frontend to backend does not work here. The job configuration needs to already contain all the distributed-cache-related information when it is passed to the Hadoop backend.
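
          To make the last point concrete, here is a sketch of the UDFContext route and why it falls short; the property name and the loader class are made up for illustration:

              // Illustration only: MyLoaderFrontEnd stands in for a real LoadFunc.
              import java.util.Properties;
              import org.apache.pig.impl.util.UDFContext;

              public class MyLoaderFrontEnd {
                  static void stashCachePaths() {
                      Properties p = UDFContext.getUDFContext()
                              .getUDFProperties(MyLoaderFrontEnd.class);
                      p.setProperty("zebra.cache.files", "/zebra/meta/schema");
                      // These properties only become visible once the task
                      // deserializes the UDFContext on the backend -- too late for
                      // DistributedCache, whose entries must already be in the job
                      // Configuration at submission time.
                  }
              }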

          Pradeep Kamath added a comment -

          We may need to add a new method - "addToDistributedCache()" - on LoadFunc. Notice this is an adder, not a setter, since there is only one key for the distributed cache in Hadoop's Job (the Configuration in the Job). So LoadFunc implementations will have to use the DistributedCache.add*() methods.
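
          A rough sketch of what such a hook could look like; the method name and its placement follow this proposal only, and the cached path is a placeholder:

              // Hypothetical hook; LoadFuncWithCache stands in for LoadFunc.
              import java.net.URI;
              import org.apache.hadoop.filecache.DistributedCache;
              import org.apache.hadoop.mapreduce.Job;

              public abstract class LoadFuncWithCache {
                  /**
                   * Called once on the front end with the Job that will actually
                   * be submitted. Implementations add to (never overwrite) the
                   * cache, since every loader in the script shares the same
                   * Configuration keys.
                   */
                  public void addToDistributedCache(Job job) throws Exception {
                      DistributedCache.addCacheFile(
                              new URI("/zebra/meta/schema#schema"),
                              job.getConfiguration());
                  }
              }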

          Scott Carey added a comment -

          Why not just allow a loader (or storer) the ability to set things on a conf object directly? DistributedCache won't be the only thing that I'll want access to. I don't think Pig will want to add new functions every time a Hadoop feature comes along that one wants access to.

          Right now, users can set anything they want with properties on the script command line, but have no ability to set them in compiled code! This seems backwards to me. A custom LoadFunc or StoreFunc should either have access to the configuration that gets serialized for the job, or have the ability to return a Configuration object with settings it wishes Pig to pass on (Pig can then ignore or overwrite things that a user should never touch, similar to what happens with command line params).

          Perhaps either a:

          void configure(Configuration config);

          method or

          Configuration getCustomConfiguration();

          method would be great. The names for the loader and storer may have to differ so as not to collide for classes that implement both, and they should not share the method since disambiguation would be a problem (a load and a store may not both want the distributed cache, for example).
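
          As a sketch of the second variant, Pig could merge whatever Configuration the loader returns into the job's Configuration, filtering keys a user should never touch; the protected key list and merge policy below are invented for illustration:

              // Illustration only: protected keys and merge policy are made up.
              import java.util.Arrays;
              import java.util.HashSet;
              import java.util.Map;
              import java.util.Set;
              import org.apache.hadoop.conf.Configuration;

              class LoaderConfMerger {
                  // Keys Pig would refuse to let a loader override, mirroring
                  // what it already does for script command line properties.
                  private static final Set<String> PROTECTED = new HashSet<String>(
                          Arrays.asList("mapred.job.tracker", "fs.default.name"));

                  static void merge(Configuration fromLoader, Configuration jobConf) {
                      for (Map.Entry<String, String> e : fromLoader) {
                          if (!PROTECTED.contains(e.getKey())) {
                              jobConf.set(e.getKey(), e.getValue());
                          }
                      }
                  }
              }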

          Alan Gates added a comment -

          The problem with allowing load and store functions access to the config file is that the config file they see is not the config file that goes to Hadoop. This is not all Pig's fault (see comments above on this). The other problem is that multiple instances of the same load and store function may be operating in a given script, so there are namespace issues to resolve.

          The proposal for Hadoop 0.22 is that rather than providing access to the config file at all Hadoop will serialize objects such as InputFormat and OutputFormat and pass those to the backend. It will make sense for Pig to follow suit and serialize all UDFs on the front end. This will remove the need for the UDFContext black magic that we do at the moment and should allow all UDFs to easily transfer information from front end to backend.

          So, hopefully this can get resolved when Pig migrates to Hadoop 0.22, whenever that is.


            People

            • Assignee: Unassigned
            • Reporter: Chao Wang
            • Votes: 3
            • Watchers: 4
