Pig
  1. Pig
  2. PIG-3441

Allow Pig to use default resources from Configuration objects

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11.1
    • Fix Version/s: None
    • Component/s: impl
    • Labels:
      None

      Description

      Pig currently ignores parameters from configuration files added statically to Configuration objects as Configuration.addDefaultResource(filename.xml).

      Consider the following scenario -
      In a hadoop FileSystem driver for a non-HDFS filesystem you load properties specific to that FileSystem in a static initializer block in the class that extends org.apache.hadoop.fs.Filesystem for your FileSystem like below -

      
      class MyFileSystem extends FileSystem {
      
              static {
      		Configuration.addDefaultResource("myfs-default.xml");
      		Configuration.addDefaultResource("myfs-site.xml");
      	}
      
      
      }
      

      Interfaces like the Hadoop CLI, Hive, Hadoop M/R can find configuration parameters defined in these configuration files as long as they are on the classpath.

      However, Pig cannot find parameters from these files, because it ignores configuration files added statically.

      Pig should allow users to specify if they would like pig to read parameters from resources loaded statically.

      1. PIG-3441_1.patch
        4 kB
        Bhooshan Mogal
      2. PIG-3441.patch
        4 kB
        Bhooshan Mogal
      3. PIG-3441-2.patch
        17 kB
        Daniel Dai
      4. PIG-3441-3.patch
        18 kB
        Daniel Dai

        Activity

        Bhooshan Mogal created issue -
        Hide
        Bhooshan Mogal added a comment -

        Attaching a patch that allows users to specify if they want Pig to read parameters from default resources in Configuration objects. With this patch, users can enable default resources by setting the parameter pig.load.default.resources to true. It is set to false by default.

        Please review.

        I have currently added unit tests in the TestHExecutionEngine.java class. Please let me know there is a better place to add these tests.

        Show
        Bhooshan Mogal added a comment - Attaching a patch that allows users to specify if they want Pig to read parameters from default resources in Configuration objects. With this patch, users can enable default resources by setting the parameter pig.load.default.resources to true. It is set to false by default. Please review. I have currently added unit tests in the TestHExecutionEngine.java class. Please let me know there is a better place to add these tests.
        Bhooshan Mogal made changes -
        Field Original Value New Value
        Attachment PIG-3441.patch [ 12600322 ]
        Hide
        Bhooshan Mogal added a comment -

        Created another patch with --no-prefix. No code changes from previous patch.

        Show
        Bhooshan Mogal added a comment - Created another patch with --no-prefix. No code changes from previous patch.
        Bhooshan Mogal made changes -
        Attachment PIG-3441.patch_1 [ 12600329 ]
        Bhooshan Mogal made changes -
        Attachment PIG-3441.patch_1 [ 12600329 ]
        Bhooshan Mogal made changes -
        Attachment PIG-3441_1.patch [ 12600330 ]
        Bhooshan Mogal made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Daniel Dai added a comment -

        Wonder if you can use the mechanism in PIG-3441 to achieve it?

        Show
        Daniel Dai added a comment - Wonder if you can use the mechanism in PIG-3441 to achieve it?
        Hide
        Bhooshan Mogal added a comment -

        Hi Daniel,

        You seem to have pointed to the wrong jira.

        -
        Bhooshan

        Show
        Bhooshan Mogal added a comment - Hi Daniel, You seem to have pointed to the wrong jira. - Bhooshan
        Hide
        Daniel Dai added a comment -

        Oh, sorry, I mean PIG-3135

        Show
        Daniel Dai added a comment - Oh, sorry, I mean PIG-3135
        Hide
        Bhooshan Mogal added a comment -

        Thanks. Yes, I tried https://issues.apache.org/jira/browse/PIG-3135 before. I had a email thread with Prashant regarding this issue. PIG-3135 is for allowing pig to use standard configuration files (core-site, hdfs-site) from a different location. This issue is regarding custom configuration files that one may wish to use. One of the usecases could be the one I mentioned in the description, where a hadoop driver for a storage system (MyFileSystem.java) loads its own configuration files statically. Parameters from configuration files loaded statically in such scenarios are not available in pig scripts/grunt shell.

        Here is a link to the email thread for more context - http://mail-archives.apache.org/mod_mbox/pig-user/201304.mbox/%3CCA+Di8eDdEn65EQmPrp6tT9A9bJ8bhCwhn5ifH7hAM4weA3boGQ@mail.gmail.com%3E

        Show
        Bhooshan Mogal added a comment - Thanks. Yes, I tried https://issues.apache.org/jira/browse/PIG-3135 before. I had a email thread with Prashant regarding this issue. PIG-3135 is for allowing pig to use standard configuration files (core-site, hdfs-site) from a different location. This issue is regarding custom configuration files that one may wish to use. One of the usecases could be the one I mentioned in the description, where a hadoop driver for a storage system (MyFileSystem.java) loads its own configuration files statically. Parameters from configuration files loaded statically in such scenarios are not available in pig scripts/grunt shell. Here is a link to the email thread for more context - http://mail-archives.apache.org/mod_mbox/pig-user/201304.mbox/%3CCA+Di8eDdEn65EQmPrp6tT9A9bJ8bhCwhn5ifH7hAM4weA3boGQ@mail.gmail.com%3E
        Hide
        Daniel Dai added a comment -

        Seems you can pass custom configuration to Pig the same way PIG-3135 does:

        Configuration conf = new Configuration(true); // Which takes your default resource
        PigContext pigContext = new PigContext(ExecType.MAPREDUCE, conf);
        

        Anything I miss?

        Show
        Daniel Dai added a comment - Seems you can pass custom configuration to Pig the same way PIG-3135 does: Configuration conf = new Configuration( true ); // Which takes your default resource PigContext pigContext = new PigContext(ExecType.MAPREDUCE, conf); Anything I miss?
        Hide
        Bhooshan Mogal added a comment -

        Could you please tell me where the code snippet you mentioned is from?

        A lot of code-flows in pig seem to re-create configuration objects by loading from Properties as well in the ConfigurationUtil.toConfiguration() method.

        In this method, I saw that default resources are ignored as -

        public static Configuration toConfiguration(Properties properties) {
                assert properties != null;
                final Configuration config = new Configuration(false);
                final Enumeration<Object> iter = properties.keys();
        ...
        

        Due to this, Pig was unable to read from custom resources added statically. The patch addresses this by allowing users to create the Configuration object in this method with loadDefaults set to true, based on a pig property.

        Show
        Bhooshan Mogal added a comment - Could you please tell me where the code snippet you mentioned is from? A lot of code-flows in pig seem to re-create configuration objects by loading from Properties as well in the ConfigurationUtil.toConfiguration() method. In this method, I saw that default resources are ignored as - public static Configuration toConfiguration(Properties properties) { assert properties != null ; final Configuration config = new Configuration( false ); final Enumeration< Object > iter = properties.keys(); ... Due to this, Pig was unable to read from custom resources added statically. The patch addresses this by allowing users to create the Configuration object in this method with loadDefaults set to true, based on a pig property.
        Hide
        Daniel Dai added a comment -

        This is from the test case of PIG-3135. If the custom configuration get lost along the way, I wonder why PIG-3135 works. Seems they should share the same issue.

        Show
        Daniel Dai added a comment - This is from the test case of PIG-3135 . If the custom configuration get lost along the way, I wonder why PIG-3135 works. Seems they should share the same issue.
        Hide
        Bhooshan Mogal added a comment -

        The test cases of PIG-3135 pass a false to the Configuration constructor.

        Also, PIG-3135 calls ConfigurationUtil.toConfiguration as well, which creates the configuration object with loadDefaults set to false like in my previous comment.

        I tried using PIG-3135 and setting pig.use.overriden.hadoop.configs to true, however, pig did not read from the custom configuration files added statically. When I changed it to set loadDefaults to true, it worked fine.

        Show
        Bhooshan Mogal added a comment - The test cases of PIG-3135 pass a false to the Configuration constructor. Also, PIG-3135 calls ConfigurationUtil.toConfiguration as well, which creates the configuration object with loadDefaults set to false like in my previous comment. I tried using PIG-3135 and setting pig.use.overriden.hadoop.configs to true, however, pig did not read from the custom configuration files added statically. When I changed it to set loadDefaults to true, it worked fine.
        Hide
        Daniel Dai added a comment -

        I am not doubting your case does not work, but curious to know why PIG-3135 works. Seems both are trying to pass a custom configuration in. In PIG-3135, it pass some handcoded entries, and you want construct Configuration(true), both then pass the config object to Pig. If Pig does take the config object, then both case work, if not, both case fail. I do want to solve both issue in one consistent way if possible.

        Show
        Daniel Dai added a comment - I am not doubting your case does not work, but curious to know why PIG-3135 works. Seems both are trying to pass a custom configuration in. In PIG-3135 , it pass some handcoded entries, and you want construct Configuration(true), both then pass the config object to Pig. If Pig does take the config object, then both case work, if not, both case fail. I do want to solve both issue in one consistent way if possible.
        Hide
        Bhooshan Mogal added a comment -

        I sort of see your point now. It seems like PIG-3135 would work only if resources are added to Configuration objects as confObject.addResource() and not Configuration.addDefaultResource(), since loadDefaults is set to false in ConfigurationUtil.toConfiguration()?

        Unless the standard configuration files are added as configObject.addResource() somewhere in the code?

        Show
        Bhooshan Mogal added a comment - I sort of see your point now. It seems like PIG-3135 would work only if resources are added to Configuration objects as confObject.addResource() and not Configuration.addDefaultResource(), since loadDefaults is set to false in ConfigurationUtil.toConfiguration()? Unless the standard configuration files are added as configObject.addResource() somewhere in the code?
        Hide
        Bhooshan Mogal added a comment -

        Daniel, any updates regarding this? Do you think there is a better way we could fix this issue?

        Show
        Bhooshan Mogal added a comment - Daniel, any updates regarding this? Do you think there is a better way we could fix this issue?
        Hide
        Bhooshan Mogal added a comment -

        Daniel Dai Prashant Kommireddi any thoughts on this?

        Show
        Bhooshan Mogal added a comment - Daniel Dai Prashant Kommireddi any thoughts on this?
        Hide
        Daniel Dai added a comment -

        Bhooshan Mogal, I need to make sure we are talking about the same thing. Here is the thing I know should work:

        1. If you are embedding Pig in Java, you can pass a conf object to PigServer:

        Configuration conf = ..... // make conf whatever you want, include Configuration.addDefaultResource()
        PigServer pServer = new PigServer(ExecType.LOCAL, conf);  // conf will be materialized and pass to Pig
        

        Pig will take the configuration you pass

        2. Sequence of the config:
        Pig currently will use its predefined config file to override the conf object you pass to PigServer, so if those config files contains the entry in your conf object, it will get overridden. PIG-3135 address this issue by setting a flag not override those entries, but this requires conf object contains every entries needed (include those should be in core-site, hdfs-site, mapred-site)

        So in your case, if you use Pig embedding, you can instantiate the conf object and pass to Pig. If you are not using Pig embedding, then things could be different. Let me know if that is the case.

        Show
        Daniel Dai added a comment - Bhooshan Mogal , I need to make sure we are talking about the same thing. Here is the thing I know should work: 1. If you are embedding Pig in Java, you can pass a conf object to PigServer: Configuration conf = ..... // make conf whatever you want, include Configuration.addDefaultResource() PigServer pServer = new PigServer(ExecType.LOCAL, conf); // conf will be materialized and pass to Pig Pig will take the configuration you pass 2. Sequence of the config: Pig currently will use its predefined config file to override the conf object you pass to PigServer, so if those config files contains the entry in your conf object, it will get overridden. PIG-3135 address this issue by setting a flag not override those entries, but this requires conf object contains every entries needed (include those should be in core-site, hdfs-site, mapred-site) So in your case, if you use Pig embedding, you can instantiate the conf object and pass to Pig. If you are not using Pig embedding, then things could be different. Let me know if that is the case.
        Hide
        Bhooshan Mogal added a comment -

        Daniel Dai,

        Apologies for the delayed response. I am not using Pig embedding. I am using a pig script to talk to a hadoop filesystem implementation. This implementation loads configuration resources by statically adding them as Configuration.addDefaultResource('filename.xml').

        I think the sequence you mentioned in your comment probably explains this issue. It could be that since these resources are not part of the conf object (as I said earlier, they are added statically), PIG-3135 does not help fix this issue.

        Could you help me understand why Pig expects the resources to be in every object, and ignores the ones added statically?

        Show
        Bhooshan Mogal added a comment - Daniel Dai , Apologies for the delayed response. I am not using Pig embedding. I am using a pig script to talk to a hadoop filesystem implementation. This implementation loads configuration resources by statically adding them as Configuration.addDefaultResource('filename.xml'). I think the sequence you mentioned in your comment probably explains this issue. It could be that since these resources are not part of the conf object (as I said earlier, they are added statically), PIG-3135 does not help fix this issue. Could you help me understand why Pig expects the resources to be in every object, and ignores the ones added statically?
        Hide
        Bhooshan Mogal added a comment -

        Daniel Dai any thoughts?

        Show
        Bhooshan Mogal added a comment - Daniel Dai any thoughts?
        Hide
        Daniel Dai added a comment -

        Do you use Pig command line to run Pig? Sounds like you are not, since you want to invoke Configuration.addDefaultResource('filename.xml') in your Java code?

        Show
        Daniel Dai added a comment - Do you use Pig command line to run Pig? Sounds like you are not, since you want to invoke Configuration.addDefaultResource('filename.xml') in your Java code?
        Hide
        Bhooshan Mogal added a comment -

        Yes, I use the Pig command line to run Pig.

        The Configuration.addDefaultResource('filename.xml') is in the java code of my filesystem implementation. Its a class that extends org.apache.hadoop.fs.FileSystem. I am trying to use this filesystem in my pig script.

        Show
        Bhooshan Mogal added a comment - Yes, I use the Pig command line to run Pig. The Configuration.addDefaultResource('filename.xml') is in the java code of my filesystem implementation. Its a class that extends org.apache.hadoop.fs.FileSystem. I am trying to use this filesystem in my pig script.
        Hide
        Daniel Dai added a comment -

        I see. Not sure how do you tell Hadoop to take filename.xml, are you able to run mapreduce with filename.xml?

        Show
        Daniel Dai added a comment - I see. Not sure how do you tell Hadoop to take filename.xml, are you able to run mapreduce with filename.xml?
        Hide
        Bhooshan Mogal added a comment -

        Yes, we have the Hadoop CLI, map-reduce jobs, hive, streaming all working. Even Pig works, but it requires us to add the resource to every instance like conf.addResource('filename.xml'). Configuration.addDefaultResource('filename.xml') does not work because Pig seems to create configuration objects by passing a false for the loadDefaults member in Configuration.java.

        To your other question, filename.xml is in the HADOOP_CLASSPATH, so hadoop can find it.

        Show
        Bhooshan Mogal added a comment - Yes, we have the Hadoop CLI, map-reduce jobs, hive, streaming all working. Even Pig works, but it requires us to add the resource to every instance like conf.addResource('filename.xml') . Configuration.addDefaultResource('filename.xml') does not work because Pig seems to create configuration objects by passing a false for the loadDefaults member in Configuration.java. To your other question, filename.xml is in the HADOOP_CLASSPATH, so hadoop can find it.
        Hide
        Daniel Dai added a comment -

        I know filename.xml is in the HADOOP_CLASSPATH, but how Hadoop know it need to pickup filename.xml? You add Configuration.addDefaultResource('filename.xml') in your mapreduce program?

        Show
        Daniel Dai added a comment - I know filename.xml is in the HADOOP_CLASSPATH, but how Hadoop know it need to pickup filename.xml? You add Configuration.addDefaultResource('filename.xml') in your mapreduce program?
        Hide
        Bhooshan Mogal added a comment -

        Like in the description of this jira, I have a class that extends FileSystem.java for a particular filesystem - Say MyFileSystem.java. Say this class handles URIs of the form xyz://... I have Configuration.addDefaultResource('filename.xml') in this class.
        I have also set fs.xyz.impl to MyFileSystem in core-site.xml. As a result, in mapreduce programs, hadoop CLI, hive, pig can load up MyFileSystem.java when they see URLs like xyz://... in input/output paths.

        Show
        Bhooshan Mogal added a comment - Like in the description of this jira, I have a class that extends FileSystem.java for a particular filesystem - Say MyFileSystem.java. Say this class handles URIs of the form xyz://... I have Configuration.addDefaultResource('filename.xml') in this class. I have also set fs.xyz.impl to MyFileSystem in core-site.xml. As a result, in mapreduce programs, hadoop CLI, hive, pig can load up MyFileSystem.java when they see URLs like xyz://... in input/output paths.
        Hide
        Daniel Dai added a comment -

        Get your point. Pig does not use the default config files, HExecutionEngine.getExecConf hardcode the config entries. This part need to be changed.

        Show
        Daniel Dai added a comment - Get your point. Pig does not use the default config files, HExecutionEngine.getExecConf hardcode the config entries. This part need to be changed.
        Hide
        Bhooshan Mogal added a comment -

        Thanks Daniel. The patch I had attached sometime ago does fix the problem. Please let me know if there is a better way to fix this, I could work with you to get it done.

        Show
        Bhooshan Mogal added a comment - Thanks Daniel. The patch I had attached sometime ago does fix the problem. Please let me know if there is a better way to fix this, I could work with you to get it done.
        Hide
        Daniel Dai added a comment -

        Bhooshan Mogal I see several more places we instantiate Configuration without default config files (eg, HExecutionEngine:111), not sure if we need to change those as well. Need to dig into the configuration propagation process more, it is quite complicated right now.

        Show
        Daniel Dai added a comment - Bhooshan Mogal I see several more places we instantiate Configuration without default config files (eg, HExecutionEngine:111), not sure if we need to change those as well. Need to dig into the configuration propagation process more, it is quite complicated right now.
        Hide
        Bhooshan Mogal added a comment -

        Daniel Dai, yes, I agree. There are a lot of places where Configuration objects are re-created in Pig. I tried a bunch of them, but this particular instance - ConfigurationUtil.toConfiguration() seemed to fix the problem for me. This method is also called at multiple places.

        Show
        Bhooshan Mogal added a comment - Daniel Dai , yes, I agree. There are a lot of places where Configuration objects are re-created in Pig. I tried a bunch of them, but this particular instance - ConfigurationUtil.toConfiguration() seemed to fix the problem for me. This method is also called at multiple places.
        Hide
        Mona Chitnis added a comment -

        I see TezLauncher is using the boolean loadDefaults as true. So we can apply same thing in MapReduceLauncher in addition to Daniel's comment about HExecutionEngine:111

                 @Override
            public PigStats launchPig(PhysicalPlan php, String grpName, PigContext pc) throws Exception {
                aggregateWarning = Boolean.parseBoolean(pc.getProperties().getProperty("aggregate.warning", "false"));
                Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties(), true);
        
        Show
        Mona Chitnis added a comment - I see TezLauncher is using the boolean loadDefaults as true. So we can apply same thing in MapReduceLauncher in addition to Daniel's comment about HExecutionEngine:111 @Override public PigStats launchPig(PhysicalPlan php, String grpName, PigContext pc) throws Exception { aggregateWarning = Boolean .parseBoolean(pc.getProperties().getProperty( "aggregate.warning" , " false " )); Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties(), true );
        Hide
        Bhooshan Mogal added a comment -

        Mona Chitnis, thanks for looking into this issue. If Daniel Dai also agrees that these are the only 3 places that need to be changed, I can submit an updated patch.

        The initial patch allows users to configure this behavior based on a property, do you think it is necessary to make this configurable or should we set loadDefaults to true always?

        Show
        Bhooshan Mogal added a comment - Mona Chitnis , thanks for looking into this issue. If Daniel Dai also agrees that these are the only 3 places that need to be changed, I can submit an updated patch. The initial patch allows users to configure this behavior based on a property, do you think it is necessary to make this configurable or should we set loadDefaults to true always?
        Hide
        Daniel Dai added a comment -

        I am not absolutely sure but if we are using in Tez, it is probably Ok. We can follow what Mona suggested to change MapReduceLauncher.launchPig. We don't need to use a flag, but ConfigurationUtil.toConfiguration still shall by default not load defaults.

        Show
        Daniel Dai added a comment - I am not absolutely sure but if we are using in Tez, it is probably Ok. We can follow what Mona suggested to change MapReduceLauncher.launchPig. We don't need to use a flag, but ConfigurationUtil.toConfiguration still shall by default not load defaults.
        Hide
        Bhooshan Mogal added a comment -

        Hi Daniel Dai,

        I tried changing MapReduceLauncher.launchPig as

        -        Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties());
        +        Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties(), true);
        

        However, the usecase still does not work. Like before, it ignores default resources. I was about to test by changing ConfigurationUtil.toConfiguration as

        -        return toConfiguration(properties, false);
        +        return toConfiguration(properties, true);
        

        However, looks like the Pig trunk build broke due to this commit in Tez yesterday - https://github.com/apache/incubator-tez/commit/97805278ed9b7e9f8a20521574b22515d1c553e7

        A bunch of constants that Pig still refers to got removed from TezJobConfig in this commit.

        Is there a ticket for this already? If not, I can create one.

        Thanks,
        Bhooshan

        Show
        Bhooshan Mogal added a comment - Hi Daniel Dai , I tried changing MapReduceLauncher.launchPig as - Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties()); + Configuration conf = ConfigurationUtil.toConfiguration(pc.getProperties(), true ); However, the usecase still does not work. Like before, it ignores default resources. I was about to test by changing ConfigurationUtil.toConfiguration as - return toConfiguration(properties, false ); + return toConfiguration(properties, true ); However, looks like the Pig trunk build broke due to this commit in Tez yesterday - https://github.com/apache/incubator-tez/commit/97805278ed9b7e9f8a20521574b22515d1c553e7 A bunch of constants that Pig still refers to got removed from TezJobConfig in this commit. Is there a ticket for this already? If not, I can create one. Thanks, Bhooshan
        Hide
        Bhooshan Mogal added a comment -

        Hi Daniel Dai,

        My pig build succeeded now. I just double checked, just making the change in MapreduceLauncher.java does not fix the usecase, the default resources are still not available. However, when I change ConfigurationUtil.toConfiguration as in the previous comment, it works.

        Thanks,
        Bhooshan

        Show
        Bhooshan Mogal added a comment - Hi Daniel Dai , My pig build succeeded now. I just double checked, just making the change in MapreduceLauncher.java does not fix the usecase, the default resources are still not available. However, when I change ConfigurationUtil.toConfiguration as in the previous comment, it works. Thanks, Bhooshan
        Hide
        Bhooshan Mogal added a comment -

        Daniel Dai, any thoughts on this?

        Show
        Bhooshan Mogal added a comment - Daniel Dai , any thoughts on this?
        Hide
        Daniel Dai added a comment -

        Sorry for late reply. It's probably ok for a temporary fix. Let me run some tests to verify.

        Show
        Daniel Dai added a comment - Sorry for late reply. It's probably ok for a temporary fix. Let me run some tests to verify.
        Hide
        Bhooshan Mogal added a comment -

        Thanks. Let me know if you want me to generate an updated patch for trunk.

        Show
        Bhooshan Mogal added a comment - Thanks. Let me know if you want me to generate an updated patch for trunk.
        Hide
        Daniel Dai added a comment -

        Finally get a chance to look into it. What I find is HExecutionEngine.init will instantiate the Configuration with defaults, and put all entries into PigContext.properites. Every time Pig needs a Configuration object, it will create a Configuration object with no defaults and inject PigContext.properites, and thus Configuration should contains default resources. My guess for this issue is at the time Pig invokes HExecutionEngine.init, Configuration.addDefaultResource("myfs-site.xml") it not hit yet, so we miss it in PigContext.properites. I'd like to fix the issue by updating PigContext.properites after the new default resources is added. Attached is a patch which update PigContext.properites again before we launch Pig job (the patch also includes several clean up for the configuration part). Can you try if it works for your issue?

        Show
        Daniel Dai added a comment - Finally get a chance to look into it. What I find is HExecutionEngine.init will instantiate the Configuration with defaults, and put all entries into PigContext.properites. Every time Pig needs a Configuration object, it will create a Configuration object with no defaults and inject PigContext.properites, and thus Configuration should contains default resources. My guess for this issue is at the time Pig invokes HExecutionEngine.init, Configuration.addDefaultResource("myfs-site.xml") it not hit yet, so we miss it in PigContext.properites. I'd like to fix the issue by updating PigContext.properites after the new default resources is added. Attached is a patch which update PigContext.properites again before we launch Pig job (the patch also includes several clean up for the configuration part). Can you try if it works for your issue?
        Daniel Dai made changes -
        Attachment PIG-3441-2.patch [ 12665846 ]
        Daniel Dai made changes -
        Assignee Daniel Dai [ daijy ]
        Hide
        Daniel Dai added a comment -

        Bhooshan Mogal, can you give a try?

        Show
        Daniel Dai added a comment - Bhooshan Mogal , can you give a try?
        Daniel Dai made changes -
        Attachment PIG-3441-3.patch [ 12666377 ]
        Hide
        Bhooshan Mogal added a comment -

        Hi Daniel Dai,

        Yes, this patch seems to fix the issue for me. I'm doing some more tests and will get back to you with the results soon.

        Sorry for the delay, I was repeatedly hitting http://0x01af.blogspot.com/2014/07/mavenivy-considered-harmful-1consider.html in my pig build. Finally managed to get around it.

        Thanks,
        Bhooshan

        Show
        Bhooshan Mogal added a comment - Hi Daniel Dai , Yes, this patch seems to fix the issue for me. I'm doing some more tests and will get back to you with the results soon. Sorry for the delay, I was repeatedly hitting http://0x01af.blogspot.com/2014/07/mavenivy-considered-harmful-1consider.html in my pig build. Finally managed to get around it. Thanks, Bhooshan
        Hide
        Bhooshan Mogal added a comment -

        Apologies, please disregard my previous comment. I can still see the same issue even after applying this patch. I forgot that I had a workaround in my code to do a conf.addResource if Configuration.addDefaultResource does not work. After disabling that workaround, pig can still not find the default resources.

        I then tried removing the default 'false' passed to the Configuration constructor in ConfigurationUtil.toConfiguration and that seems to work again, like before.

        I'm still trying to understand the flow here, but its difficult because ConfigurationUtil.toConfiguration seems to be called multiple times for different reasons.

        Show
        Bhooshan Mogal added a comment - Apologies, please disregard my previous comment. I can still see the same issue even after applying this patch. I forgot that I had a workaround in my code to do a conf.addResource if Configuration.addDefaultResource does not work. After disabling that workaround, pig can still not find the default resources. I then tried removing the default 'false' passed to the Configuration constructor in ConfigurationUtil.toConfiguration and that seems to work again, like before. I'm still trying to understand the flow here, but its difficult because ConfigurationUtil.toConfiguration seems to be called multiple times for different reasons.
        Hide
        Daniel Dai added a comment -

        Let's first figure out when does myfs-site.xml get injected. In HExecutionEngine.init, Pig will create a Configuration with defaults (getExecConf), and put all entries in PigContext.properties. And everytime Pig need a Configuration, create Configuration without defaults and inject all PigContext.properties. What I find is when Pig invokes HExecutionEngine.init, some default resources is not yet get added, and what I do in the patch is to create a Configuration object in later stage and inject missing entries (MapReduceLauncher.launchPig). So let's check:
        1. First tell me is myfs-default.xml/myfs-site.xml default resource (using Configuration.addDefaultResource) or regular resource (using Configuration.addResource)
        2. After HExecutionEngine.init, does PigContext.properties contains your entry?
        3. In MapReduceLauncher.launchPig line "ConfigurationUtil.updateProperties(HExecutionEngine.getExecConf(pc.getProperties()), pc.getProperties());", does Configuration object created in getExecConf has your resource? And after updateProperties, does PigContext.properties contains your entry?

        Show
        Daniel Dai added a comment - Let's first figure out when does myfs-site.xml get injected. In HExecutionEngine.init, Pig will create a Configuration with defaults (getExecConf), and put all entries in PigContext.properties. And everytime Pig need a Configuration, create Configuration without defaults and inject all PigContext.properties. What I find is when Pig invokes HExecutionEngine.init, some default resources is not yet get added, and what I do in the patch is to create a Configuration object in later stage and inject missing entries (MapReduceLauncher.launchPig). So let's check: 1. First tell me is myfs-default.xml/myfs-site.xml default resource (using Configuration.addDefaultResource) or regular resource (using Configuration.addResource) 2. After HExecutionEngine.init, does PigContext.properties contains your entry? 3. In MapReduceLauncher.launchPig line "ConfigurationUtil.updateProperties(HExecutionEngine.getExecConf(pc.getProperties()), pc.getProperties());", does Configuration object created in getExecConf has your resource? And after updateProperties, does PigContext.properties contains your entry?
        Hide
        Bhooshan Mogal added a comment -

        Sorry for the delay. I debugged this issue further last week. I have created a project at https://github.com/bdmogal/pig-3441 to demonstrate this issue. I have a dummy MyFileSystem.java class that adds a resource myfs-site.xml to default resources and a pig script src/main/bin/myfs.pig that invokes this class.

        We tested this last week and found that properties from myfs-site.xml are not found by pig with or without applying this patch on trunk.

        If you then add myfs-site.xml using Configuration.addResource like here, the properties become available.

        Daniel Dai could you please try this and let me know?

        Show
        Bhooshan Mogal added a comment - Sorry for the delay. I debugged this issue further last week. I have created a project at https://github.com/bdmogal/pig-3441 to demonstrate this issue. I have a dummy MyFileSystem.java class that adds a resource myfs-site.xml to default resources and a pig script src/main/bin/myfs.pig that invokes this class. We tested this last week and found that properties from myfs-site.xml are not found by pig with or without applying this patch on trunk. If you then add myfs-site.xml using Configuration.addResource like here , the properties become available. Daniel Dai could you please try this and let me know?
        Hide
        Daniel Dai added a comment -

        I see the exception. The new resource gets introduced in LogicalPlanBuilder.buildLoadOp when we process the load statement. When we create LOLoad, we will read schema from MyFileSystem, and MyFileSystem do expect the entry in myfs.xml there, here is the stack:

        Caused by: java.lang.IllegalStateException: This is the error mentioned in PIG-3441
        	at pig.pig3441.MyFileSystem.initialize(MyFileSystem.java:38)
        	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        	at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70)
        	at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53)
        	at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109)
        	at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:189)
        	at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:538)
        	at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
        	at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
        	at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:885)
        

        LogicalPlanBuilder.buildLoadOp is after HExecutionEngine.init, where we inject all the configuration in. so the new entries in myfs-site.xml are not there. By the time we check missing resources again in MapReduceLauncher.launchPig, it is too late. The reason your fix works is because in HDataStorage.init:67, we create configuration with defaults and then pass to FileSystem.get().

        Before we make potentially disrupting change to always create configuration with defaults, can we first introduce a config which you can pass additional resource file (myfs-site.xml)?

        Show
        Daniel Dai added a comment - I see the exception. The new resource gets introduced in LogicalPlanBuilder.buildLoadOp when we process the load statement. When we create LOLoad, we will read schema from MyFileSystem, and MyFileSystem do expect the entry in myfs.xml there, here is the stack: Caused by: java.lang.IllegalStateException: This is the error mentioned in PIG-3441 at pig.pig3441.MyFileSystem.initialize(MyFileSystem.java:38) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:70) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:53) at org.apache.pig.builtin.JsonMetadata.findMetaFile(JsonMetadata.java:109) at org.apache.pig.builtin.JsonMetadata.getSchema(JsonMetadata.java:189) at org.apache.pig.builtin.PigStorage.getSchema(PigStorage.java:538) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175) at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89) at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:885) LogicalPlanBuilder.buildLoadOp is after HExecutionEngine.init, where we inject all the configuration in. so the new entries in myfs-site.xml are not there. By the time we check missing resources again in MapReduceLauncher.launchPig, it is too late. The reason your fix works is because in HDataStorage.init:67, we create configuration with defaults and then pass to FileSystem.get(). Before we make potentially disrupting change to always create configuration with defaults, can we first introduce a config which you can pass additional resource file (myfs-site.xml)?
        Daniel Dai made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]

          People

          • Assignee:
            Daniel Dai
            Reporter:
            Bhooshan Mogal
          • Votes:
            2 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development