Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

      Description

      This JIRA discusses issues relating to the configuration of Pig.

      Use cases:

      1. I want to configure Pig programmatically from Java.
      Motivation: Pig can be embedded in another Java program, so configuration should be settable by the client code.
      2. I want to configure Pig from the command line.
      3. I want to configure Pig from the Pig shell (Grunt).
      4. I want Pig to remember my configuration for every Pig session.
      Motivation: to save me typing the same configuration every time.
      5. I want Pig to remember my configuration for this script.
      Motivation: I use a common configuration for 50% of my Pig scripts; can this configuration be shared between scripts?
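Use case 1 might look like the sketch below. This is only an illustration of the intent: the property keys and the `PigServer(ExecType, Properties)` call shown in the comment are hypothetical stand-ins for whatever API this issue settles on, not the confirmed interface.

```java
import java.util.Properties;

public class EmbeddedPigConfig {
    // Build the configuration the embedding application wants to hand to Pig.
    // Key names here ("exectype", "cluster") are assumptions for illustration.
    public static Properties clientConfig() {
        Properties props = new Properties();
        props.setProperty("exectype", "mapreduce");
        props.setProperty("cluster", "jobtracker.example.com:50020");
        return props;
    }

    public static void main(String[] args) {
        Properties props = clientConfig();
        // With the kind of API proposed here, client code would then do
        // something like:
        //   PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        // so the configuration lives in PigContext rather than in System
        // properties.
        System.out.println(props.getProperty("exectype"));
    }
}
```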

      Current Status:

      • Pig uses System properties for some configuration
      • A configuration properties object in PigContext is not used.
      • pigrc can contain properties
      • Configuration properties cannot be set from Grunt

      Proposed solutions to use cases:
      1. Configuration should be set in PigContext and accessible from client code.
      2. System properties are copied to PigContext, or can be specified on the command line (duplicating System properties).
      3. Allow configuration properties to be set using the "set" command in Grunt.
      4. pigrc can contain properties. Is this enough, or should other configuration (e.g. aliases, imports) be settable as well?
      5. Add an include directive to Pig, to allow a shared configuration/Pig script to be included.
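Proposed solution 3 (a "set" command in Grunt) could be as simple as splitting the input line and writing into the configuration properties. A minimal sketch, assuming a hypothetical `parseSet` helper and the `set key value` syntax; this is not the actual Grunt implementation:

```java
import java.util.Properties;

public class GruntSetCommand {
    // Parse a Grunt line of the form "set key value" into the given
    // properties. Returns true if the line was a set command, false otherwise.
    public static boolean parseSet(String line, Properties conf) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length != 3 || !parts[0].equalsIgnoreCase("set")) {
            return false;
        }
        conf.setProperty(parts[1], parts[2]);
        return true;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        parseSet("set debug on", conf);
        System.out.println(conf.getProperty("debug"));
    }
}
```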

      Connections to Shell scripting:

      • The source command in Bash allows another bash script file to be included - this allows shared variables to be set in one file shared between a set of scripts.
      • Aliases can be set, according to user preferences, etc.
      • All this can be done in your .bashrc file

      Issues:

      • What happens when you change a property after the property has been read?
      • Can Grunt read a pigrc containing various statements etc before the PigServer is completely configured?

        Attachments

      1. TEST-org.apache.pig.test.TestStreaming.txt
        45 kB
        Alan Gates
      2. TEST-org.apache.pig.test.TestStreaming.txt
        41 kB
        Pi Song
      3. TEST-org.apache.pig.test.PigContextTest.txt
        5 kB
        Alan Gates
      4. PIG-93-v02.patch
        17 kB
        Benjamin Francisoud
      5. PIG-93-v01.patch
        23 kB
        Benjamin Francisoud
      6. PIG-111-v06.patch
        100 kB
        Benjamin Francisoud
      7. PIG-111-v05.patch
        98 kB
        Benjamin Francisoud
      8. PIG-111-v04.patch
        90 kB
        Benjamin Francisoud
      9. pig-111_v13.patch
        153 kB
        Pi Song
      10. PIG-111_v12.patch
        153 kB
        Johannes Zillmann
      11. PIG-111_v_9_r641081.patch
        167 kB
        Johannes Zillmann
      12. PIG-111_v_8_r633244M.patch
        96 kB
        Stefan Groschupf
      13. PIG-111_v_7_r633244M.patch
        94 kB
        Stefan Groschupf
      14. PIG-111_v_3_sg.patch
        83 kB
        Stefan Groschupf
      15. PIG-111_v_11.patch
        146 kB
        Johannes Zillmann
      16. PIG_111_v10.patch
        149 kB
        Pi Song
      17. config.patch.1502
        6 kB
        Craig Macdonald
      18. before.png
        6 kB
        Benjamin Francisoud
      19. after.png
        5 kB
        Benjamin Francisoud


          Activity

          Stefan Groschupf added a comment -

          Yeah Thank you Alan!

          Pi Song added a comment -

          Yeehawww!!!!

          Johannes Zillmann added a comment -

          Thank you a lot Alan for taking care of this !

          Alan Gates added a comment -

          Fix checked in at revision 646988. I made a few minor changes to get it to work with .pigrc and with HOD.

          I tested that this code now works with a .pigrc file, with a pig.properties file in the jar or in the classpath, on a static HDFS cluster and with HOD.

          Thanks to all (Stefan, Pi, Johannes, and anyone else I missed) who contributed on this long and torturous patch.

          Craig Macdonald added a comment -

          +1 looks good, example pig.properties would be desirable

          Pi Song added a comment -

          +1 please

          Stefan Groschupf added a comment -

          please reinsert.

          Alan Gates added a comment -

          I'm testing this patch now. So far, all looks good. One question. At some point in the past, the patch included a template for conf/pig.properties. The latest version of the patch doesn't have this. Is there a reason for this? I thought it was helpful. If no one is opposed I'll re-insert it.

          Pi Song added a comment -

          New patch again

          • Merged with the current trunk (easy this time)
          • Always read and merge configurations from both .pigrc and pig.properties. pig.properties takes precedence.
          • All tests passed.
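The precedence rule described above (read both files, pig.properties wins) amounts to an overlay of two `java.util.Properties` objects. A minimal sketch with made-up keys, not the actual patch code:

```java
import java.util.Properties;

public class ConfigMerge {
    // Merge two property sets: entries in 'winner' override entries in
    // 'loser'. In this issue, 'loser' would be .pigrc and 'winner' would be
    // pig.properties.
    public static Properties merge(Properties loser, Properties winner) {
        Properties merged = new Properties();
        merged.putAll(loser);
        merged.putAll(winner);
        return merged;
    }

    public static void main(String[] args) {
        Properties pigrc = new Properties();
        pigrc.setProperty("exectype", "local");
        pigrc.setProperty("debug", "on");
        Properties pigProperties = new Properties();
        pigProperties.setProperty("exectype", "mapreduce");
        Properties merged = merge(pigrc, pigProperties);
        // pig.properties wins the conflict; the .pigrc-only key survives.
        System.out.println(merged.getProperty("exectype"));
        System.out.println(merged.getProperty("debug"));
    }
}
```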
          Johannes Zillmann added a comment -

          Pi, I forgot to add the "new" files. I've appended a new, hopefully more complete patch.

          Pi Song added a comment -

          Johannes,

          Your fix to the test looks good but I cannot build after applying the patch. Some classes are missing (e.g. ConfigurationUtil, PropertiesUtil).

          BTW, the other thing that I thought might be a road block is non-intrusive. We should have a good chance to get this in.

          Pi Song added a comment -

          I will have a look soon, but I feel we're hitting a road block again.

          Johannes Zillmann added a comment -

          Hi Pi,

          Sorry I left you alone with this, but I was very busy with other things...
          I applied your latest patch to the latest trunk and did the following:

          • fixed a bug in HDataStorage where a created exception was actually not thrown.
          • fixed the creation of the PigServer in TestStreaming. (Creating the PigServer for execution mode MAPRED with the miniCluster requires passing the miniCluster's properties to the PigServer.
            For me TestStreaming works now!)
          • changed the PigServer constructors regarding loading properties from file (they are now only loaded from file if they are not passed in as an argument; please review!)

          Find attached PIG-111_v_11.patch

          Pi Song added a comment -

          Removing "patch available" until the shipping problem is sorted out.

          Pi Song added a comment -

          Attached is the test report; only 3 cases are failing now.

          Pi Song added a comment -

          Unfortunately PIG-181 doesn't fix the problem. Need more investigation.

          Pi Song added a comment -

          This is such a tiring job. I have generated a new patch in-sync with the current trunk.

          FIXED: PigContextTest failed because of the alias "input", which has become a reserved word in Pig after Streaming. I feel the reserved words "input", "output", and "error" are alias names people commonly use in data processing work.

          NOT FIXED: The TestStreaming failures should be due to PIG-181. All the errors on my machine are from shipping, and all the error messages look exactly like what happens in PIG-181 (null reference) ==> should be fixed as part of PIG-181.

          Alan, I agree that we should have a translation layer from generic config to Hadoop-specific config. For the time being we still have only the Hadoop + local backends, so missing it shouldn't cause too much trouble. I suggest trying to get this in first; we will get HConfiguration back soon.
          Reasons:
          1. This patch has been around for too long and has consumed too much effort from too many people.
          2. It's a main road block for other work that requires configuration.
          3. Currently so many basic configuration options don't work, which is very bad for users, especially first-timers.

          PLEASE TEST WELL BEFORE COMMIT

          Pi Song added a comment -

          Same cause

          Pi Song added a comment -

          I don't spot anything "so wrong" in this patch and prefer to do things incrementally. The only issue would be Alan's concern about breaking changes to the existing user base.

          If there is something that can be fixed later please just create a Jira issue and let this go. Not being able to change simple configurations in the current Pig creates a very bad perception for end-users.

          One lesson: for a big patch, don't also reformat code or fix typos, as the patch gets even bigger and harder to review.

          Alan Gates added a comment -

          Johannes,

          Thanks for continuing the work on this. When I run your patch, it fails two of the unit tests, PigContextTest and TestStreaming. I've attached the logs for those two runs.

          Johannes Zillmann added a comment -

          Hi Alan,
          I took Stefan's patch, regenerated it against the latest sources, and addressed:
          (1) - print out a warning
          (3) - set jobtracker/namenode not to local if not set

          (2) I leave unaddressed, since I personally think that the two configuration objects java.util.Properties and org.apache.hadoop.conf.Configuration are enough, and a third one would be more confusing than helpful.
          But I can address (2) too, if needed.

          The patch is PIG-111_v_9_r641081.patch

          Olga Natkovich added a comment -

          Based on the last review, this patch is not ready to be committed.

          Alan Gates added a comment -

          Stefan,

          Sorry it took me so long to get back to this.

          I have some issues with this patch that I should have raised earlier.

          1) While I'm fine with saying that going forward .pigrc (bash style) is not our preferred method, we cannot remove it without warning now. We have users who depend on it, so we need to give them some warning before it vanishes. For now, if we find a .pigrc we can issue a warning saying that it is deprecated and won't be supported at some future time.

          2) Why did you remove HConfiguration? I am fine with saying that a Properties object is the way we communicate with the backend. But HConfiguration is not exposed outside of the hadoop specific packages. And it just extends Properties and handles the translation between properties and hadoop configuration. I don't see a problem with that. Am I missing something? As a general rule we should not remove classes unless there is a strong reason to do so.

          3) Currently, when a hadoop JobConf object is constructed, if a hadoop-site.xml file is in the classpath, the values from that file are picked up and used to define the parameters for the JobConf. Your change in HExecutionEngine to set the namenode explicitly to local if it had not been set (around line 115) interferes with this. If a user defines his cluster in hadoop-site.xml instead of pig.properties, and you then explicitly set the cluster to local (because hadoop-site.xml won't get picked up until the JobConf is constructed later), this causes his map reduce job to try to run locally (as the reading of hadoop-site.xml doesn't overwrite the already-set values). I'm not sure this is something we want to change. Hadoop will always have configuration values that users will want to set, and we don't want to import those into our config files (that is, we want to support using hadoop-site.xml). Perhaps the solution would be to not default to local if exectype is set to mapreduce.
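The fix suggested at the end of (3), defaulting to "local" only when the exectype itself is local, could be sketched as a small resolution helper. Names and the null-means-unset convention are hypothetical, not the actual HExecutionEngine code:

```java
public class ClusterDefaulting {
    // Decide the jobtracker setting: an explicit value always wins; local
    // mode defaults to "local"; in mapreduce mode we leave it unset (null)
    // so a later-constructed JobConf can still pick it up from
    // hadoop-site.xml on the classpath.
    public static String resolveJobTracker(String exectype, String configured) {
        if (configured != null && !configured.isEmpty()) {
            return configured;
        }
        if ("local".equalsIgnoreCase(exectype)) {
            return "local";
        }
        return null;
    }
}
```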

          Stefan Groschupf added a comment -

          Hi Pi,
          splitting this patch is difficult, since it is really just this one change: switching from HConfiguration to Properties and changing the related code.
          Actually it is pretty simple; only HDataStorage looks strange, since it is strangely encoded in SVN.

          Pi Song added a comment -

          Stefan,
          Sorry I don't have time to review this patch as I'm working on other areas. But is it possible to split this patch up into smaller patches? That might help ease some of your pain. Plus, I think this patch is important, and there are other patches that rely on it. In the short term it will be difficult for us to make progress elsewhere if we cannot make progress on this.

          Stefan Groschupf added a comment -

          Pi,
          I agree with "Convention-over-Configuration". My patch is surely not the final solution, but from my point of view it brings us close.

          Pi Song added a comment -

          Regarding this bit,

          Alan Gates:
          Also, I was pondering how to include hadoop specific data here. Right now pig attaches to a cluster by reading a hadoop config file. Obviously we don't want this in our general config file. But maybe the file to be referenced should be referred to in this config file. Or maybe it's ok to just pick the hadoop-site.xml up off the classpath, as is done now. The modification would then be that we only do it if we're in mapreduce mode. Thoughts on this?

          This is just throwing out my idea.
          I propose "Convention-over-Configuration".

          General Config

          • Pig looks for the generic config file in the classpath (the name is hardcoded)
          • The user can override the filename/path on the startup command line

          Backend Specific Config

          • The backend looks for the default filename in the classpath (the name is hardcoded)
          • The user can override the filename/path in the generic config file or on the startup command line.
          • Of course, we pass general config key-values to the backend, so the backend can also derive some key-values from the general config
          • (Obvious case) If the user specifies config keys that are wrong for the running backend, we just ignore them
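The convention-over-configuration lookup described in the lists above (hardcoded classpath name, overridable from the command line) could be sketched as follows. The default file name and the missing-file-is-empty behavior are assumptions for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ConventionConfigLoader {
    // Hardcoded default name looked up on the classpath (hypothetical).
    static final String DEFAULT_NAME = "pig.properties";

    // Resolve which file name to load: a command-line override wins,
    // otherwise fall back to the hardcoded convention.
    public static String resolveName(String commandLineOverride) {
        return (commandLineOverride != null && !commandLineOverride.isEmpty())
                ? commandLineOverride
                : DEFAULT_NAME;
    }

    // Load the resolved file from the classpath; a missing file is not an
    // error, the convention just yields an empty configuration.
    public static Properties load(String commandLineOverride) throws IOException {
        Properties props = new Properties();
        String name = resolveName(commandLineOverride);
        try (InputStream in = ConventionConfigLoader.class
                .getClassLoader().getResourceAsStream(name)) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }
}
```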

          Moreover, I think the interactions between Common-Logging and Log4J is exactly the same as what we're trying to achieve. We might just look at it and apply the same concept.

          BTW Stefan, thanks for your hard work. I really appreciate that (because I had the same experience fixing a big patch over and over again before).

          Stefan Groschupf added a comment -

          This patch fixes reported problems by Alan (executionMode, cluster, getConfiguration, updateConfiguration).

          Stefan Groschupf added a comment -

          Hi Alan,
          unfortunately jira is not available, so I have to respond by mail. I will attach the text to the issue later for reference.

          <Alan>
          You modified HExecutionEngine and LocalExecutionEngine to not have updateConfiguration and getConfiguration, but you didn't change the abstract class ExecutionEngine in the same way. Did you intend to remove it from ExecutionEngine as well. Also, why did you want to remove these features? I think it's fine to not implement them for now (as they weren't in Local), but if we want to remove we need a good reason.

          </Alan>
          Sorry, I messed this up at the last second during patch generation and didn't notice it. I rolled it back in, since it looks like you don't want to remove it (see next patch version). However, I had planned to remove it.
          Here are my thoughts:
          I personally think interfaces, especially at such an early stage of a project, should be driven by the implementations, not by what we might or might not need some day in the distant future. Having those interfaces and not implementing them in all our implementations seems wrong: we carry around code we do not need but must maintain, and users get the wrong impression, e.g. that updating configuration during runtime is possible. Hadoop does not allow changing configurations during runtime (or only in a very limited way); we can only reconnect to a different hadoop cluster or so.
          The getConfiguration method does not make sense in the execution engine, since the PigContext holds all configuration properties; that's why it is a context. Having such a method in the ExecutionEngine again gives the wrong impression to a user reading the javadoc: that an ExecutionEngine can have a different configuration from the context, which is not the case.

          <Alan>
          Another couple of questions. My expectation in looking at the patch is that by setting exectype in the config file, Main should then respond by starting PigServer with that exectype. But that doesn't happen.
          </Alan>
          Sorry, that is a bug; it is fixed in my next patch version.
          <Alan>
          Similarly if I set the cluster value to the name and port of a job tracker, I was expecting it to attach to that cluster.
          </Alan>
          Also fixed in the next patch version - thanks for catching this.
          <Alan>
          You changed the default mode from mapreduce to local. We shouldn't make interface changes like that without discussion.
          </Alan>
          I'm sorry, you are right, we should discuss this first; I wasn't aware that the default mode is map reduce, which makes no sense from my point of view. I think the user experience should be: download, unzip, start writing pig latin, out of the box. No extra configuration, no additional installation. Therefore I think local execution mode, or map reduce using the hadoop local jobclient, should be the default.
          <Alan>
          Also, I was pondering how to include hadoop specific data here. Right now pig attaches to a cluster by reading a hadoop config file. Obviously we don't want this in our general config file. But maybe the file to be referenced should be referred to in this config file. Or maybe it's ok to just pick the hadoop-site.xml up off the classpath, as is done now. The modification would then be that we only do it if we're in mapreduce mode. Thoughts on this?
          </Alan>
          What exactly do we need in our configuration that is Hadoop specific? The namenode and jobtracker. Everything else should not be relevant for us and will be overwritten in the job configuration when we submit a job, or do I misunderstand something here?
          These two properties can easily live in our properties file, and actually I would prefer we call it mapred.job.tracker and not cluster, since again that would be easier for the user to understand. For different execution engines we need different sets of parameters anyway.

          In case we need the hadoop configuration files in the classpath, I would suggest making this optional and extending the shell script with a new parameter like -hadoopHome=/myHadoopFolder

          Thanks for taking the time to look into this patch!
          Stefan

          Stefan Groschupf added a comment -

          I will work on an improved patch to address these issues today.
          Thanks for your feedback.

          Alan Gates added a comment -

          Another couple of questions. My expectation in looking at the patch is that by setting exectype in the config file, Main should then respond by starting PigServer with that exectype. But that doesn't happen. In looking at the code I see that exectype is never used. If I set the exectype with a command line argument instead, then it works. Similarly, if I set the cluster value to the name and port of a job tracker, I was expecting it to attach to that cluster. But it does not. In order to get it to attach to my cluster I have to include my hadoop-site.xml from the cluster in my classpath. Are these features there for the future, but not yet live? If so, we should comment them out for now.

          You changed the default mode from mapreduce to local. We shouldn't make interface changes like that without discussion.

          Also, I was pondering how to include hadoop specific data here. Right now pig attaches to a cluster by reading a hadoop config file. Obviously we don't want this in our general config file. But maybe the file to be referenced should be referred to in this config file. Or maybe it's ok to just pick the hadoop-site.xml up off the classpath, as is done now. The modification would then be that we only do it if we're in mapreduce mode. Thoughts on this?

          Alan Gates added a comment -

          Stefan,

          You modified HExecutionEngine and LocalExecutionEngine to not have updateConfiguration and getConfiguration, but you didn't change the abstract class ExecutionEngine in the same way. Did you intend to remove it from ExecutionEngine as well? Also, why did you want to remove these features? I think it's fine not to implement them for now (as they weren't in Local), but if we want to remove them we need a good reason.

          Stefan Groschupf added a comment -

          It took me significantly more time - almost 8 hours - to regenerate this patch than to create it the first time. Anyway, here it is against r633244. I would be very happy if I do not need to do it again.
          I noticed that something is wrong with the encoding of HDataStorage.java in the repository.
          Each line has two line breaks (was this file edited with a Windows text editor?). I wasn't able to fix that in my patch file, which is why HDataStorage looks strange in the patch file. All tests pass, at least on my laptop.
          If this patch makes it into svn, I would love to work on getting the test suite running in less than 5 minutes instead of 15 minutes every time. This is painful.

          Benjamin Francisoud added a comment -

          Stefan, can you regenerate it from PIG-111-v06.patch to include the unit tests (and move them to the test package) and other improvements (tabs replaced by spaces, etc.)? Thanks.

          Stefan Groschupf added a comment -

          I will regenerate the patch asap.

          Olga Natkovich added a comment -

          Clearing the patch-ready flag since the patch needs a bit of rework.

          Alan Gates added a comment -

          Wow, what a patch to review. I have a few comments and issues:

          First the comments:
          The config file conf/pig.properties references conf/log4j.properties, though that file doesn't exist in SVN at this point. Does this patch depend on another patch?

          Why was @SuppressWarnings added in Main.java?

          I know we had some discussion earlier on unit testing and one of the points mentioned was that unit tests should be moved to reside in the packages of the classes they test. However, we haven't yet as a group agreed to that. So until we do, let's keep the tests in org.apache.pig.test (PigContextTest was put in pig.impl).

          Now, some issues:

          1) I ran the unit tests, and saw several failures.

          Testcase: testFunctionInsideFunction took 18.472 sec
          Testcase: testJoin took 11.915 sec
          Testcase: testDriverMethod took 11.831 sec
          Testcase: testMapLookup took 11.85 sec
          Testcase: testBagFunctionWithFlattening took 14.012 sec
          Testcase: testSort took 0.35 sec
          Caused an ERROR
          Bad mapred.job.tracker: local
          java.lang.RuntimeException: Bad mapred.job.tracker: local
          at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:711)
          at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:143)
          at org.apache.pig.impl.PigContext.connect(PigContext.java:154)
          at org.apache.pig.PigServer.<init>(PigServer.java:128)
          at org.apache.pig.test.TestEvalPipeline.testSortDistinct(TestEvalPipeline.java:284)
          at org.apache.pig.test.TestEvalPipeline.testSort(TestEvalPipeline.java:266)

          Testcase: testDistinct took 0.387 sec
          Caused an ERROR
          Bad mapred.job.tracker: local
          java.lang.RuntimeException: Bad mapred.job.tracker: local
          at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:711)
          at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:143)
          at org.apache.pig.impl.PigContext.connect(PigContext.java:154)
          at org.apache.pig.PigServer.<init>(PigServer.java:128)
          at org.apache.pig.test.TestEvalPipeline.testSortDistinct(TestEvalPipeline.java:284)
          at org.apache.pig.test.TestEvalPipeline.testDistinct(TestEvalPipeline.java:271)

          and

          Testcase: testBigGroupAll took 18.651 sec
          Testcase: testStoreFunction took 11.995 sec
          Testcase: testQualifiedFuncions took 11.834 sec
          Testcase: testDefinedFunctions took 11.78 sec
          Testcase: testPigServer took 0.045 sec
          Caused an ERROR
          Bad mapred.job.tracker: local
          java.lang.RuntimeException: Bad mapred.job.tracker: local
          at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:711)
          at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:143)
          at org.apache.pig.impl.PigContext.connect(PigContext.java:154)
          at org.apache.pig.PigServer.<init>(PigServer.java:128)
          at org.apache.pig.test.TestMapReduce.testPigServer(TestMapReduce.java:212)

          2) When I tried to run pig with this change, I could only get it to run in the local mode. In pig.properties I set exectype=mapreduce and cluster=<my jobtracker> and it still only ran in local mode.

          Eugene Kirpichov added a comment -

          I am trying to apply the patches to make the grunt launchable and, since I've never done that before, I'm having problems, like, 'patch' asks me to manually enter names of files (all the time), and it tells things like 'Hunk #1 FAILED' (~30% of the time).

          I guess the second thing definitely has to do with the fact that I'm applying the patches either to the wrong revision, or in the wrong order, or both, whereas the first issue seems to do with my lack of, ehm, patching skills.

          So, my question is: to which revision and in what order should I apply these patches? Or can I find out this information from the patch files by looking at them or processing with some script?

          Benjamin Francisoud added a comment - - edited

          v06.patch: Removed the constructor Stefan was mentioning in his comment

          But my previous question about removing some more constructors is still pending

          Benjamin Francisoud added a comment -

          I counted 3 constructors where a new Properties() object is created by PigServer...

          I agree with you that constructor signature should state clearly that a configuration (here a properties object) is mandatory and should be provided by the client code using PigServer.

          Do you think we should remove those 3 constructors to be consistent and have a clearer/simpler API?

          public PigServer(String execType) throws ExecException, IOException {
              this(parseExecType(execType), new Properties());
          }
          public PigServer() throws ExecException {
              this(ExecType.MAPREDUCE, new Properties());
          }
          public PigServer(ExecType execType) throws ExecException, IOException {
              this(execType, new Properties());
          }

          That would leave those 2 constructors in PigServer:

          public PigServer(ExecType execType, Properties properties) throws ExecException {
              PropertiesUtil.loadPropertiesFromFile(properties);
              this.pigContext = new PigContext(execType, properties);
              pigContext.connect();
          }
          public PigServer(PigContext context) throws ExecException {
              this.pigContext = context;
              pigContext.connect();
          }
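
          The merge behavior implied by PropertiesUtil.loadPropertiesFromFile(properties) in the first constructor can be sketched in plain JDK code. This is only an illustration of one plausible semantic - that values from a pig.properties file fill in gaps without overriding what the caller set explicitly; the actual behavior of PropertiesUtil is not shown in this thread, and the keys below are just examples from the discussion:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class LoadPropsSketch {

    // Merge file-provided defaults into the caller's Properties,
    // keeping any value the caller set explicitly.
    static Properties mergeDefaults(Properties explicit, Properties defaults) {
        for (String name : defaults.stringPropertyNames()) {
            explicit.putIfAbsent(name, defaults.getProperty(name));
        }
        return explicit;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of conf/pig.properties (hypothetical keys).
        Properties fromFile = new Properties();
        fromFile.load(new StringReader("exectype=local\ncluster=tracker.example.com:50020\n"));

        Properties props = new Properties();
        props.setProperty("exectype", "mapreduce"); // the caller's explicit choice

        mergeDefaults(props, fromFile);
        System.out.println(props.getProperty("exectype")); // mapreduce (not overridden)
        System.out.println(props.getProperty("cluster"));  // tracker.example.com:50020
    }
}
```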
          Stefan Groschupf added a comment -

          Benjamin,
          great thanks!
          Only a minor thing - I would personally prefer that there be no constructor PigServer(ExecType execType) (without properties).
          PigServer should not handle configuration at all - not even creating an empty properties object. This should all happen in the class that uses the PigServer.
          This would document in the API that the PigServer needs to be configured, even if it is just with an empty properties object.
          Just my 2 cents, but it is not that important - it may just be a question of style.
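
          A minimal sketch of the calling pattern described above, where the client code owns the Properties object and hands it to Pig. Only java.util.Properties is used; the PigServer hand-off is shown in a comment since it needs the Pig classes on the classpath, and the "exectype" key is taken from this thread's discussion:

```java
import java.util.Properties;

public class ClientConfigSketch {

    // The client assembles its own configuration - even an "empty"
    // configuration is an explicit, visible decision of the caller.
    static Properties clientProps() {
        Properties props = new Properties();
        props.setProperty("exectype", "local");
        return props;
    }

    public static void main(String[] args) {
        Properties props = clientProps();
        // Handed to Pig by the caller, e.g.:
        // PigServer pig = new PigServer(ExecType.LOCAL, props);
        System.out.println(props.getProperty("exectype")); // local
    }
}
```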

          So I would be very happy to see this improvement going into trunk soon.

          Benjamin Francisoud added a comment -

          I tested the patch and it works for me

          I didn't test every possible configuration (especially grunt)
          so if someone could also take a look, it's a big patch...

          v05.patch is the same as v04.patch, but I added the PigContextTest.java unit test to check that PIG-93 has been fixed.

          Note: you will need to add the folder test/.../pig/impl before applying the patch.

          Benjamin Francisoud added a comment -

          As Stefan said in a TODO in his patch, I don't see where those methods are used either:

          DataStorage#updateConfiguration(Properties newConfiguration)
          HDataStorage#updateConfiguration(Properties newConfiguration)
          LocalDataStorage#updateConfiguration(Properties newConfiguration)
          

          Maybe we could remove them, since they don't seem to be used anywhere and it is a bit confusing... what do you think?

          Benjamin Francisoud added a comment -

          Stefan> Just on a code review point of view, this looks exactly like the way I would have dreamed it

          Benjamin Francisoud added a comment -

          Didn't test the patch yet, but here's a slightly modified version of PIG-111_v_3_sg.patch:

          • replaced tabs with space for code indentation
          • added constructor PigServer(ExecType execType) (no properties parameter) to resolve the patch conflict between PIG-101/PIG-111
          • resolved conflict between PIG-101/PIG-111 in test classes
          • replaced duplicated code in some tests from:
            public testSomething1 { PigServer pig = new PigServer(MAPREDUCE, cluster.getProperties());}
            public testSomething2 { PigServer pig = new PigServer(MAPREDUCE, cluster.getProperties());}
            ...
            

          to:

              private PigServer pig;
              
              @Before
              @Override
              protected void setUp() throws Exception {
                  pig = new PigServer(MAPREDUCE, cluster.getProperties());
              }
          
          Stefan Groschupf added a comment -

          Benjamin: it should fix PIG-93 and PIG-121 as well. In general it should be very easy to do all kinds of configuration for the underlying Hadoop system via the API, as long as you know the keys you need to define.

          Benjamin Francisoud added a comment -

          I'm catching up on pig jiras and the mailing list, so I might have missed something... but from what I read here:

          • I just learned from Alan's comment in PIG-115 and from Pi Song's comment that pig is intended to be used for other parallel computing systems (I wasn't aware of that; I was missing the big picture)

          I will try to avoid using (or at least reduce the use of) hadoop's classes in my next patches.

          • Stefan, I agree with your remarks. Actually I started to do a patch as you explained, but the changes were becoming so big that I was afraid I couldn't make a patch without regressions, so I decided to keep the changes smaller and more restricted.

          I'll try to review your patch to see if it fixes my other, related problem, PIG-93 - thanks

          Olga Natkovich added a comment -

          done

          Stefan Groschupf added a comment -

          It looks like I cannot edit the issue and set "patch available" when I did not create the issue. Could someone with jira admin rights please give me permissions or set this flag? Thanks.

          Stefan Groschupf added a comment -

          Here is my suggestion. I'm sorry it got a little bigger than expected. I know it is painful to review and discuss larger patches, but configuration is a very important part.
          Hadoop had big problems in the beginning since they made some mistakes early (the configuration was static). In general I'm very much in favor of Inversion of Control and constructor-based injection of concrete configuration values, but I understand that if we want to switch ExecutionEngine implementations easily we can't do that.

          I noticed that pig actually only works because all configuration values are set as system properties and then read back via System.getProperty(). I have had very bad experiences using system properties in production environments, since it is not clear to the user what the values are - then you run a week-long job on the wrong cluster and all services go down.

          From my point of view these are the important points:
          + the configuration object itself has no dependencies - java.util.Properties would be the best choice from my point of view
          + the configuration is not static, so we pass a Properties instance around and do not use system properties at all
          + each execution engine implementation is itself responsible for converting the properties into a format the underlying technology understands (properties to hadoop configuration)
          + a default properties configuration file is part of our distribution (PIG_HOME/conf) and lists all possible configuration values (for documentation), but perhaps only sets required values by default
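
          The third point above - each engine converting the shared Properties into its backend's own format - can be sketched with plain JDK types. The Map here is only a stand-in for a backend-specific configuration object (for Hadoop this would be something like a JobConf); the key name follows the mapred.job.tracker suggestion from this thread:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class EngineConfigSketch {

    // Toy stand-in for the engine-specific conversion step: copy the
    // shared, dependency-free Properties into whatever the backend expects.
    static Map<String, String> toBackendConfig(Properties props) {
        Map<String, String> conf = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            conf.put(name, props.getProperty(name));
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("mapred.job.tracker", "tracker.example.com:50020");
        Map<String, String> conf = toBackendConfig(props);
        System.out.println(conf.get("mapred.job.tracker")); // tracker.example.com:50020
    }
}
```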

          The attached patch implements those points. I had to change some APIs - I'm very sorry, but I personally think it is cleaner now. I also had to adjust the tests.
          I suggest applying the patch and reviewing the changed sources instead of reading the patch file.
          For sure this is just the starting point and we need further improvements in the sources - e.g. I suggest Grunt allow setting all kinds of properties, not just known ones.

          The patch is based on the patches done before for this issue.
          The patch is against r631358. At least on my box the test suite runs successfully.

          Pi Song added a comment -

          I agree with Stefan. HConfiguration should not be in PigContext, as this introduces a tight dependency on Hadoop.

          • Any idea on how to handle different backends in configuration?
          • Should frontend and backend configurations be separated and glued together with adaptors, as in commons-logging?

          In my opinion, the current configuration is not very effective. Anything better than the current one should be OK, at least for the short term.

          As long as we have the right direction and we follow it, multiple evolution steps are fine :P
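The adaptor idea above could look roughly like the sketch below. The names (`ConfigAdaptor`, `toBackendConf`, `HadoopConfigAdaptor`) are hypothetical, and a plain `Map` stands in for a backend-specific object such as Hadoop's `Configuration`, so the frontend only ever sees `java.util.Properties`:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical adaptor interface: the frontend holds only java.util.Properties,
// and each backend supplies its own conversion, as commons-logging does for loggers.
interface ConfigAdaptor {
    Map<String, String> toBackendConf(Properties props);
}

// A Hadoop-flavoured adaptor sketch; the Map stands in for
// org.apache.hadoop.conf.Configuration to keep this snippet dependency-free.
class HadoopConfigAdaptor implements ConfigAdaptor {
    public Map<String, String> toBackendConf(Properties props) {
        Map<String, String> conf = new HashMap<>();
        // Copy every frontend property into the backend configuration.
        for (String name : props.stringPropertyNames()) {
            conf.put(name, props.getProperty(name));
        }
        return conf;
    }
}
```

Each ExecutionEngine implementation would pick the adaptor matching its backend, keeping the configuration object itself free of backend dependencies.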

          Stefan Groschupf added a comment -

          The direction is very good and this is very important to do ASAP; however, some general thoughts regarding the topic of configuration and this patch:

          + storing a non-visible .pigrc within the user folder is a very bad idea. This is not transparent for the user.
          ++ configuration files should go into PIG_HOME/conf
          ++ the file should list at least all possible configuration values, even if they are commented out.
          ++ in Java, configuration files have the .properties or .xml extension // users will recognize faster that these are configuration files

          + method names like pigContext.getConfiguration().getConfiguration() are easy to misunderstand; how about pigContext.getConfiguration().toHadoopConf() or something similar.

          + in the context of removing Hadoop dependencies from Pig, the whole HConfiguration makes no sense, since it introduces Hadoop dependencies into a configuration object. Instead, HExecutionEngine and HDataStorage should convert the plain properties object into a Hadoop conf object.
          + the configuration object should be as generic and dumb as possible; ExecutionEngines and DataStorage should handle Hadoop-related stuff.

          To illustrate my suggestions I will contribute a patch soon.
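Loading the proposed defaults from PIG_HOME/conf could be as simple as the following sketch. The class name `PigConfLoader`, the file name passed in, and the fall-back-to-empty behaviour are all assumptions for illustration, not part of any attached patch:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class PigConfLoader {
    // Hypothetical loader: read PIG_HOME/conf/<fileName> if present,
    // returning an empty Properties (rather than failing) when the file is missing.
    static Properties loadDefaults(String pigHome, String fileName) {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(pigHome + "/conf/" + fileName)) {
            props.load(in);
        } catch (IOException e) {
            // No default file found: start from an empty configuration.
        }
        return props;
    }
}
```

A file shipped in PIG_HOME/conf listing every known key (commented out where no default applies) would then double as documentation, as suggested above.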

          Benjamin Francisoud added a comment -

          Same patch as PIG-93-v01.patch, but without an old unit test (I attached it by mistake).

          Benjamin Francisoud added a comment -

          This patch is almost the same as Craig's (but with slightly different method names).

          But it takes the refactoring even further, changing various interfaces as explained in the after.png class diagram, to simplify class interactions and life cycles.

          Benjamin Francisoud added a comment -

          This class diagram (after.png) describes what the relationships between classes would look like after the refactoring (I will provide a patch).

          As you can see, configuration information is now held only in PigContext.
          Some methods related to configuration have been moved to HConfiguration.
          getConfiguration has been removed from the ExecutionEngine interface.
          updateConfiguration has been replaced by the Properties.putAll() method, since it's already available.

          The overall picture is much clearer and easier to understand (at least in my opinion).

          I also created a getPropertyWithFallback(name) method in HConfiguration.
          It tries to get the property from the HConfiguration object itself. If there is no such property, it tries the one in System.getProperty. (This is easier to understand by looking at the patch I will provide.)
          This allows keeping the ExecutionEngine behavior almost unchanged.

          That's one possible option; what do you think of it?
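A minimal sketch of the fallback described above, assuming the method simply consults the locally held Properties before falling back to the JVM system properties (the class name `FallbackConfig` is hypothetical):

```java
import java.util.Properties;

class FallbackConfig {
    private final Properties props = new Properties();

    void setProperty(String name, String value) {
        props.setProperty(name, value);
    }

    // As described above: prefer the value held in this configuration object,
    // and only fall back to a JVM system property when it is absent.
    String getPropertyWithFallback(String name) {
        String value = props.getProperty(name);
        return (value != null) ? value : System.getProperty(name);
    }
}
```

This keeps existing code that reads System.getProperty working, while letting explicitly configured values take precedence.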

          Benjamin Francisoud added a comment -

          Craig's patch is heading in the right direction, in my opinion.

          I made a small class diagram of the current situation (before.png).

          As you can see, there are two places where configuration information is stored: PigContext#conf and ExecutionEngine#conf. We could reduce problems by removing one source and using the other everywhere.

          Also, the HConfiguration object could be made a first-class citizen of Pig by moving some code from PigContext to HConfiguration (the setJobtrackerLocation() methods etc.), since PigContext is already around 500 lines and does all the initialization of the ExecutionEngine and the connection.
          HConfiguration could be the class in charge of holding config values.

          Craig Macdonald added a comment -

          This patch is a start in the proposed direction: it addresses proposed solutions 1 and 3.


            People

            • Assignee:
              Stefan Groschupf
            • Reporter:
              Craig Macdonald
            • Votes: 1
            • Watchers: 1