Hadoop Map/Reduce
MAPREDUCE-2151

[rumen] Add a map of jobconf key-value pairs in LoggedJob

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: tools/rumen
    • Labels:
      None

      Description

      It'd be useful to retain application-level configuration settings in LoggedJob.

        Activity

        Ravi Gummadi added a comment -

        Would all the configuration settings whose values differ from those in XXX-default.xml be good enough?

        Amar Kamat added a comment -

        I think Hong is referring to configuration parameters that are likely to modify the behaviour of the job and tasks (e.g. mapred.child.*, mapreduce.job.*, etc.). Hong, is my understanding correct? Also, *-default.xml might not be available for reference comparison. Hence, for now we might need to identify (i.e. handpick) these configuration parameters and add them to the list of interesting properties.

        Ravi Gummadi added a comment -

        Hmm. But I guess we will need to bring in more and more configuration properties soon.
        Created MAPREDUCE-2153 to get the other needed configuration properties into the trace file.

        Also created MAPREDUCE-2152 to drop TraceBuilder's own handling of deprecated configuration properties in favour of the Configuration object.

        Hong Tang added a comment -

        >>I think Hong is referring to configuration parameters that are likely to modify the behaviour of the job and tasks (e.g. mapred.child.*, mapreduce.job.*, etc.).

        No, this is not what this jira intends to solve, but this jira could potentially help. Currently Rumen extracts from jobconf.xml some key-values specific to the map-reduce layer and converts them to regular primitive types. I think the extraction of mapred.child.*, mapreduce.job.*, etc. should continue along this path.

        However, we are starting to think about using Rumen output to analyze the performance of frameworks on top of map-reduce. One example is Pig. Pig will add more information to jobconf.xml to describe the features being used and compile-time statistics. We need a mechanism in Rumen to retain such information in an extensible way, and that is the primary purpose of this jira.
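
        For illustration only, here is a minimal sketch of reading a jobconf.xml into string key-value pairs without converting them to typed fields. It assumes the usual Hadoop configuration layout of <property> elements with <name> and <value> children; the class name JobConfXmlSketch and the sample pig.* key are hypothetical and are not taken from Rumen or Pig.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Rumen: keeps every property as a raw string.
class JobConfXmlSketch {
    /** Parses <property><name/><value/></property> entries into an ordered map. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                NodeList names = property.getElementsByTagName("name");
                NodeList values = property.getElementsByTagName("value");
                // Skip malformed entries that lack a name or a value.
                if (names.getLength() > 0 && values.getLength() > 0) {
                    result.put(names.item(0).getTextContent().trim(),
                               values.item(0).getTextContent());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse job configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>mapred.child.java.opts</name><value>-Xmx512m</value></property>"
                + "<property><name>pig.example.feature</name><value>JOIN</value></property>" // made-up key
                + "</configuration>";
        System.out.println(parse(xml));
    }
}
```

        Keeping the values as strings is what makes the mechanism extensible: a framework such as Pig can add new keys without Rumen needing a typed field for each one.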

        >>Also *-default.xml might not be available for reference comparison.

        Correct. That is the main reason we have to make each parsed LoggedJob instance self-contained.

        >>Hmm. But I guess we need to bring in more and more configuration properties soon.

        Yes, the set will grow, but not without bound. I think we can support extraction of properties based on exact match or prefixes.

        >>Created MAPREDUCE-2153 to get other needed configuration properties into the trace file.

        This seems to be in addition to MAPREDUCE-1658. I suggest you roll the two jiras into one (close MR-1658 and roll the work into MR-2153).

        >>Also created MAPREDUCE-2152 for avoiding TraceBuilder's own handling of deprecated configuration properties in favour of Configuration object.

        The purpose of this jira is to extend the set of key-values extracted by the jobconf parser and retain them as-is in the LoggedJob object. So I believe your point is relatively orthogonal to this jira. FWIW, I am a bit concerned about introducing this dependency between Rumen and MapReduce, because I think the handling of deprecated conf parameters is not really a core part of the MapReduce API and could be dropped in the future (which would lead us to move the code into Rumen, similar to the case of Pre21JobHistoryConstants).

        Hong Tang added a comment -

        Patch for Yahoo Hadoop 20.200.

        Hong Tang added a comment -

        What is included in the patch I uploaded:

        • Added a Map<String, String> field named "configuration" to LoggedJob.
        • Changed JobConfigurationParser to handle extraction based on prefix. It also handles a special case that extracts all key-value pairs (by specifying a null list of prefixes of interest).
        • Added a StringTrie that supports the partial matching needed by JobConfigurationParser.
        • Added unit tests for JobConfigurationParser and StringTrie.
        • Updated the existing unit tests (the golden files now include the configuration object).
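
        The behaviour described in the list above can be sketched roughly as follows. This is not the code from the patch: the class PrefixExtractionSketch and its method names are hypothetical, and the sketch assumes that a prefix matches only on whole dot-separated segments. It shows the two cases mentioned, a list of prefixes and a null list meaning "keep everything".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the actual Rumen JobConfigurationParser.
class PrefixExtractionSketch {
    /**
     * Returns the entries of jobConf whose key equals one of the prefixes or
     * continues it after a "." separator. A null prefix list keeps everything.
     */
    static Map<String, String> extract(Map<String, String> jobConf, List<String> prefixes) {
        Map<String, String> retained = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : jobConf.entrySet()) {
            if (prefixes == null || matchesAny(entry.getKey(), prefixes)) {
                retained.put(entry.getKey(), entry.getValue());
            }
        }
        return retained;
    }

    private static boolean matchesAny(String key, List<String> prefixes) {
        for (String prefix : prefixes) {
            // Assumption: prefixes are given without a trailing ".".
            if (key.equals(prefix) || key.startsWith(prefix + ".")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new LinkedHashMap<>();
        jobConf.put("mapred.child.java.opts", "-Xmx512m");
        jobConf.put("mapreduce.job.name", "demo");
        jobConf.put("pig.example.feature", "JOIN"); // made-up key for illustration

        // Keeps only the mapred.* and pig.* entries.
        System.out.println(extract(jobConf, Arrays.asList("mapred", "pig")));
        // A null list keeps all three entries.
        System.out.println(extract(jobConf, null));
    }
}
```

        The retained map corresponds to what the "configuration" field in LoggedJob would hold: raw string values, with no conversion to primitive types.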
        Ravi Gummadi added a comment -

        (1) The paths of the new tests seem to differ from those of the existing tests: src/test/org/apache/hadoop/tools/rumen/Test* instead of src/test/mapred/org/apache/hadoop/tools/rumen/. Is this intentional?

        (2) TestRumenJobTraces has testJobConfigurationParser(), and this patch adds TestJobConfigurationParser.java. Can the old testcase be removed or moved out of TestRumenJobTraces?

        (3) Aren't we aiming to get these configuration properties of interest into the trace file? I don't see a way to specify a regular expression or a prefix for the properties of interest in TraceBuilder. Even with this patch, the trace file doesn't contain any new configuration properties, because TraceBuilder.run() gets the properties of interest from JobConfPropertyNames. Should we add an option to TraceBuilder that takes a regular expression and dumps all the config properties that match it?

        (4) This patch matches only the first part of a configuration property name (up to the first "."). Is there an easy way to make it match more than that?
        Basically, I would like to specify "mapreduce.job." as the prefix for the configuration properties I am interested in.
        It would also be cool to be able to specify prefixes of properties to exclude, such as "mapreduce.jobtracker." and "mapreduce.tasktracker.", to get all config properties except those matching these 2 patterns.

        Hong Tang added a comment -

        >>(1) The paths to the new tests seem to be different from existing tests. src/test/org/apache/hadoop/tools/rumen/Test* instead of src/test/mapred/org/apache/hadoop/tools/rumen/. Is this intentional?

        No, the tests for trunk are under src/test/mapred. This patch is for Yahoo Hadoop 0.20.

        >>(2) TestRumenJobTraces has testJobConfigurationParser() and this patch added TestJobConfigurationParser.java. Can the old testcase be removed/moved from TestRumenJobTraces?

        I was not aware of that. I think your suggestion makes sense.

        >>(3) Are we not targeting to get these interested configuration properties into trace file?

        Yes, TraceBuilder needs to be modified to expose the new feature to the end user. I will add it.

        >>(4) This patch matches only the first part of configuration property (till first ".")...

        I do not follow your first part. As for the second part (the exclusion list), it would add significant complexity (the order of the list may now matter). I suggest we wait until a concrete use case emerges.

        Ravi Gummadi added a comment -

        >>I do not follow your first part. For the second part (exclusion list), it will add significant complexity (now the order of the list may matter). I suggest we wait until some concrete usage case emerge.

        I hadn't gone through the match() method earlier. What I wanted to give as input is something like "mapreduce.job." as the prefix, and to get all the config properties that start with "mapreduce.job." into the trace file. I think it can be done. Let me look at your next patch, which adds an option to TraceBuilder, and see whether anything is missing from my expectations.

        I agree that it may be complex to support an exclusion list with StringTrie. Let us revisit that case later if it turns out to be important.

        Hong Tang added a comment -

        Patch that addresses Ravi's comments.

        • Moved a testcase from TestRumenJobTraces to TestJobConfigurationParser.
        • Added the plumbing to allow users to add to the list of properties to be extracted from jobconf.

        Note that to extract all properties starting with "mapred.*", one just needs to specify "mapred". The prefix (or property name) string is first tokenized by splitting on ".", and then each token has to exactly match the corresponding token of the property name, in sequence. So "mapred" will not match any property that starts with "mapreduce".
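
        A minimal sketch of this token-by-token matching follows, under stated assumptions: TokenTrieSketch is a hypothetical class written for illustration and is not the StringTrie from the patch. Each trie edge is a whole dot-separated token, so "mapred" and "mapreduce" are unrelated entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of token-based prefix matching; not the StringTrie in the patch.
class TokenTrieSketch {
    private final Map<String, TokenTrieSketch> children = new HashMap<>();
    private boolean terminal; // true if a registered prefix ends at this node

    /** Registers a prefix such as "mapred" or "mapreduce.job". */
    void add(String prefix) {
        TokenTrieSketch node = this;
        for (String token : prefix.split("\\.")) {
            node = node.children.computeIfAbsent(token, t -> new TokenTrieSketch());
        }
        node.terminal = true;
    }

    /** Returns true if the leading tokens of key spell out a registered prefix. */
    boolean matches(String key) {
        TokenTrieSketch node = this;
        for (String token : key.split("\\.")) {
            node = node.children.get(token);
            if (node == null) {
                return false; // this token does not continue any registered prefix
            }
            if (node.terminal) {
                return true; // a whole registered prefix has been consumed
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TokenTrieSketch trie = new TokenTrieSketch();
        trie.add("mapred");
        System.out.println(trie.matches("mapred.child.java.opts")); // true
        System.out.println(trie.matches("mapreduce.job.name"));     // false
    }
}
```

        Because String.split drops trailing empty tokens, registering "mapreduce.job." in this sketch behaves the same as registering "mapreduce.job": it matches "mapreduce.job.name" but not "mapreduce.jobtracker.address".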

        Amar Kamat added a comment -

        Hong,
        Isn't it simpler to call Configuration.getValByRegex() for each regex specified by the gridmix user and populate the key-value pairs maintained in JobConfigurationParser?
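
        The regex alternative could look roughly like the sketch below. To stay free of a Hadoop dependency it does not call Configuration.getValByRegex(); it applies java.util.regex to a plain Map as a stand-in. The class RegexSelectionSketch is hypothetical, and the use of find() (a match anywhere in the key unless the pattern is anchored) is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-driven selection; does not use the Hadoop API.
class RegexSelectionSketch {
    /** Collects every entry whose key is matched by at least one of the regexes. */
    static Map<String, String> selectByRegexes(Map<String, String> jobConf, List<String> regexes) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String regex : regexes) {
            Pattern pattern = Pattern.compile(regex);
            for (Map.Entry<String, String> entry : jobConf.entrySet()) {
                // find() matches anywhere in the key; anchor with ^ for prefix semantics.
                if (pattern.matcher(entry.getKey()).find()) {
                    selected.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new LinkedHashMap<>();
        jobConf.put("mapreduce.job.name", "demo");
        jobConf.put("mapreduce.jobtracker.address", "host:9001"); // sample value only
        jobConf.put("mapred.child.java.opts", "-Xmx512m");

        // The escaped trailing dot excludes mapreduce.jobtracker.*.
        System.out.println(selectByRegexes(jobConf, Arrays.asList("^mapreduce\\.job\\.")));
    }
}
```

        The trade-off is a scan of every key per regex, in exchange for not needing a trie or a token-splitting convention.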

        Amar Kamat added a comment -

        This got committed as part of MAPREDUCE-2153.


          People

          • Assignee: Hong Tang
          • Reporter: Hong Tang
          • Votes: 0
          • Watchers: 2
