Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Adds Reduce Attempt ID to ClientTrace log messages, and adds Reduce Attempt ID to HTTP query string sent to mapOutputServlet. Extracts partition number from attempt ID.

      Description

      Current clienttrace messages from shuffles note only the destination map ID but not the source reduce ID. Having both source and destination ID of each shuffle enables full tracing of execution.

      1. MAPREDUCE-479-4.patch
        3 kB
        Jiaqi Tan
      2. MAPREDUCE-479-3.patch
        3 kB
        Jiaqi Tan
      3. MAPREDUCE-479-2.patch
        3 kB
        Jiaqi Tan
      4. MAPREDUCE-479-1.patch
        7 kB
        Jiaqi Tan
      5. MAPREDUCE-479.patch
        2 kB
        Jiaqi Tan
      6. HADOOP-6013.patch
        1 kB
        Jiaqi Tan

        Issue Links

          Activity

          Jiaqi Tan created issue -
          Hide
          Jiaqi Tan added a comment -

          Adds reduceID to shuffle clienttrace messages

          Show
          Jiaqi Tan added a comment - Adds reduceID to shuffle clienttrace messages
          Jiaqi Tan made changes -
          Field Original Value New Value
          Attachment HADOOP-6013.patch [ 12410427 ]
          Hide
          Jiaqi Tan added a comment -

          Add reduce ID to shuffle clienttrace messages

          Show
          Jiaqi Tan added a comment - Add reduce ID to shuffle clienttrace messages
          Jiaqi Tan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Jiaqi Tan added a comment -

          Lowered severity to minor.

          Show
          Jiaqi Tan added a comment - Lowered severity to minor.
          Jiaqi Tan made changes -
          Priority Major [ 3 ] Minor [ 4 ]
          Hide
          Jiaqi Tan added a comment -

          Destination reduceID of shuffle needed in Chukwa for building causal tracing between maps and reduces; this saves the effort of having to parse task logs for shuffle information, and greatly reduces the number of adaptors needed (eliminates need for per-task adaptors).

          Show
          Jiaqi Tan added a comment - Destination reduceID of shuffle needed in Chukwa for building causal tracing between maps and reduces; this saves the effort of having to parse task logs for shuffle information, and greatly reduces the number of adaptors needed (eliminates need for per-task adaptors).
          Jiaqi Tan made changes -
          Link This issue relates to HADOOP-5876 [ HADOOP-5876 ]
          Hide
          Jiaqi Tan added a comment -

          Much easier way of getting information about shuffles that was previously requested with much lower overhead since the ClientTrace messages are already being generated.

          Show
          Jiaqi Tan added a comment - Much easier way of getting information about shuffles that was previously requested with much lower overhead since the ClientTrace messages are already being generated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12410427/HADOOP-6013.patch
          against trunk revision 785065.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12410427/HADOOP-6013.patch against trunk revision 785065. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/502/console This message is automatically generated.
          Hide
          Jiaqi Tan added a comment -

          Both failed unit tests are not related to this patch.

          Show
          Jiaqi Tan added a comment - Both failed unit tests are not related to this patch.
          Owen O'Malley made changes -
          Project Hadoop Common [ 12310240 ] Hadoop Map/Reduce [ 12310941 ]
          Key HADOOP-6013 MAPREDUCE-479
          Affects Version/s 0.20.0 [ 12313438 ]
          Component/s mapred [ 12310690 ]
          Fix Version/s 0.21.0 [ 12313563 ]
          Hide
          Eric Yang added a comment -

          +1 looks good.

          Show
          Eric Yang added a comment - +1 looks good.
          Hide
          Chris Douglas added a comment -

          Including the reduceId is clearly useful, but I'd prefer the parameter be named "reduceId" since it cannot be associated with the srvID from DataNode (in contrast, the cliID fields can be associated with some effort and approximation). The patch also needs to be regenerated relative to the current subproject tree.

          Show
          Chris Douglas added a comment - Including the reduceId is clearly useful, but I'd prefer the parameter be named "reduceId" since it cannot be associated with the srvID from DataNode (in contrast, the cliID fields can be associated with some effort and approximation). The patch also needs to be regenerated relative to the current subproject tree.
          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Assignee Jiaqi Tan [ tanjiaqi ]
          Hide
          Jiaqi Tan added a comment -

          It turns out that the reduce from the request is just a single integer rather than the full reduce ID, and I'm trying to generate the full reduce ID, but org.apache.hadoop.mapred.TaskID is deprecated; is there an alternative way to generate task IDs?

          Show
          Jiaqi Tan added a comment - It turns out that the reduce from the request is just a single integer rather than the full reduce ID, and I'm trying to generate the full reduce ID, but org.apache.hadoop.mapred.TaskID is deprecated; is there an alternative way to generate task IDs?
          Hide
          Chris Douglas added a comment -

          No, the task attempt ID isn't part of the request. The TT doesn't need to know which attempt is requesting the data, only which partition of the map output is being requested.

          Show
          Chris Douglas added a comment - No, the task attempt ID isn't part of the request. The TT doesn't need to know which attempt is requesting the data, only which partition of the map output is being requested.
          Hide
          Jiaqi Tan added a comment -

          Is it acceptable to add the reduce ID to the HTTP request coming from the ReduceTask for tracing purposes?

          Show
          Jiaqi Tan added a comment - Is it acceptable to add the reduce ID to the HTTP request coming from the ReduceTask for tracing purposes?
          Hide
          Chris Douglas added a comment -

          That would be suboptimal... it's not actually a parameter in the request and maintaining it as a necessary side-effect requires future versions to preserve it. There are some interesting possibilities for unit tests, e.g. returning corrupt input for the first attempt but not the second, but the real use cases don't care which attempt is asking for the data. Unless different attempts are scheduled on the same TaskTracker- an illegal case in the current code- it should be possible to disambiguate which task attempt requested the data using the src/dst pair in clienttrace, correlating it with the JT history and other TT logs. Would this be sufficient?

          I suppose the value is more aptly keyed as "reduce" rather than "reduceID"...

          Show
          Chris Douglas added a comment - That would be suboptimal... it's not actually a parameter in the request and maintaining it as a necessary side-effect requires future versions to preserve it. There are some interesting possibilities for unit tests, e.g. returning corrupt input for the first attempt but not the second, but the real use cases don't care which attempt is asking for the data. Unless different attempts are scheduled on the same TaskTracker- an illegal case in the current code- it should be possible to disambiguate which task attempt requested the data using the src/dst pair in clienttrace, correlating it with the JT history and other TT logs. Would this be sufficient? I suppose the value is more aptly keyed as "reduce" rather than "reduceID"...
          Hide
          Jiaqi Tan added a comment -

          How would the disambiguation occur? Given that the dest is the reduce host, what if there are N reducers running on the same host, and more than one of these reducers request the same map output? In this case, how would the disambiguation occur?

          Also, in a related log statement, (approximately TaskTracker.java:2946)

          LOG.info("Sent out " + totalRead + " bytes for reduce: " + reduce +
          " from map: " + mapId + " given " + info.partLength + "/" +
          info.rawLength);

          The "reduce" used to return the attempt ID of the reduce (in 0.19 and earlier), but it now returns a single integer in trunk. That makes it difficult to disambiguate using TT logs as well because the TT logs use the same reduce value to identify the requesting reducer of the map output.

          Show
          Jiaqi Tan added a comment - How would the disambiguation occur? Given that the dest is the reduce host, what if there are N reducers running on the same host, and more than one of these reducers request the same map output? In this case, how would the disambiguation occur? Also, in a related log statement, (approximately TaskTracker.java:2946) LOG.info("Sent out " + totalRead + " bytes for reduce: " + reduce + " from map: " + mapId + " given " + info.partLength + "/" + info.rawLength); The "reduce" used to return the attempt ID of the reduce (in 0.19 and earlier), but it now returns a single integer in trunk. That makes it difficult to disambiguate using TT logs as well because the TT logs use the same reduce value to identify the requesting reducer of the map output.
          Hide
          Chris Douglas added a comment -

          How would the disambiguation occur? Given that the dest is the reduce host, what if there are N reducers running on the same host, and more than one of these reducers request the same map output? In this case, how would the disambiguation occur?

          If reduce i requests map output j, then the reduce belongs to the same job. If N reducers are running on the same host, only one has the ID i since (as mentioned previously) two attempts of the same task will not be scheduled on the same host. The job, reduce id, and related data should be sufficient to uniquely identify the reduce attempt.

          Show
          Chris Douglas added a comment - How would the disambiguation occur? Given that the dest is the reduce host, what if there are N reducers running on the same host, and more than one of these reducers request the same map output? In this case, how would the disambiguation occur? If reduce i requests map output j , then the reduce belongs to the same job. If N reducers are running on the same host, only one has the ID i since (as mentioned previously) two attempts of the same task will not be scheduled on the same host. The job, reduce id, and related data should be sufficient to uniquely identify the reduce attempt.
          Hide
          Jiaqi Tan added a comment -

          Cleaned up patch for new branched tree

          Show
          Jiaqi Tan added a comment - Cleaned up patch for new branched tree
          Jiaqi Tan made changes -
          Attachment MAPREDUCE-479.patch [ 12413251 ]
          Jiaqi Tan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Adds Reduce ID to ClientTrace log messages. Explicitly uses new mapreduce.JobID for compatibility with updated TaskID constructor.
          Affects Version/s 0.21.0 [ 12314045 ]
          Fix Version/s 0.21.0 [ 12314045 ]
          Hide
          Jiaqi Tan added a comment -

          Will submit a new patch to add reduce attempt ID to eliminate assumption that no 2 attempts will run on same host, in case the assumption breaks in post-0.20 scheduling.

          Show
          Jiaqi Tan added a comment - Will submit a new patch to add reduce attempt ID to eliminate assumption that no 2 attempts will run on same host, in case the assumption breaks in post-0.20 scheduling.
          Jiaqi Tan made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Jiaqi Tan added a comment -

          Adds reduce attempt ID to shuffle call to mapOutputServlet's query string to enable true causal tracing. Eliminates assumption in tracing that no two attempts of the same task can run on the same node; even if we allow two attempts of the same task to run on the same node, we need globally synchronized clocks to disambiguate them.

          Show
          Jiaqi Tan added a comment - Adds reduce attempt ID to shuffle call to mapOutputServlet's query string to enable true causal tracing. Eliminates assumption in tracing that no two attempts of the same task can run on the same node; even if we allow two attempts of the same task to run on the same node, we need globally synchronized clocks to disambiguate them.
          Jiaqi Tan made changes -
          Attachment MAPREDUCE-479-1.patch [ 12413322 ]
          Hide
          Jiaqi Tan added a comment -

          I would prefer adding the reduce attempt ID to the HTTP query string because this eliminates the need for assuming that no two attempts of the same task can run on the same node; I can see scenarios where a custom scheduler may break this assumption and make tracing very complicated. The incremental cost in terms of additional network traffic of adding the reduce attempt ID should be minimal and much smaller than the total data shuffled in a typical job.

          Show
          Jiaqi Tan added a comment - I would prefer adding the reduce attempt ID to the HTTP query string because this eliminates the need for assuming that no two attempts of the same task can run on the same node; I can see scenarios where a custom scheduler may break this assumption and make tracing very complicated. The incremental cost in terms of additional network traffic of adding the reduce attempt ID should be minimal and much smaller than the total data shuffled in a typical job.
          Jiaqi Tan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Adds Reduce ID to ClientTrace log messages. Explicitly uses new mapreduce.JobID for compatibility with updated TaskID constructor. Adds Reduce Attempt ID to ClientTrace log messages, and adds Reduce Attempt ID to HTTP query string sent to mapOutputServlet.
          Hide
          Jiaqi Tan added a comment -

          Sorry tainted patch, not the latest. Will upload cleaned up one.

          Show
          Jiaqi Tan added a comment - Sorry tainted patch, not the latest. Will upload cleaned up one.
          Jiaqi Tan made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Jiaqi Tan added a comment -

          Updated, correct patch.

          Show
          Jiaqi Tan added a comment - Updated, correct patch.
          Jiaqi Tan made changes -
          Attachment MAPREDUCE-479-2.patch [ 12413324 ]
          Jiaqi Tan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12413324/MAPREDUCE-479-2.patch
          against trunk revision 793457.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12413324/MAPREDUCE-479-2.patch against trunk revision 793457. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/385/console This message is automatically generated.
          Hide
          Jiaqi Tan added a comment -

          Updated with fixed message format string.

          Show
          Jiaqi Tan added a comment - Updated with fixed message format string.
          Jiaqi Tan made changes -
          Attachment MAPREDUCE-479-3.patch [ 12413440 ]
          Hide
          Jiaqi Tan added a comment -

          Tests not required because the updates were made only to logging statements and do not affect core functionality. Manual verification was performed with a build and hand-inspection of HTTP query string of shuffle requests and clienttrace log messages generated by the TaskTracker.

          Show
          Jiaqi Tan added a comment - Tests not required because the updates were made only to logging statements and do not affect core functionality. Manual verification was performed with a build and hand-inspection of HTTP query string of shuffle requests and clienttrace log messages generated by the TaskTracker.
          Hide
          Chris Douglas added a comment -

          That would be suboptimal... it's not actually a parameter in the request and maintaining it as a necessary side-effect requires future versions to preserve it.

          I remain opposed to adding a string to the query to be logged on the remote side. If you want to make the case for replacing the partition with the attempt ID- and extracting the partition from it on the TaskTracker side- I would be +0 on that approach.

          Show
          Chris Douglas added a comment - That would be suboptimal... it's not actually a parameter in the request and maintaining it as a necessary side-effect requires future versions to preserve it. I remain opposed to adding a string to the query to be logged on the remote side. If you want to make the case for replacing the partition with the attempt ID- and extracting the partition from it on the TaskTracker side- I would be +0 on that approach.
          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Jiaqi Tan added a comment -

          Traceability greatly improves the ability for diagnosis to be carried out on the system, and I would argue that the ability for the system to be diagnosed is a first-class feature that is important for its own sake; if the system's operation cannot be exposed at all then that greatly detracts from the usability of the system for users.

          I would still argue for the side-effect to be maintained because needing to perform time-based + host-based disambiguation of multiple attempts of the same task requires the assumption that the same task will not have multiple attempts being run on the same host, which is even more opaque than maintaining a publicly-documented (i.e. with this JIRA) side-effect.

          Show
          Jiaqi Tan added a comment - Traceability greatly improves the ability for diagnosis to be carried out on the system, and I would argue that the ability for the system to be diagnosed is a first-class feature that is important for its own sake; if the system's operation cannot be exposed at all then that greatly detracts from the usability of the system for users. I would still argue for the side-effect to be maintained because needing to perform time-based + host-based disambiguation of multiple attempts of the same task requires the assumption that the same task will not have multiple attempts being run on the same host, which is even more opaque than maintaining a publicly-documented (i.e. with this JIRA) side-effect.
          Hide
          Chris Douglas added a comment -

          if the system's operation cannot be exposed at all then that greatly detracts from the usability of the system for users. [...] needing to perform time-based + host-based disambiguation of multiple attempts of the same task requires the assumption that the same task will not have multiple attempts being run on the same host, which is even more opaque than maintaining a publicly-documented (i.e. with this JIRA) side-effect.

          I think you and I differ in our perspectives of users. Even granting the premise that traceability greatly improves diagnosis of system problems, that Hadoop will be less usable to those writing map/reduce programs because it's possible that, someday, someone may write a scheduler that speculatively executes two instances of the same task on the same machine at the same time... does not follow. The point we're debating is not whether tracing is good, but whether including the task attempt ID as a permanent feature of the shuffle protocol is a good tradeoff for map/reduce.

          The information is available in general, but for near-certainty in this hypothetical situation one needs to examine task logs. That it's easier to collect from the TT logs isn't a good enough reason to impose this on maintainers; expecting them to find its purpose "documented" in the bug tracking system and retain the property in support of possible downstream processing is not reasonable.

          Show
          Chris Douglas added a comment - if the system's operation cannot be exposed at all then that greatly detracts from the usability of the system for users. [...] needing to perform time-based + host-based disambiguation of multiple attempts of the same task requires the assumption that the same task will not have multiple attempts being run on the same host, which is even more opaque than maintaining a publicly-documented (i.e. with this JIRA) side-effect. I think you and I differ in our perspectives of users. Even granting the premise that traceability greatly improves diagnosis of system problems, that Hadoop will be less usable to those writing map/reduce programs because it's possible that, someday, someone may write a scheduler that speculatively executes two instances of the same task on the same machine at the same time... does not follow. The point we're debating is not whether tracing is good, but whether including the task attempt ID as a permanent feature of the shuffle protocol is a good tradeoff for map/reduce. The information is available in general, but for near-certainty in this hypothetical situation one needs to examine task logs. That it's easier to collect from the TT logs isn't a good enough reason to impose this on maintainers; expecting them to find its purpose "documented" in the bug tracking system and retain the property in support of possible downstream processing is not reasonable.
          Hide
          Jiaqi Tan added a comment -

          I would argue that you cannot actually get the information accurately from TT logs; doing that would require the assumption of clock synchronization which is a common enough assumption in academia but in practice does not always hold true. The clienttrace messages are logged on the TT which is serving the map outputs, and are timestamped with the clock of the TT serving the map outputs. But the reduce attempts are running on other TTs, and those log messages that record the running of the reduce attempts are timestamped with the clock of the TT running the reduce. The only way to get ground truth that reflects causality is to pass the reduce attempt ID unless you have a totally ordered global clock in the system. We could get that from the JT, but unfortunately the job history logs for some reason record 0 times for failed attempts.

          What I'm suggesting is that short of putting the reduce attempt ID in the shuffle protocol, there is no other way of obtaining that information accurately, any other way is subject to assumptions such as clock synchronization or assumptions about the scheduling protocol. Also, in previous versions, the shuffle passed the entire reduce attempt ID in the HTTP request IIRC, so I'm not seeing why that's an issue now to put it back.

          Show
          Jiaqi Tan added a comment - I would argue that you cannot actually get the information accurately from TT logs; doing that would require the assumption of clock synchronization which is a common enough assumption in academia but in practice does not always hold true. The clienttrace messages are logged on the TT which is serving the map outputs, and are timestamped with the clock of the TT serving the map outputs. But the reduce attempts are running on other TTs, and those log messages that record the running of the reduce attempts are timestamped with the clock of the TT running the reduce. The only way to get ground truth that reflects causality is to pass the reduce attempt ID unless you have a totally ordered global clock in the system. We could get that from the JT, but unfortunately the job history logs for some reason record 0 times for failed attempts. What I'm suggesting is that short of putting the reduce attempt ID in the shuffle protocol, there is no other way of obtaining that information accurately, any other way is subject to assumptions such as clock synchronization or assumptions about the scheduling protocol. Also, in previous versions, the shuffle passed the entire reduce attempt ID in the HTTP request IIRC, so I'm not seeing why that's an issue now to put it back.
          Hide
          Chris Douglas added a comment -

          As I said earlier, if you wanted to replace the reduce parameter with the attempt id- and extract the former from the latter- I wouldn't oppose it. What bothers me is that, other than logging it to the clienttrace on behalf of a possible downstream tool, the parameter serves no functional purpose to map/reduce; it is likely to be "optimized" away. If you elect to do this, you should also add a comment explaining why the full attempt id is included in the request.

          Show
          Chris Douglas added a comment - As I said earlier, if you wanted to replace the reduce parameter with the attempt id- and extract the former from the latter- I wouldn't oppose it. What bothers me is that, other than logging it to the clienttrace on behalf of a possible downstream tool, the parameter serves no functional purpose to map/reduce; it is likely to be "optimized" away. If you elect to do this, you should also add a comment explaining why the full attempt id is included in the request.
          Hide
          Jiaqi Tan added a comment -

          Update: currently working on testing the patch for performance impact will have patch available by end of week or early next week.

          Show
          Jiaqi Tan added a comment - Update: currently working on testing the patch for performance impact will have patch available by end of week or early next week.
          Hide
          Jiaqi Tan added a comment -

          Updated patch for sending attempt ID in shuffle, extracting reduce partition number from attempt ID, and logging attempt ID in clienttrace message

          Show
          Jiaqi Tan added a comment - Updated patch for sending attempt ID in shuffle, extracting reduce partition number from attempt ID, and logging attempt ID in clienttrace message
          Jiaqi Tan made changes -
          Attachment MAPREDUCE-479-4.patch [ 12415747 ]
          Hide
          Jiaqi Tan added a comment -

          Did microbenchmark of shuffle durations with and without added reduce attempt ID transmission and reduce partition number extraction; shuffle times before and after this patch are statistically comparable (chi-squared test for distribution similarity of shuffle times, p-value 0.23 => null-hypothesis of statistically different distributions not rejected); thus this patch does not cause any performance impact.

          Show
          Jiaqi Tan added a comment - Did microbenchmark of shuffle durations with and without added reduce attempt ID transmission and reduce partition number extraction; shuffle times before and after this patch are statistically comparable (chi-squared test for distribution similarity of shuffle times, p-value 0.23 => null-hypothesis of statistically different distributions not rejected); thus this patch does not cause any performance impact.
          Jiaqi Tan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note Adds Reduce Attempt ID to ClientTrace log messages, and adds Reduce Attempt ID to HTTP query string sent to mapOutputServlet. Adds Reduce Attempt ID to ClientTrace log messages, and adds Reduce Attempt ID to HTTP query string sent to mapOutputServlet. Extracts partition number from attempt ID.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415747/MAPREDUCE-479-4.patch
          against trunk revision 801959.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415747/MAPREDUCE-479-4.patch against trunk revision 801959. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/453/console This message is automatically generated.
          Hide
          Jiaqi Tan added a comment -

          Failed QA tests due to other patch that was also in the same build; this patch did not change anything in mapred.lib but failed tests were in mapred.lib.

          Show
          Jiaqi Tan added a comment - Failed QA tests due to other patch that was also in the same build; this patch did not change anything in mapred.lib but failed tests were in mapred.lib.
          Hide
          Chris Douglas added a comment -

          +1

          I committed this. Thanks Jiaqi!

          Show
          Chris Douglas added a comment - +1 I committed this. Thanks Jiaqi!
          Chris Douglas made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Incompatible change, Reviewed]
          Resolution Fixed [ 1 ]
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Jiaqi Tan
              Reporter:
              Jiaqi Tan
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development