Hadoop Map/Reduce
MAPREDUCE-2324

Job should fail if a reduce task can't be scheduled anywhere

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2, 0.20.205.0
    • Fix Version/s: 0.20.205.0
    • Component/s: None
    • Labels: None

      Description

      If there's a reduce task that needs more disk space than is available on any mapred.local.dir in the cluster, that task will stay pending forever. For example, we produced this in a QA cluster by accidentally running terasort with one reducer - since no mapred.local.dir had 1T free, the job remained in pending state for several days. The reason for the "stuck" task wasn't clear from a user perspective until we looked at the JT logs.

      Probably better to just fail the job if a reduce task goes through all TTs and finds that there isn't enough space.

      Attachments

      1. MR-2324-secutiry-just-log-v1.patch
        0.7 kB
        Robert Joseph Evans
      2. MR-2324-security-v3.patch
        10 kB
        Robert Joseph Evans
      3. MR-2324-security-v2.txt
        9 kB
        Robert Joseph Evans
      4. MR-2324-security-v1.txt
        9 kB
        Robert Joseph Evans
      5. MR-2324-disable-check-v2.patch
        0.9 kB
        Robert Joseph Evans

        Activity

        Todd Lipcon created issue -
        Allen Wittenauer added a comment -

        We just hit this a few weeks ago. Thanks for filing a bug, because I forgot to do it.

        Robert Joseph Evans made changes -
        Field Original Value New Value
        Assignee Robert Joseph Evans [ revans2 ]
        Robert Joseph Evans added a comment -

        I just found the same issue and I am looking into what is the best way to solve it.

        If mapreduce.reduce.input.limit is mis-configured, or if a cluster is just running low on disk space in general, then reduces with a large input may never get scheduled, causing the job to never fail and never succeed; it just starves until the job is killed.

        The JobInProgress tries to guess at the size of the input to all reducers in a job. If the size is over mapreduce.reduce.input.limit then the job is killed. If it is not, then findNewReduceTask() checks whether the estimated size is too big to fit on the node currently looking for work; if it will not fit, the reduce is skipped and some other task gets a chance at the slot.
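
        A minimal sketch of that existing flow, using simplified names rather than the actual JobInProgress fields:

        final class ReduceSchedulingSketch {

          // Job-level check: a positive mapreduce.reduce.input.limit kills jobs whose
          // estimated reduce input exceeds it.
          static boolean exceedsInputLimit(long estimatedReduceInput, long reduceInputLimit) {
            return reduceInputLimit > 0 && estimatedReduceInput > reduceInputLimit;
          }

          // Per-heartbeat check in findNewReduceTask(): if the estimate does not fit in the
          // tracker's free local-dir space the reduce is simply skipped, so a too-large reduce
          // can stay pending forever without the job ever failing.
          static boolean fitsOnTracker(long estimatedReduceInput, long trackerFreeDiskSpace) {
            return estimatedReduceInput <= trackerFreeDiskSpace;
          }
        }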

        The idea is to keep track of how often a reduce slot is rejected because of the lack of space vs. how often assignment succeeds, and then guess whether the reduce tasks will ever be scheduled.

        So I would like some feedback on this.

        1) How should we guess? Someone who found the bug here suggested P1 + (P2 * S), where S is the number of successful assignments. Possibly P1 = 20 and P2 = 2.0. I am not really sure.
        2) What should we do when we guess that it will never get a slot? Should we fail the job, or should we schedule it anyway, even though it might fail, and see if it really does fail?

        Robert Joseph Evans made changes -
        Affects Version/s 0.20.205.0 [ 12316391 ]
        Robert Joseph Evans added a comment -

        OK, I have thought about it and talked with some people about statistics/scheduling and the like, and the conclusion I have come to is the following.

        We should add in a new configuration parameter called mapreduce.reduce.input.limit.attempt.factor

        This value would default to 1.0 and would be used to determine the number of times a reduce task can be rejected, because the estimated input size will not fit, before the job is killed. So if (#failedAttempts > (#ofActiveNodes * attempt.factor)) then kill the job.
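
        A minimal sketch of the proposed cutoff, with hypothetical names (the JobTracker is assumed to already know the active node count and the per-job rejection count; this is not the actual patch):

        final class AttemptFactorSketch {
          // True when the job's reduces have been rejected for lack of disk space more
          // times than attemptFactor * activeNodes, i.e. when the job should be killed.
          static boolean shouldKill(int failedScheduleAttempts, int activeNodes, double attemptFactor) {
            return failedScheduleAttempts > (int) (activeNodes * attemptFactor);
          }
        }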

        Todd Lipcon added a comment -

        It seems we need to not just count the number of rejections, but actually track the unique TTs. Especially with out-of-band heartbeats, it's possible some TT might heartbeat many times before another one heartbeats at all. This certainly uses more memory, but in MR-279 the memory usage in the JT isn't as big a deal, right?

        Robert Joseph Evans added a comment -

        I have not really thought about this yet in reference to MR-279. I am more concerned about a sustaining release for the 0.20.20X line, and will then look at porting the functionality to MR-279.

        I am not sure that it even applies to MR-279 because of how different the scheduling is. My understanding is that the ApplicationMaster will make a request to the ResourceManager for a set of nodes that meet criteria X (I believe that disk space is one of the criteria you can request, but it is currently ignored). The ResourceManager looks at all of the nodes available and hands back a list of nodes that best match the given criteria. So the ApplicationMaster has no idea which, if any, nodes were considered and rejected, or even what all of the nodes in the system are. If we wanted to keep track of individual nodes, it would either have to be on the ResourceManager, which does have resource constraints, or in the ApplicationMaster, which would then need a list of all nodes in the cluster along with which nodes were tried and rejected for which reasons.

        In fact mapreduce.reduce.input.limit is not in the MR-279 code base at all, so for MR-279 we need to think about resource limits and scheduling more generally.

        Robert Joseph Evans added a comment -

        I am uploading this patch based on my initial proposal for limiting the maximum number of times we can try, and fail, to schedule a reduce task because of size issues. This patch is only intended for the security branch, not trunk or MR-279. We still need to have a discussion about how MR-279 will handle these issues.

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]

        Robert Joseph Evans made changes -
        Attachment MR-2324-security-v1.txt [ 12486489 ]
        Robert Joseph Evans made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12486489/MR-2324-security-v1.txt
        against trunk revision 1146517.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/469//console

        This message is automatically generated.

        Allen Wittenauer added a comment -

        > I am more concerned about a sustaining release for the 0.20.20X line

        If a "real" fix will require a different config param, I'd rather see this bumped to 0.23. This has been a known (and annoying) bug for a long time, but doesn't really require an immediate, sustaining fix if that fix is going to be incomplete and ripped out 6 months later in a newer branch.

        Robert Joseph Evans added a comment -

        I don't believe that the fix I submitted is incomplete; the issue is that MRv2 does things so differently that we need to tackle the problem in a different way. I am sure the patch is not perfect and I am very happy to see any better ideas/patches. Also, I am getting noise from my customers about this, so I would like to see a fix in a sustaining release. It is not a lot of noise, but I do have to at least try to get a fix in.

        I do agree that having different configuration values is an issue that I would like to avoid, but 0.23 has currently dropped mapreduce.reduce.input.limit altogether, along with who knows what other configuration values. I do not see any way to maintain mapreduce.reduce.input.limit in MRv2.

        I have started looking at the scheduler code in YARN and, while this is just preliminary, it looks like what we want to do is extend Resource to include disk space, not just RAM. The NodeManager can then also report back the amount of disk space that it has free, just like the TaskTracker does. Then, for reduce tasks, the MR ApplicationMaster can request the container based on the estimated reduce input size. We can also put in a more generic resource starvation detection mechanism that would work for both RAM and disk.

        Robert Joseph Evans added a comment -

        Looking back, I realize that I probably have not answered Todd's question satisfactorily. Yes, there are out-of-band heartbeats, and in fact not every TT heartbeat will make it all the way through to this piece of code, because the node may have no slots available by the time it gets to this job. The intention was not to verify that the job has been tried on every TT before giving up. The idea was to make a reasonable effort at scheduling the job before giving up. I suspect that the amount of free disk space on a node may vary quite a bit between heartbeats, just because jobs use disk space that then goes away, HDFS stores a file that is later deleted, or several new blocks are added. So even if we give every node a chance at this job before giving up, there is still a possibility that it will succeed later on. We cannot predict the future, but we do need to put an upper bound on how long we try to do something, otherwise there will always be corner cases where we can get starvation.

        It may also make sense to use some statistical heuristics in MR-279 to try to give up sooner rather than later if someone is asking for something that is really outside the norm. But that is just an optimization.

        Robert Joseph Evans added a comment -

        I uploaded a patch a while ago and the conversation has kind of died off. Can someone please review the patch and give me some feedback on it? If it is something that you don't want to put into a sustaining release at this time, then please give me some feedback, possibly with a -1 depending on how adamant you are about it, so I can address those issues, perhaps by fixing it just in 0.23 instead.

        Todd Lipcon added a comment -

        Hey Bobby. Sorry, was on vacation last week so only partially keeping up with JIRA traffic.

        My worry mostly has to do with this feature kicking in as a false positive. In general, false positives here are very expensive, whereas false negatives are not nearly as drastic.

        For example, imagine a cluster with 10 nodes and a couple of jobs submitted. One of the nodes is out of disk space. The first job, when submitted, takes up all the reduce slots on the first 9 nodes, but the 10th node is left empty since it's out of space. When the second job is submitted, all of the free reduce slots on the cluster are located on this remaining node. Every time the node heartbeats, the counter will get incremented for the queued up job. After 10 heartbeats, the job will fail, even though it was just a single problematic node.

        So, I think we do need to wait for a scheduling opportunity on at least some number of unique nodes before failing the job. It seems we could do this with a single HashSet per job: whenever any reduce task is successfully scheduled, the set is cleared. Whenever a job is given an opportunity to schedule reduces on a node, but can't due to resource constraints, that node is added to the set. Once the size of the set eclipses some percentage of the nodes on the cluster, we fail the job. This memory usage would be O(nodes*jobs) rather than O(nodes*tasks), and thus not too bad.
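
        A hedged sketch of that per-job bookkeeping, with illustrative names and thresholds rather than the committed patch:

        import java.util.HashSet;
        import java.util.Set;

        final class ReduceStarvationTracker {
          private final Set<String> failedReduceSchedulingTrackers = new HashSet<String>();
          private final double failureFraction; // e.g. 0.8: fail once 80% of active nodes have rejected the job

          ReduceStarvationTracker(double failureFraction) {
            this.failureFraction = failureFraction;
          }

          // A reduce was successfully scheduled somewhere, so forget past rejections.
          synchronized void reduceScheduled() {
            failedReduceSchedulingTrackers.clear();
          }

          // A tracker had a free slot but could not take the reduce for lack of disk space.
          // Returns true when enough distinct trackers have rejected the job that it should fail.
          synchronized boolean trackerRejected(String trackerName, int activeTrackers) {
            failedReduceSchedulingTrackers.add(trackerName);
            return failedReduceSchedulingTrackers.size()
                >= Math.max(1, (int) (activeTrackers * failureFraction));
          }
        }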

        Robert Joseph Evans added a comment -

        That is a very good point and I really like the solution. I will incorporate your comments and upload a new patch.

        Robert Joseph Evans added a comment -

        Uploading new patch.

        Robert Joseph Evans made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Robert Joseph Evans added a comment -

        [exec]
        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        I also did some math. On our largest cluster here at Yahoo! we have < 5000 machines and at most about 200 jobs running concurrently. That comes out to about 8-16 MB in extra heap usage on the JT, if the HashMap is half full and all of those 200 jobs are about to fail because of reduce scheduling issues.
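
        A back-of-the-envelope version of that estimate; the per-entry byte counts below are assumptions, not measurements:

        final class HeapEstimate {
          public static void main(String[] args) {
            long nodes = 5000, jobs = 200;
            long entries = nodes * jobs;   // worst case: every job tracks a rejection for every node
            long low = entries * 8;        // assumed ~8 bytes per entry (shared String reference)
            long high = entries * 16;      // assumed ~16 bytes with hash-set overhead
            System.out.printf("%d-%d MB%n", low / 1000000, high / 1000000);
          }
        }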

        Robert Joseph Evans made changes -
        Attachment MR-2324-security-v2.txt [ 12486922 ]
        Robert Joseph Evans made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12486922/MR-2324-security-v2.txt
        against trunk revision 1147981.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/478//console

        This message is automatically generated.

        Todd Lipcon added a comment -

        Not sure if I'll have time to review this in the next couple days. Anyone over there who could review for you? Otherwise I'll try to look by the end of the week.

        Robert Joseph Evans made changes -
        Fix Version/s 0.20.205.0 [ 12316391 ]
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1164/
        -----------------------------------------------------------

        Review request for hadoop-mapreduce, Todd Lipcon, Tom Graves, and Jeffrey Naisbitt.

        Summary
        -------

        Job should fail if a reduce task can't be scheduled anywhere. V2 of the patch.

        This addresses bug MAPREDUCE-2324.
        https://issues.apache.org/jira/browse/MAPREDUCE-2324

        Diffs


        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java 1148035
        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/TaskTracker.java 1148035
        branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/MiniMRCluster.java 1148035
        branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java 1148035

        Diff: https://reviews.apache.org/r/1164/diff

        Testing
        -------

        Unit tests and ran manual tests on a single node cluster.

        Thanks,

        Robert

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/1164/#review1157
        -----------------------------------------------------------

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2386>

        I think a default of 0.8 or so would probably make more sense – just in case one of the TTs is in some bad state where it isn't heartbeating, we don't want to wait forever.

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2387>

        since this is a set of trackers, not attempts, a better name might be: failedReduceSchedulingTrackers, or something?

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2388>

        this key should probably be defined as a constant in MRJobConfig, right?

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2389>

        refactor this into a new method?

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2391>

        I think the operator precedence is off here.

        The cast (int)reduceInputAttemptFactor binds more tightly than the multiplication, so it will end up rounding anything < 1.0 down to 0.
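
        For illustration, with assumed names rather than the patch's actual variables:

        final class CastPrecedenceExample {
          // (int) binds tighter than '*': with factor = 0.8 this is (0 * activeTrackers) == 0.
          static int broken(double factor, int activeTrackers) {
            return (int) factor * activeTrackers;
          }

          // Intended: scale first, then truncate.
          static int intended(double factor, int activeTrackers) {
            return (int) (factor * activeTrackers);
          }
        }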

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2390>

        style: add space between if and (

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2392>

        I think we mostly avoid the 1st person in error messages. Change to "Tried to schedule..." rather than "We tried"...

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2393>

        StringUtils.humanReadableInt might be useful here.

        branches/branch-0.20-security/src/mapred/org/apache/hadoop/mapred/JobInProgress.java
        <https://reviews.apache.org/r/1164/#comment2394>

        jobId is the job, not the task, right?

        branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java
        <https://reviews.apache.org/r/1164/#comment2395>

        typos: "failes", "then that"

        branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java
        <https://reviews.apache.org/r/1164/#comment2396>

        isn't the default input limit unlimited? why do we need this?

        branches/branch-0.20-security/src/test/org/apache/hadoop/mapred/TestTaskLimits.java
        <https://reviews.apache.org/r/1164/#comment2397>

        we should check that the failure info of the job has the correct type of error message (ie that it didn't fail due to some other error)

        - Todd

        Robert Joseph Evans made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Robert Joseph Evans added a comment -

        I incorporated the comments from Todd's review (thanks, Todd). I did not add the config value into MRJobConfig.java because the patch is for 0.20.205, which does not have MRJobConfig.java yet.

        Robert Joseph Evans made changes -
        Attachment MR-2324-security-v3.patch [ 12487365 ]
        Robert Joseph Evans made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Todd Lipcon added a comment -

        Since we have a simple, non-heuristic solution, can this be done for MR-279 too? It seems like the same method would apply (though I don't know the code base).

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12487365/MR-2324-security-v3.patch
        against trunk revision 1149323.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/492//console

        This message is automatically generated.

        Robert Joseph Evans added a comment -

        I have been looking at MR-279 and I want to do something similar; it is just not really set up to do it easily. The scheduling is split up between the ResourceManager and the ApplicationMaster. And in fact the ResourceManager/ApplicationMaster are completely ignoring disk utilization at this point.

        The plan is to add disk utilization to the resources that the RM uses, and then have the AM request both disk and RAM space for reduces, with disk space based on the size estimate currently used. Then inside the scheduler, which in my opinion is the right place to decide if a request is being starved or not, it would do just what we do now, but generalized for all resource constraints, not just disk. This means that all schedulers would have to be modified to support this, but I can make the code generic so it should be fairly simple to do. I just need to dig into the MR-279 code to decide exactly how I want to insert this. I should hopefully have a patch by mid next week.

        Robert Joseph Evans added a comment -

        The following is the updated manual run of test-patch for V3

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 6 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
        [exec]

        Robert Joseph Evans added a comment -

        I just filed MAPREDUCE-2723 as a subtask to port these changes to MRV2.

        Arun C Murthy added a comment -

        Sorry for coming in late. I'm a little worried that this is too elaborate a fix for a very rare corner case...

        For example, your worst case is something like 5k jobs on a 5k-node cluster:
        5000 * 5000 * 100 bytes -> 2.5 GB!

        Admittedly that is far-fetched, yet illustrates the (remote) risk for a rare corner case.

        Robert - how many times have you seen this?

        Robert Joseph Evans added a comment -

        We saw this in a few cases recently, but only a small handful. The error message does not say which job it is that cannot be scheduled, so we have to manually look at all of the jobs to try to determine which one is causing the problem. So at a minimum we need to update the error message to include the name of the job.

        The reason we are seeing this is mostly because mapreduce.reduce.input.limit is disabled on our clusters: it was causing a lot of false positives, killing jobs that would otherwise pass, which we decided was worse than trying to find and manually kill some bad jobs.

        As for your size estimate, I contend that it is way off. First of all, it is not 100 bytes; it is probably closer to 8 bytes, because we are storing a reference to a string that all 5000 failing jobs will share, but let's assume that it is 100 bytes. For 5000 jobs to be accurate, that would mean that somehow 5000 map/reduce jobs, each with only 1 or 2 reducers and a HUGE amount of data going to those reducers, were launched and all are getting to the reducer stage at almost exactly the same time. I don't think that would ever happen in practice; it would take someone maliciously trying to do this, and launching 5000 jobs at all would probably be shut down by queue limits before it even got here. I looked at our clusters here and we have at most about 300 jobs running at any point in time on our very large clusters.

        So I would say, as far as an upper bound is concerned, a closer estimate would be 5000 nodes * 100 bytes * 600 jobs, which comes out to about 300 MB in the worst case. I suspect that the real worst case is going to be a lot closer to (5000 nodes * 100 bytes) + (8-byte references to strings * 5000 nodes * 80% of nodes before the job is killed * 400 jobs) = about 13 MB.

        I agree that there is some uncertainty about how this might perform at a large scale. So I will try to get some time on our large test cluster to see if I can run a gridmix simulation to verify that it is not going to cause anything bad to happen at scale, and I will also submit a patch that will just change the error message to include the job ID to make it simpler to manually kill off jobs that are stuck.

        Robert Joseph Evans added a comment -

        Attaching a patch that is just a log change as a fall back.

        Robert Joseph Evans made changes -
        Attachment MR-2324-secutiry-just-log-v1.patch [ 12488230 ]
        Robert Joseph Evans added a comment -

        The following are the results of running test-patch on the log-only change. I really would prefer to get the full fix in, but this is included for completeness.

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no tests are needed for this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        I did not add or modify any tests because it is a one-line change to a log message.

        Arun C Murthy added a comment -

        Good point about the string refs.

        Yes, I'd appreciate a scale test before we commit this.

        OTOH, I know we've tried before, but fixing reduce.input.limit (if possible) would obviate the need for this and also help clusters a lot... do you want to try to take a look at it? Thanks.

        Robert Joseph Evans added a comment -

        I did initially look at trying to fix reduce.input.limit. Currently someone has to manually guess what the value should be. What is more, this value is likely to need to change as the cluster fills up with data and as data is deleted off the cluster. If it is wrong, then either too many jobs fail that would have succeeded, or some jobs, probably a very small number, will starve and never finish.

        To fix it, Hadoop would have to set reduce.input.limit dynamically, and the only way I can think of to do that would be to gather statistics about all the nodes in the cluster and try to predict how likely it is that a particular reduce will ever find the space it needs on a node. I believe that we can compute the mean and an X% confidence interval for disk space on the cluster without too much difficulty, but I have my doubts that this will apply to a small cluster. From what I read, anything under 40 samples tends to be suspect, so it might not work for a cluster under 40 nodes. Also, I am not sure how the statistics would apply to this particular situation. Would we want to compute this based on a recent history of the cluster or just a snapshot of its current state? If there is history, how far back would we want to go, and how would we handle some nodes heartbeating more regularly than others? I am not a statistician and I could not find one to look over my work, so instead I decided to take a bit more of a brute-force approach that I know will work.

        If you know a statistician that could provide a robust solution to this problem or at least tell me what if anything I am doing wrong then I am very happy to implement it.

        Robert Joseph Evans added a comment -

        I have been able to run gridmix on a 10-node cluster, and everything looks stable. I have not been able to run it on anything larger because the processes here are really not set up to do that very easily. The process in the past has been to run gridmix at scale after the branch is in QA, not before, so the tools are not set up to deploy from a dev branch. Plus I have to get approval from lots of people to make that happen. I am trying to see if I can still do it, but I am not very hopeful that it will happen any time soon.

        Arun C Murthy added a comment -

        Robert - the problem with reduce.input.limit was not the 'right' value for the constant, but the fact that 'guessing' the reduce input was broken.

        For now, should we commit the logging change while you investigate if we can fix the 'guess'?

        Robert Joseph Evans added a comment -

        +1, that should be enough to unblock us on the sustaining release, and we can look at what the correct thing to do in YARN is.

        Arun C Murthy added a comment -

        On second thought, since ResourceEstimator.getEstimatedReduceInputSize is (as of now) broken, one option is to not use it when comparing against the available space on a TT. Should we just disable that check?
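        For illustration, a minimal sketch of what keeping that comparison behind a switch could look like, rather than removing it outright. The flag and the simplified signature are assumptions made for this sketch; the committed patch simply drops the check.

            // Hypothetical sketch only; not the committed patch.
            public class ReduceSpaceCheck {

              /** Decide whether a reduce may be scheduled on a tracker with the given free space. */
              static boolean canScheduleReduce(boolean checkEnabled,
                                               long estimatedReduceInputBytes,
                                               long trackerFreeBytes) {
                if (!checkEnabled) {
                  // With the check disabled, scheduling proceeds regardless of the estimate,
                  // which matches the behavior chosen for the fix.
                  return true;
                }
                return estimatedReduceInputBytes <= trackerFreeBytes;
              }

              public static void main(String[] args) {
                long oneTb = 1L << 40;
                long fiveHundredGb = 500L << 30;
                System.out.println(canScheduleReduce(true, oneTb, fiveHundredGb));  // false
                System.out.println(canScheduleReduce(false, oneTb, fiveHundredGb)); // true
              }
            }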

        Robert Joseph Evans added a comment -

        That would also fix the problem. I should be able to have a patch for that very quickly.

        Koji Noguchi added a comment -

        Should we just disable that check?

        +1

        Robert Joseph Evans made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Robert Joseph Evans added a comment -

        I should have the test-patch results soon.

        Robert Joseph Evans made changes -
        Attachment MR-2324-disable-check-v2.patch [ 12488474 ]
        Robert Joseph Evans made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Robert Joseph Evans added a comment -

        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no tests are needed for this patch.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        I did not add any tests because the change disables something that did not have any tests to begin with.

        Arun C Murthy added a comment -

        I just committed this. Thanks Robert!

        Arun C Murthy made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Matt Foley added a comment -

        Closed upon release of 0.20.205.0

        Matt Foley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Subroto Sanyal added a comment -

        Hi Todd, Murthy, Robert

        This issue targets the problem for reducers.
        As per the committed fix, I can see that the check against ResourceEstimator.getEstimatedReduceInputSize has been removed from findNewReduceTask.
        I have the following questions about the committed fix:

        • What about the same problem occurring for mappers?
        • Say, for example, that only one TaskTracker has low disk space. With the fix we go ahead and assign the reduce task to it, which ends up in failure; that is one failure the check could have prevented.

        Regards,
        Subroto Sanyal

        Robert Joseph Evans added a comment -

        I believe that the same issue could happen on the mapper side, but I have never seen it actually happen. We did see it happen on the reducer side in a few instances, which is why I put in the patch. We could do something similar on the mapper side, but it was not as urgent.

        Yes, in theory, if getEstimatedReduceInputSize worked correctly and there was only one node with low disk space, then that failure could have been avoided. I originally looked at implementing something that would try to assign the task to several different nodes before giving up. But at what point do we say that we have tried enough? Answering that question, along with the memory requirements of doing that bookkeeping for every single task attempt, resulted in a very complicated solution. This fix was selected because it is very simple compared to the other options (less risk of breaking something). Also, getEstimatedReduceInputSize has some issues with it; Arun can describe them better than I can. He wanted to push for us to address those issues as the ultimate fix for this.

        I hope that helps.
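        For illustration, a minimal sketch of the rejected alternative described above: count, per reduce task, how many TaskTrackers have declined it for lack of space, and fail it once a threshold is reached. The class, threshold, and bookkeeping map are hypothetical; they are shown only to make the extra per-attempt state concrete.

            // Hypothetical sketch of the alternative discussed above, not the committed fix.
            import java.util.HashMap;
            import java.util.Map;

            public class ReduceRejectionTracker {
              private final Map<String, Integer> rejectionsByTask = new HashMap<>();
              private final int maxRejections;

              public ReduceRejectionTracker(int maxRejections) {
                this.maxRejections = maxRejections;
              }

              /** Record that a tracker declined this task; returns true if the task should now fail. */
              public synchronized boolean recordRejection(String taskId) {
                int count = rejectionsByTask.merge(taskId, 1, Integer::sum);
                return count >= maxRejections;
              }

              public static void main(String[] args) {
                ReduceRejectionTracker tracker = new ReduceRejectionTracker(3);
                System.out.println(tracker.recordRejection("task_r_000001")); // false
                System.out.println(tracker.recordRejection("task_r_000001")); // false
                System.out.println(tracker.recordRejection("task_r_000001")); // true
              }
            }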

        Transition                    Time In Source Status   Execution Times   Last Executer          Last Execution Date
        Patch Available -> Open       18d 3h 25m              3                 Robert Joseph Evans    01/Aug/11 23:27
        Open -> Patch Available       150d 53m                4                 Robert Joseph Evans    01/Aug/11 23:28
        Patch Available -> Resolved   20m 59s                 1                 Arun C Murthy          01/Aug/11 23:49
        Resolved -> Closed            78d 1h 36m              1                 Matt Foley             19/Oct/11 01:26

          People

          • Assignee: Robert Joseph Evans
          • Reporter: Todd Lipcon
          • Votes: 0
          • Watchers: 13
