Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

        Activity

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Moving bugs out of previously closed releases into the next minor release 2.8.0.

        cwelch Craig Welch added a comment -

        So, I think it will be very problematic to move the unregistration of the job ahead of the upload of the job history logs. As far as I know, the grace period is just the application master waiting and still accepting requests; the resource manager immediately begins forwarding clients at unregistration. That means that if we unregister and then upload the job history file, we will definitely have a time period where clients will be sent to the job history server and will fail. Also, resource manager restarts are far less frequent than job completion/check occurrences; we don't want to cause problems with the latter to improve the situation with the former (I think Zhijie Shen & Jason Lowe made this point above, and I concur). I think the solution needs to be something like a rollback, where a job can change its state back to one which causes the client to go to the AM again. While clients looking directly at the job history server may still get different results, clients going through the RM to get at state will again be directed to the AM, which has newly restarted. We could also provide a mechanism to allow the AM to purge its state from the job history server, if this were a significant concern, to achieve full "state correctness" for this case.
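Craig's rollback idea might be sketched roughly like this (illustrative Python, not actual YARN/Hadoop code; `JobRecord` and `route_client` are hypothetical stand-ins): undoing the terminal state after an AM restart makes the RM route clients back to the AM.

```python
AM, JHS = "application-master", "job-history-server"

class JobRecord:
    """Hypothetical stand-in for the per-job state the RM tracks."""
    def __init__(self):
        self.state = "RUNNING"

    def unregister(self):
        self.state = "FINISHED"

    def rollback(self):
        # A new AM attempt started after a failed finish: route clients
        # back to the AM, even though the JHS may already hold a file.
        self.state = "RUNNING"

def route_client(job):
    # The RM forwards clients to the JHS only for a terminal state.
    return JHS if job.state == "FINISHED" else AM

job = JobRecord()
job.unregister()
assert route_client(job) == JHS
job.rollback()                    # AM restarted; undo the terminal state
assert route_client(job) == AM    # clients are directed to the AM again
```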

        revans2 Robert Joseph Evans added a comment -

        The question comes down to where the single source of truth is. When we wrote this code to avoid any split-brain problems, we decided that the single source of truth should be the state of the app stored in HDFS, primarily because we did not want to change any of the RM APIs to allow for an extra application state where the job succeeded but the application is not done yet. We did this because the MR client APIs only used the RM to determine whether they should talk to the AM or the history server, and we assumed that everyone would use the MR APIs or the _SUCCESS file. And as Jason has pointed out, it keeps most cleanup operations happening in a state where they can be retried on an error.

        If you feel that you need to switch the source of truth to the RM, that is fine, but we still need a way to keep that state. Perhaps let the application be in a "non-critical cleanup" state: the application has succeeded or failed, and that information is stored in the RM, but the application is still performing cleanup operations. This would allow the app to be relaunched if needed, but if it crashes and runs out of attempts, the RM can give up and still record that the app succeeded, but with some errors during cleanup.
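Bobby's "non-critical cleanup" state could be sketched as follows (a minimal illustrative Python sketch; `AppRecord` and the state names are hypothetical, not RM code). The key property is that the verdict is recorded before cleanup starts, so exhausted cleanup attempts cannot erase it.

```python
class AppRecord:
    """Hypothetical RM-side record with a 'non-critical cleanup' state."""
    def __init__(self):
        self.verdict = None            # SUCCEEDED / FAILED, set once
        self.state = "RUNNING"
        self.cleanup_errors = []

    def finish(self, verdict):
        # The final verdict is stored in the RM *before* cleanup runs.
        self.verdict = verdict
        self.state = "NONCRITICAL_CLEANUP"

    def give_up_cleanup(self, error):
        # Out of attempts: record the cleanup error, keep the verdict.
        self.cleanup_errors.append(error)
        self.state = "DONE"

app = AppRecord()
app.finish("SUCCEEDED")
app.give_up_cleanup("history copy failed")
# The app still reports success, annotated with cleanup errors.
assert app.verdict == "SUCCEEDED" and app.cleanup_errors
```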

        jlowe Jason Lowe added a comment -

        We previously observed the case where the RM is restarting, so the MR job fails at unregistration; it then starts a second attempt and is running. On the other side, the JHS already shows the job as finished successfully.

        Yes, the states can get out of sync, but does this actually break anything (e.g.: Oozie, Pig, something that will do something bad based on that out-of-sync state)? As Bobby mentioned with the _SUCCESS file and other out-of-band client notifications, we have this problem today and cannot solve it completely.

        zjshen Zhijie Shen added a comment -

        Is there a real-world use case that having history reporting success but RM reporting running/failed is going to break something?

        We previously observed the case where the RM is restarting, so the MR job fails at unregistration; it then starts a second attempt and is running. On the other side, the JHS already shows the job as finished successfully.

        jlowe Jason Lowe added a comment -

        IMHO, job status and the RM being out of sync is something that can always happen. This is similar to Bobby's point about the _SUCCESS file and how killed jobs are handled. At this point in the job flow, the next AM attempt should not re-commit any output (due to the existence of the committing file). Is there a real-world use case where history reporting success but the RM reporting running/failed is going to break something?

        zjshen Zhijie Shen added a comment -

        Jason, thanks for your input. For the first race condition, we are still able to handle it by changing getProxy() a bit, because the AM is still alive for a while after the RM says the app is finished. However, your second concern is a much stronger argument against moving the history file after unregistration.

        I used to think about another compromise, which may not completely solve the issue but can largely reduce its occurrence: if unregistration fails, MRAppMaster goes to the intermediate done dir to delete the copied history file. However, it cannot handle the other race condition, where the JHS moves the history file to the done dir before it gets deleted. Hopefully, this case should be quite rare. Any ideas? I attached a sample patch to demonstrate this approach.
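The compromise Zhijie describes, and its residual race, can be sketched like this (illustrative Python with hypothetical paths and names; the real code would operate on HDFS, not a local filesystem):

```python
import os
import shutil
import tempfile

def cleanup_after_failed_unregister(intermediate_done_dir, history_file):
    """If unregistration failed, undo the copy of the history file.

    Returns "deleted" on success, or "already-moved" when the JHS has
    already moved the file to the done dir -- the race we cannot close.
    """
    path = os.path.join(intermediate_done_dir, history_file)
    if os.path.exists(path):
        os.remove(path)
        return "deleted"
    return "already-moved"

tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "job_1.jhist"), "w").close()
assert cleanup_after_failed_unregister(tmp, "job_1.jhist") == "deleted"
# Second call simulates the JHS having already taken the file away:
assert cleanup_after_failed_unregister(tmp, "job_1.jhist") == "already-moved"
shutil.rmtree(tmp)
```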

        jlowe Jason Lowe added a comment -

        The client can still miss the history, and the realProxy cache only applies to existing clients. Here's the scenario:

        • Job finishes and unregisters with the RM then begins copying the history file to done_intermediate
        • While that occurs a client comes along to check the counters of the job. To do this, it must first contact the RM to check the job state to see whether it should contact the AM or the history server.
        • RM reports job has finished, so client goes to history server
        • History server doesn't have the file yet since AM hasn't completed copying it

        Besides this race, my other concern is that we're piling up tasks in the non-fault-tolerant portion of the job that are important to the user, namely providing history. Copying the history file is an operation that can take substantial time (e.g.: slow datanode), and the AM can fail before/during that operation. If we do this after we unregister then the RM will not retry and there will be no history. If we do it before we unregister then if the AM fails it will retry, the retry will realize there's nothing left to do but resume attempting to copy the history over to the history server, and we have some fault tolerance there.
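The race in the scenario above can be written out as an explicit event ordering (a minimal illustrative Python sketch, not Hadoop code): if unregistration precedes the history-file copy, there is a window in which the RM routes a new client to the JHS before the file exists there.

```python
def client_lookup(rm_says_finished, jhs_has_history):
    """What a fresh client sees, given the RM's and JHS's state."""
    if not rm_says_finished:
        return "ask-AM"        # RM still routes the client to the AM
    return "history" if jhs_has_history else "MISS"

# Order of events in the unregister-first flow:
rm_says_finished, jhs_has_history = False, False

rm_says_finished = True        # 1. AM unregisters; copy not done yet
# 2. A client races in during the copy and misses the history:
assert client_lookup(rm_says_finished, jhs_has_history) == "MISS"

jhs_has_history = True         # 3. copy to done_intermediate completes
assert client_lookup(rm_says_finished, jhs_has_history) == "history"
```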

        zjshen Zhijie Shen added a comment -

        This issue has been left for a while. According to the previous discussion, as to the problem where the history file is available on the JHS while the AM fails unregistration, the straightforward solution is to move the history file to the intermediate done dir after unregistration. If we don't do that, and the job has a second attempt and that second attempt fails, we will still see a successful job on the JHS, because according to my investigation, the JHS will ignore the duplicate history file.

        Jason has the concern that if we do that, the client will be redirected to the JHS before the history file is copied. I've done more investigation. The client service seems to be the last one to be stopped by MRAppMaster. On the other side, realProxy will be cached until the connection fails. Therefore, we just need to make sure we copy the history file to the intermediate done dir after unregistration and before stopping the client service, and we should still be safe. Please correct me if I'm wrong.

        Any thoughts?
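The ordering Zhijie proposes relies on the proxy cache: an existing client keeps talking to the AM until a connection actually fails, so the copy can happen after unregistration but before the client service stops. A toy simulation (illustrative Python; `CachedClient`, `am_proxy`, and `resolve` are hypothetical stand-ins for the real realProxy logic in the MR client):

```python
class CachedClient:
    """Hypothetical client that caches its proxy until a connect fails."""
    def __init__(self, resolve):
        self.resolve = resolve         # asks the RM: talk to AM or JHS?
        self.proxy = None

    def status(self):
        if self.proxy is None:
            self.proxy = self.resolve()
        try:
            return self.proxy()
        except ConnectionError:
            self.proxy = self.resolve()   # re-resolve only on failure
            return self.proxy()

am_up = True

def am_proxy():
    if not am_up:
        raise ConnectionError("AM client service stopped")
    return "from-AM"

def resolve():
    # After unregistration the RM points NEW lookups at the JHS, but an
    # existing client holds its cached proxy until a connect fails.
    return am_proxy if am_up else (lambda: "from-JHS")

c = CachedClient(resolve)
assert c.status() == "from-AM"    # cached proxy; the copy can run now
am_up = False                     # AM finally stops its client service
assert c.status() == "from-JHS"   # client re-resolves only after failure
```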

        revans2 Robert Joseph Evans added a comment - edited

        Why is it harmful for the AM and RM to have different knowledge? I can see it can be confusing, but we already live with it for the kill command. The AM does not tell the RM it was killed so that it can update the state; it looks like it succeeded to the RM. Why is this so critically different?

        What we want is to preserve as consistent a view for the end user as possible. But there are too many APIs that the end user can query for them to be totally consistent. There is the MR client API (probably the most critical), the job end notifier (which is documented as best effort), the _SUCCESS file from the output format (which is not optional), the AM/history server web service/web UI, and finally the RM UI.

        Ideally, the 6 steps should be consistent.

        The only way to do this is to have a single source of truth, where all other interfaces yield to that source of truth in the case of inconsistency, and do not report a final status until it is in a terminal state. Along with that we would ideally want it to be the last thing done, to be sure that everything happened correctly.

        The problem is the _SUCCESS file. We have no control over it. It is part of the output format, and we cannot change it and preserve compatibility. As such, it has to be the true indication of success or failure of a job. To work around this and minimize any possibility of inconsistency, didn't we add in a final status file when we put in the split-brain fix? When the AM restarts after a failure, I thought it checked for this, and if the job were in a terminal state it would not rerun anything; it would just try to copy the history file and do the appropriate unregister. What this does is give us more chances to recover from whatever failure may have happened the first time (like not being able to move the history file). This is why I think we want to unregister last, or we will only have one chance before we are in an inconsistent state.

        I thought we tested that the history server can handle a duplicate history file showing up. So the only time that we really run into a situation where the job succeeded and failed at the same time is when the AM failed to move the history file 4 times. I don't believe we have seen this inconsistency happen since we put in the split-brain fix.

        The only other thing we can do is deprecate the output committer's ability to signal success; have the AM not report success or failure but delegate that to the RM/history server, which will slow down jobs significantly; move job end notification to the history server; and have the history server query the RM before it reports back anything (along with storing that result somewhere, because the RM forgets things faster than the history server does).

        zjshen Zhijie Shen added a comment -

        On shutting down an AM, there is the following work:

        1. Finish the OutputCommitter
        2. Move the history file to AHS (maybe move to after unregister in this JIRA)
        3. Unregister
        4. Delete the staging dir
        5. Send the job end notification
        6. The implicit step of returning the final status to the client

        Ideally, the 6 steps should be consistent. However, each step may fail, and it does not seem possible to make them a transaction that succeeds or fails as a whole. Nevertheless, IMHO, we should do as much as we can to ensure the consistency of these steps.

        Among the six steps, the most critical one is unregistration (correct me if I'm wrong), because it is the only step that syncs with the RM. It is most harmful when the AM and RM have different knowledge of the conclusion of the application. For this reason, unregistration should be considered the principal step, and how the other steps behave should depend on the result of this step. Therefore, IMHO, unregistration should be the first step to complete. On unregistration success, the following steps execute the ordinary logic, while on unregistration failure, the following steps handle the exception (e.g., not moving the job history file, not sending the job end notification, etc.).

        As Jason Lowe mentioned, moving the job history file may fail. That's right, but the failure is independent of whether it happens before or after unregistration. Right now, moving the job history file happens before unregistration. If moving the job history file fails, unregistration will not be invoked, and the application may be concluded as FAILED. This is not reasonable. Similarly, no step other than unregistration should be the reason for failing an application. Their failures should be isolated, such that the AM can proceed to the end.

        To sum up, IMHO, unregistration should be completed first and should be the step that judges the final state of the application. Given the result of unregistration, the other steps decide what they should do, and the client sees the final state. The other steps may or may not fail, but any failure should be isolated. If, fortunately, none of the steps fail (which I guess should be the common case), the final state is consistent via every channel. If one step fails, it will only impact one part.

        Moreover, I'm not sure whether we'd like to add one more state for the AM: unregistering. Move the job to unregistering before calling unregister, and then move the job to the final state after all the steps have gone through.
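The unregister-first ordering with isolated failures could be sketched like this (illustrative Python; the step functions are hypothetical placeholders, not MRAppMaster code). Only unregistration can change the conclusion; every later failure is logged and swallowed.

```python
def shutdown(unregister, remaining_steps, log):
    """Run unregistration first; isolate failures of the other steps."""
    try:
        unregister()
    except Exception as e:
        log.append(f"unregister failed: {e}")
        return False               # the conclusion depends only on this step
    for step in remaining_steps:
        try:
            step()
        except Exception as e:     # isolated: does not fail the application
            log.append(f"{step.__name__} failed: {e}")
    return True

# Hypothetical stand-ins for the remaining shutdown steps:
def copy_history(): pass
def job_end_notify(): raise IOError("notifier URL unreachable")
def delete_staging(): pass

log = []
ok = shutdown(lambda: None, [copy_history, job_end_notify, delete_staging], log)
assert ok and len(log) == 1        # app still concludes; failure isolated
```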

        jlowe Jason Lowe added a comment -

        I'm not sure we should try to enforce YARN failure == MR failure, because I don't think it's completely enforceable. The output committer is user code that can do arbitrary things, including custom job end notification, e.g. FileOutputCommitter and the _SUCCESS file. As such, there will always be cases where downstream consumers of the job will think it succeeded and proceed as normal despite what the RM says. In addition, this change creates a couple of new problems:

        • The app can successfully unregister but fail to copy the history file, so now we have a case where the RM says the job succeeded but the history server will say ComeBackToMeLater until the client times out. Would the history server no longer have a quick way to say "I definitely don't know about that job"?
        • We're starting to pile quite a few things into the grace period, and I'm wondering if there will be enough time to get it all done if things aren't all working properly. e.g.: slow network connection when trying to do job end notification, slow datanode(s) when copying history file, etc. Deleting the staging directory must be in the grace period to allow reattempts if we crash before unregistering, but I'm not sure we need all this other stuff there as well.

        I want to make sure we're not causing more problems than we're solving. Succeeding at job end notification and copying the history file but failing to unregister should be a very rare instance, and even if it occurs, it's likely there will be a subsequent attempt that will be launched, read the previous history file, realize the job succeeded, and unregister successfully. It's only an issue if it also happens to be the last attempt, unless I'm missing something. Moving all of the MR-specific job end work to after we unregister would be setting ourselves up for increased average fault visibility. Anything that goes wrong during the grace period (e.g. AM failure/crash) will not be reattempted, since the RM thinks the app is done, whereas it would be in the current setup if there were attempts remaining. Given that anything in the grace period is very fragile, I think we want to put as few things there as possible.

        Since jobs can indicate success to downstream consumers in ways we can't always control, I think it would be better to embrace the fact that sometimes YARN state != MR state and act accordingly. I think this only requires one change to ClientServiceDelegate, as currently it assumes that a YARN state of FAILED means the job failed. The client should redirect to the history server if the app is in any terminal YARN state (i.e.: FINISHED/FAILED/KILLED) and only use the YARN state as the job state if the history server doesn't know about the job.
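The policy change Jason suggests for ClientServiceDelegate could be sketched as follows (illustrative Python; the real class is Java, and `job_status`/`jhs_lookup` are hypothetical names): redirect to the history server for any terminal YARN state, and fall back to the YARN state only when the JHS does not know the job.

```python
TERMINAL = {"FINISHED", "FAILED", "KILLED"}

def job_status(yarn_state, jhs_lookup):
    """Decide where the job's state comes from.

    jhs_lookup() returns the MR job state from the history server,
    or None if the history server has no record of the job.
    """
    if yarn_state not in TERMINAL:
        return ("AM", None)               # job still running; ask the AM
    mr_state = jhs_lookup()               # trust the JHS first
    if mr_state is not None:
        return ("JHS", mr_state)
    return ("YARN", yarn_state)           # JHS has no record; fall back

# YARN says FAILED, but the JHS recorded a successful job:
assert job_status("FAILED", lambda: "SUCCEEDED") == ("JHS", "SUCCEEDED")
# The JHS knows nothing; fall back to YARN's view:
assert job_status("KILLED", lambda: None) == ("YARN", "KILLED")
```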

        vinodkv Vinod Kumar Vavilapalli added a comment - edited

        We finally sat down and reasoned about all things in MR App that are broken because of various race conditions during RM restart. While fixing that is a larger effort, looking at this specific problem, I think we should stick to the invariant that if RM sees the app as failed, we should make sure clients also see the same.

        In such cases, where the AM has successfully finished the 'job' but failed to unregister or failed to write the history file (we ran into this also), the client will still see the job as running until the last attempt. And if the last attempt also fails for the same reason, it sees the job as failed. In corner cases we will lose work, but that's better than clients struggling with successful jobs that have no history files, or with failure information on the RM.

        That said, to really fix this issue, we should change the order of things in the AM - unregister should happen before the rest of the job-finish work. Previously we moved the JobHistory flush/close to be before unregister because we didn't have the AM grace period that we have now. Given that we now have the AM grace period, we can do the following:

        • Flush history events & close the current history file
        • unregister and loop till RM acknowledges (YARN-540)
        • If unregister fails, don't do anything any more - irrespective of whether this is the last retry or not. This is done at MAPREDUCE-5562.
        • Otherwise, if this is not the last retry, then
          • let the client loop (safeTermination flag) (MAPREDUCE-5505)
          • Don't copy the history file (TODO, this JIRA)
          • Don't send the job-end notification (MAPREDUCE-5538)
          • Don't delete the staging directory (Done)
          • Exit
        • Otherwise, this is the last retry (usual code path, all DONE)
          • copy the history file to intermediate done directory
          • send the job notification URL and
          • let the client know job succeeded/failed/killed
          • remove the staging directory.
          • Exit
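
The ordering above can be sketched as a sequence. The step names here are purely illustrative, not MRAppMaster methods:

```java
import java.util.ArrayList;
import java.util.List;

public class AmShutdownOrderSketch {
    // Returns the steps the AM would perform, given whether unregister
    // eventually succeeded and whether this is the last retry.
    static List<String> steps(boolean unregisterSucceeded, boolean lastRetry) {
        List<String> s = new ArrayList<>();
        s.add("flushAndCloseHistoryFile");       // flush events, close current file
        s.add("unregisterLoopUntilRmAck");       // YARN-540
        if (!unregisterSucceeded) {
            return s;                            // MAPREDUCE-5562: do nothing more
        }
        if (!lastRetry) {
            s.add("safeTermination");            // client keeps looping (MAPREDUCE-5505)
            // deliberately skipped: history-file copy, job-end notification,
            // staging-directory deletion
        } else {
            s.add("copyHistoryToIntermediateDone");
            s.add("sendJobEndNotification");
            s.add("reportFinalJobState");
            s.add("deleteStagingDir");
        }
        s.add("exit");
        return s;
    }
}
```

The key invariant the sketch encodes: nothing that signals completion to downstream consumers happens until the RM has acknowledged the unregister.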

        There are a couple more things that need fixing in separate JIRAs

        • AM reports final tracking URL as part of finishApplicationRequest. RM should set this URL for clients to see ONLY after the state operations are done
        • There is a duration during which the AM has unregistered successfully with the RM but hasn't yet copied the history file to the intermediate done directory, so the JHS cannot serve this job to the client. We should change the JHS to throw a ComeBackToMeLaterException, with the client retrying until the history file becomes available or a timeout expires.
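
Client-side, the ComeBackToMeLater handling would amount to a bounded poll loop, something like the following sketch (the probe interface is hypothetical; in the proposal the JHS would signal unavailability via the exception instead):

```java
public class JhsRetrySketch {
    // Hypothetical stand-in for "does the JHS have the hist file yet?"
    interface JhsProbe { boolean jobAvailable(); }

    // Poll until the job shows up on the JHS or the timeout expires.
    static boolean waitForJob(JhsProbe jhs, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (jhs.jobAvailable()) {
                return true;                     // hist file arrived
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;                    // give up after the timeout
            }
            Thread.sleep(pollMs);
        }
    }
}
```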

        Few more things that need more thought

        • _SUCCESS files and their creation time given the above design
        • The commit start and end files and their cleanup
        jlowe Jason Lowe added a comment -

        Therefore, IMHO, it's not good to fix a bug in a rare case at the cost of troubling the common case.

        Exactly, that's what I'm worried about.

        There is no general solution to the client-sees-app-succeed-but-app-subsequently-fails-to-unregister problem. There are tons of ways a client can be notified of job success besides the standard JobClient query (e.g.: the _SUCCESS file generated by FileOutputFormat). The output committer is user-defined code and can do arbitrary things. The main thing is to ensure the subsequent AM attempt, if there is one, does not delete/corrupt the data of the previous successful attempt. That's why the subsequent AM checks the jhist file for a successful commit from the previous attempt and, if that's the case, unregisters with a success code without doing much else. I think that's the best we can do. Clients trying to check job status via JobClient or the proxy URL will be redirected to the history server and see that the job succeeded. The only oddity is that, if the issue occurred on the last AM attempt, the RM will report the app as failed even though the job succeeded in the MR sense. It should be a rare case but can happen, and we cannot prevent all cases of a client seeing an MR job succeed while the RM reports it as failed.

        jianhe Jian He added a comment -

        but it can eventually show up after the history data is copied to done_intermediate.

        Correct, I retract my comment; the worst case is that the JobClient may just crash if it cannot retrieve the job status from the JHS when it resorts to the JHS after the app finishes.

        zjshen Zhijie Shen added a comment -

        I thought about the problem again. This jira describes the case where the AM has moved the job history file to the JHS, and then somehow failed at unregister. Then the 2nd AM attempt will start, but from the point of view of the JHS, the user will see the job as already finished. Actually, it should be a corner case.

        If we changed to unregistering the AM before moving the history file to the JHS, it would happen that the RM tells the client that the job is already finished, and the client will resort to the JHS to get the information. However, the history file may not have arrived at the JHS yet (copying big files may take time), so the job information will be unavailable for some time. Unfortunately, if we did this change, the unavailability of the job information would become a common case around every job's completion, no matter whether the jobs succeed, fail, or reboot.

        Therefore, IMHO, it's not good to fix a bug in a rare case at the cost of troubling the common case. Probably we need to find some other approach.

        jianhe Jian He added a comment -

        The problem being that the history data can be copied to the done_intermediate directory and then unregister fails. Then the AM is relaunched, but the user already sees the finished status of the job in the history server

        If the history server has already moved it from done_intermediate to done then the history server could either re-update the history with the new copy in done_intermediate or simply delete the redundant copy in done_intermediate.

        we can do this, but the user will still see the finished status of the job after the 1st AM unregisters; it's just that the status won't be updated until the next AM finishes.

        If we unregister before copying the history data to the done_intermediate directory then the client could try to query the history server before the AM has had a chance to copy the jhist file.

        Yes, the job in the history server may be missing for some time, but it can eventually show up after the history data is copied to done_intermediate.

        jlowe Jason Lowe added a comment -

        Won't this create a race condition in the client where the job can be lost? Currently the client relies on the job state as reported by the RM to know where to query for job status. If the RM reports the job as running it will try to query the AM, otherwise it will try the history server. If we unregister before copying the history data to the done_intermediate directory then the client could try to query the history server before the AM has had a chance to copy the jhist file.

        The AM needs to move the file to the done_intermediate directory before unregistering or we create this race. If the concern is that the unregister could fail and the AM could re-upload the jhist file, we could make the AM smarter when it goes to upload (e.g.: check for it already existing in the done_intermediate directory before copying). If the history server has already moved it from done_intermediate to done then the history server could either re-update the history with the new copy in done_intermediate or simply delete the redundant copy in done_intermediate.
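
The "check before re-upload" idea could look roughly like this. The sketch uses java.nio.file on a local filesystem as a stand-in for the HDFS FileSystem calls the real AM would make; all paths and names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentJhistUploadSketch {
    // Copy the jhist file into done_intermediate only if a previous attempt
    // hasn't already put it there; returns whether a copy actually happened.
    static boolean uploadIfAbsent(Path jhist, Path intermediateDone) throws IOException {
        Path target = intermediateDone.resolve(jhist.getFileName());
        if (Files.exists(target)) {
            return false;                        // earlier attempt already uploaded it
        }
        Files.copy(jhist, target);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("jhist-demo");
        Path done = Files.createDirectory(tmp.resolve("done_intermediate"));
        Path jhist = Files.writeString(tmp.resolve("job_0001.jhist"), "events");
        System.out.println(uploadIfAbsent(jhist, done));   // first attempt copies
        System.out.println(uploadIfAbsent(jhist, done));   // re-run is a no-op
    }
}
```

This makes the upload idempotent across AM attempts, so a re-launched AM that finds the file already in done_intermediate simply skips the copy.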

        vinodkv Vinod Kumar Vavilapalli added a comment -

        What is really needed is that the events can be flushed, but the file isn't moved to the intermediate done directory till the unregister succeeds.


          People

          • Assignee: zjshen Zhijie Shen
          • Reporter: zjshen Zhijie Shen
          • Votes: 0
          • Watchers: 13