Hadoop Map/Reduce: MAPREDUCE-5641

History for failed Application Masters should be made available to the Job History Server

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Labels: None

      Description

      Currently, the JHS has no information about jobs whose AMs have failed. This is because the History is written by the AM to the intermediate folder just before finishing, so when the AM fails for any reason, this information isn't copied there. However, it is not lost, as it's in the AM's staging directory. To make the History available in the JHS, all we need is another mechanism to move the History from the staging directory to the intermediate directory. The AM also writes a "Summary" file before exiting normally, which is likewise unavailable when the AM fails.
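
      As a rough illustration of the kind of move described above (not taken from the attached patches), the copy itself only needs the standard FileSystem API. The paths below are examples; the real locations come from yarn.app.mapreduce.am.staging-dir and mapreduce.jobhistory.intermediate-done-dir.

        // Hypothetical sketch only: copy a .jhist file from the AM's staging
        // directory to the intermediate done directory so the JHS can pick it up.
        // The paths and file names are illustrative, not the patch's actual layout.
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.FileUtil;
        import org.apache.hadoop.fs.Path;

        public class MoveHistoryToIntermediate {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path staged = new Path(
                "/user/alice/.staging/job_1385000000000_0042/job_1385000000000_0042_1.jhist");
            Path intermediate = new Path(
                "/mr-history/intermediate/alice/job_1385000000000_0042.jhist");
            // deleteSource=true makes the copy behave like a move once it succeeds.
            FileUtil.copy(fs, staged, fs, intermediate, true, conf);
          }
        }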

      Attachments

      1. MAPREDUCE-5641.patch (27 kB, Robert Kanter)
      2. MAPREDUCE-5641.patch (27 kB, Robert Kanter)


          Activity

          Vinod Kumar Vavilapalli added a comment (edited)

          Unless I am missing something, I still don't understand why my proposal here, of making the JHS talk to the RM about the application information, is not enough to begin with. It can be extended in the future to talk to the AHS to obtain more information.

          To your question about scale, Jason did answer that it can be done on demand for only those apps which don't have history files.

          Karthik Kambatla added a comment

          Thinking more about this, I am slightly wary of using AHSClient or the store directly for this, before we iron out any rough edges and mark them stable.

          Vinod Kumar Vavilapalli, Zhijie Shen - do you think it is reasonable to let this go through for now, even though it is not the cleanest approach and adds duplicate code? Once the AHS is stable, we can follow up by removing the flag-file parts in YARN and updating the JHS parts of the code to use the AHS instead of the flag file.

          Zhijie Shen added a comment

          So it sounds like instead of doing YARN-1731 to make the RM write a little flag file that the JHS can check for, we can have the JHS check this store just like the AHS is doing. That should be cleaner.

          It could be an option, but it depends on what information you want. According to my previous understanding, you plan to inspect the jhist file and probably look for MR-specific information, such as map, reduce, shuffle, and merge details. That cannot be obtained from the AHS. In contrast, other generic information, such as start time, finish time, and host, can be obtained from the AHS. Perhaps you can choose to recover part of the information for a failed MR AM now, and make a complete recovery once MR reports its framework-specific information to the timeline service.

          What is the store that it's using? And where can I find out more about it or its API so I can update this patch to use it?

          The suggested way to access the information is not to read from the store directly, but to use AHSClient or the web services, assuming you are going to do this programmatically.
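
          For illustration only (not part of this patch), programmatic access through AHSClient would look roughly like the sketch below; it returns only the generic fields mentioned above, and the class and calls are a sketch, not the JHS code.

            import org.apache.hadoop.yarn.api.records.ApplicationId;
            import org.apache.hadoop.yarn.api.records.ApplicationReport;
            import org.apache.hadoop.yarn.client.api.AHSClient;
            import org.apache.hadoop.yarn.conf.YarnConfiguration;
            import org.apache.hadoop.yarn.util.ConverterUtils;

            public class AhsLookup {
              public static void main(String[] args) throws Exception {
                AHSClient ahs = AHSClient.createAHSClient();
                ahs.init(new YarnConfiguration());
                ahs.start();
                try {
                  // e.g. application_1385000000000_0042
                  ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
                  ApplicationReport report = ahs.getApplicationReport(appId);
                  // Only generic information is available here: no MR counters,
                  // no map/reduce/shuffle details.
                  System.out.println(report.getName()
                      + " started=" + report.getStartTime()
                      + " finished=" + report.getFinishTime()
                      + " host=" + report.getHost()
                      + " finalStatus=" + report.getFinalApplicationStatus());
                } finally {
                  ahs.stop();
                }
              }
            }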

          Robert Kanter added a comment

          So it sounds like instead of doing YARN-1731 to make the RM write a little flag file that the JHS can check for, we can have the JHS check this store just like the AHS is doing. That should be cleaner.

          What is the store that it's using? And where can I find out more about it or its API so I can update this patch to use it?

          Zhijie Shen added a comment

          Ah, sorry, I said the wrong word. It should be finished, failed, or killed. If the AM crashes and there are no more retries, the application will be failed, right? The AHS records the information from the RM's point of view.

          Please excuse my ignorance about AHS. What is the source of applications for the AHS? Does it periodically poll the RM? Or, does the RM trigger something on the completion of an app or its attempts?

          The AHS doesn't query the RM. Instead, the RM pushes the information to a store that the AHS can read. The information is pushed as events before the application life cycle completes, no matter whether it ends up finished, failed, or killed.

          Karthik Kambatla added a comment

          This JIRA is not aimed at applications that have finished or been removed or killed. I guess the issue here is those AMs that crash - so the AMs don't leave any information about their existence. In that case, the JHS wouldn't know about them and hence won't show them.

          Please excuse my ignorance about AHS. What is the source of applications for the AHS? Does it periodically poll the RM? Or, does the RM trigger something on the completion of an app or its attempts?

          Zhijie Shen added a comment

          could you point us to how the AHS gets this information for AMs that crash. We might be able to re-use some of that if the RM side of things for doing this is stable.

          Whether an application is finished, removed, or killed, it is supposed to be recorded by the AHS. However, it depends on what you need. If you're looking for the generic information, the AHS should meet your requirement. Otherwise, you still need a workaround until MR's per-framework information can be recorded.

          Jason Lowe added a comment

          I originally thought that as well but then wondered if the query was to be lazily performed. It would query the RM when asked for a particular job for which it could not find the jhist in either done or done_intermediate. That would solve the issue for providing a specific job's history but not the use-case of browsing for it.

          Karthik Kambatla added a comment

          Vinod Kumar Vavilapalli - could you point us to how the AHS gets this information for AMs that crash. We might be able to re-use some of that if the RM side of things for doing this is stable.

          Karthik Kambatla added a comment

          Instead of adding new functionality, can the JHS simply ask the RM about the application status? Why would that not work?

          That would work, but on a cluster with say 10,000 running apps, the JHS would query the status of each app or fetch all apps every so often. It is nicer to avoid the poll model, no?

          Vinod Kumar Vavilapalli added a comment

          Vinod Kumar Vavilapalli, I'm a bit reluctant to get the JHS to depend on the AHS at this point as the AHS is not fully cooked. I would prefer dropping the JHS altogether in favor of the AHS when the AHS is ready for prime time with AM extensions.

          The problem is that as I understand it, this JIRA requires corresponding changes in YARN via YARN-1731. It doesn't make sense to add duplicate functionality in YARN.

          Instead of adding new functionality, can the JHS simply ask the RM about the application status? Why would that not work? Clearly, if the RM goes down and comes back up, it may lose history, but for that you need to enable the state store anyway. Otherwise, it should work for the most part. Thoughts?

          Robert Kanter added a comment

          The new patch has the JHS proxy as the user to do the copying, though you need to add the user running the JHS as a proxy user in core-site.xml. The patch also fixes a few bugs I found while testing. I've verified that it works correctly on a Kerberized cluster. I've also updated the corresponding patch in YARN-1731.
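
          For reference, the core-site.xml change being described is the standard proxy-user setting; the user name "mapred" and the host below are assumptions about the deployment, not values taken from the patch.

            <!-- Allow the JHS daemon user (assumed to be "mapred") to impersonate
                 job owners. Narrow the hosts/groups values to fit the cluster;
                 "*" is shown only for brevity. -->
            <property>
              <name>hadoop.proxyuser.mapred.hosts</name>
              <value>historyserver.example.com</value>
            </property>
            <property>
              <name>hadoop.proxyuser.mapred.groups</name>
              <value>*</value>
            </property>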

          Alejandro Abdelnur added a comment

          Yep, I meant that. The JHS is trusted code; no user code runs there. The doAs with the proxy user would be used only for this case. Also, all of this would go away when the AHS is ready to take over.
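
          A rough sketch of that doAs pattern, assuming the proxy-user configuration shown earlier is in place (a hypothetical helper, not code from the patch):

            import java.security.PrivilegedExceptionAction;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.security.UserGroupInformation;

            public class ProxyUserRead {
              // Lists a job's staging directory while impersonating the job owner.
              static FileStatus[] listStagingAsUser(final String jobOwner,
                  final Path stagingDir, final Configuration conf) throws Exception {
                UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                    jobOwner, UserGroupInformation.getLoginUser());
                // Everything inside doAs runs with the job owner's HDFS identity,
                // provided core-site.xml authorizes the JHS user as a proxy user.
                return proxyUgi.doAs(new PrivilegedExceptionAction<FileStatus[]>() {
                  @Override
                  public FileStatus[] run() throws Exception {
                    FileSystem fs = FileSystem.get(conf);
                    return fs.listStatus(stagingDir);
                  }
                });
              }
            }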

          Jason Lowe added a comment

          In theory you could make the JHS able to proxy as users in HDFS so it can read the necessary files in the staging directory, if that's what you intended to suggest. Not sure I'm thrilled with the JHS having the ability to do anything in HDFS, but it should work.

          Jason Lowe added a comment

          how about not touching the current permissions of staging and making the RM a proxy user in HDFS? Then the files would be written as the user.

          The issue is not the permissions of the proposed file the RM would write, but rather the permissions of the .jhist and job.xml files written by the job. Those are already owned by the user, and the RM isn't involved at all. The issue with the originally proposed approach is that the JHS is not the user and therefore cannot access the necessary files to place them in the proper locations after the job completes (something the AM normally does).

          Alejandro Abdelnur added a comment

          Robert Kanter, Jason Lowe, how about not touching the current permissions of staging and making the RM a proxy user in HDFS? Then the files would be written as the user.

          Vinod Kumar Vavilapalli, I'm a bit reluctant to get the JHS to depend on the AHS at this point as the AHS is not fully cooked. I would prefer dropping the JHS altogether in favor of the AHS when the AHS is ready for prime time with AM extensions.

          Vinod Kumar Vavilapalli added a comment

          Haven't yet read the discussion, but overall, we don't need yet another solution for this. YARN-321 is already enabling generic history and so has a record of killed/failed applications. If we need a fix at all:

          • For the short term, we should make JHS invoke web-services on RM and/or AHS to obtain this information.
          • Medium/longer term, the generic data and timeline data (YARN-1530) will merge to expose all information about apps via web-services. And JHS (if it still exists by that time) should just use them.
          Jason Lowe added a comment

          Do you have any alternatives on how to allow the JHS to have access to those files?

          Outside of imposing new restrictions on where the staging directory can be and how it has to be configured, no I don't know of an easy way to do that. To allow the JHS to access these files, we'd minimally have to require the user directories in the staging area to have their group set to the "hadoop" group (see http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Running_Hadoop_in_Secure_Mode for details on that group) and have permissions of 0750 all the way down to the specific staging directory for a job. Read permission is required so the history server can scan for the proper jhist file to grab, since a job with multiple AM attempts means the JHS can't just know what the name of the correct JHS file is – it would have to scan to see which is the latest. That would relax the permissions on a user's staging files to include the hadoop group. That's probably OK and far better than letting everyone in, but I haven't thought through all of the security ramifications of doing so.
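
          As a sketch of the scan mentioned above (illustrative only; it presumes the JHS can list the job's staging directory), picking the latest jhist among multiple AM attempts could look like:

            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.fs.PathFilter;

            public class LatestJhistFinder {
              // Returns the most recently modified *.jhist file in the job's
              // staging directory, or null if none is present.
              static Path findLatestJhist(FileSystem fs, Path jobStagingDir)
                  throws Exception {
                FileStatus[] candidates = fs.listStatus(jobStagingDir, new PathFilter() {
                  @Override
                  public boolean accept(Path p) {
                    return p.getName().endsWith(".jhist");
                  }
                });
                FileStatus latest = null;
                for (FileStatus s : candidates) {
                  if (latest == null
                      || s.getModificationTime() > latest.getModificationTime()) {
                    latest = s;
                  }
                }
                return latest == null ? null : latest.getPath();
              }
            }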

          Or to somehow get those files into the done_intermediate dir?

          A proper way to do this would be to have something run by the user of the job do it, as that doesn't require any additional security beyond what's already done today. However, that probably involves adding the ability in YARN to run a specified task when an application is failed/killed, to clean up after the unsuccessful run. It's a non-trivial task, but it would also help solve the problem we have today where staging directories are leaked for applications that are killed before the AM launches.

          Robert Kanter added a comment

          hmm... I hadn't thought about the security of those files. Do you have any alternatives on how to allow the JHS to have access to those files? Or to somehow get those files into the done_intermediate dir?

          Jason Lowe added a comment

          I should also point out that the staging directory itself may not be publicly accessible. The staging area is configurable, and our current setup places the staging area at /user. That puts each user's .staging directory under their home directory, and the home directory of most users is locked down to 700.

          Jason Lowe added a comment

          I don't believe that will work either, since the job history and job.xml files are 0600

          Sorry, this is incorrect – I was looking at the wrong files on one of our clusters. The job conf and jhist files are 644 by default, so it will work but insecurely.

          Jason Lowe added a comment

          I modified the permissions from 0700 to 0701.

          I don't believe that will work either, since the job history and job.xml files are 0600. So even if the history server can see the file via the execute bit, it won't be able to copy it. If we allow it to copy the file, then it's not secure: with those permissions, anyone who knows the job ID of an active job, the job's user, and the job's staging directory can obtain the job configuration (via job.xml) and job counters (via <jobid>_1.jhist). The information needed to pull this off is trivially available, as the first two are on the front page of the RM and the latter is in public cluster configs.

          Robert Kanter added a comment

          I’ve attached a preliminary version of the patch. Once we all agree on the specifics of the design, I can add unit tests.
          The patch follows the design I outlined before, where the RM writes a file when it sees an AM die, and the JHS sees that and copies the jhist and related files to the done_intermediate dir. I have tested this by running jobs and killing the AM. This results in incomplete information, as expected; however, in some cases some of the information won’t make 100% sense or is missing (e.g. no Finish Time if the AM didn’t actually finish). I’ve put in some code to take care of these situations. I’ve also attached a preliminary YARN patch to YARN-1731.

          How will the JHS copy the file to the intermediate directory? It likely won't have access to the staging directory containing the jhist file.

          I modified the permissions from 0700 to 0701.

          Jason Lowe added a comment

          How will the JHS copy the file to the intermediate directory? It likely won't have access to the staging directory containing the jhist file.

          Karthik Kambatla added a comment

          Proposal makes sense to me. Do you want to open a YARN JIRA for the YARN-specific changes?

          Robert Kanter added a comment

          I propose we solve this issue by doing the following:
          The Resource Manager is aware when the AM fails; when an AM fails, the RM can write a flag file to a new “fail” directory. The JHS periodically scans the "fail" dir for these flag files. When it sees one, it then looks for the History for that failed AM; if found, it copies/moves the History to the intermediate directory, where it will be processed by the JHS normally. If not found, it does nothing. Once done, the JHS can then delete the flag file.
          For the Summary file, most of it is static, so we can simply have the AM write that file out at startup (with 0 or "N/A" for dynamic fields) and then overwrite it at shutdown to get the values for the dynamic fields as it does now. If the AM fails, then the JHS will at least be able to pickup the first version of the Summary file.
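
          To make the proposed JHS side concrete, a hypothetical scanner might look like the following; the directory names, the flag-file naming scheme, and the staging-directory lookup are assumptions for illustration, not the attached patch.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.FileUtil;
            import org.apache.hadoop.fs.Path;

            public class FailedAmHistoryScanner {
              private final FileSystem fs;
              private final Path failDir;          // e.g. /mr-history/fail
              private final Path intermediateDir;  // e.g. /mr-history/intermediate

              FailedAmHistoryScanner(Configuration conf, Path failDir,
                  Path intermediateDir) throws Exception {
                this.fs = FileSystem.get(conf);
                this.failDir = failDir;
                this.intermediateDir = intermediateDir;
              }

              // One pass over the "fail" directory: for each flag file the RM wrote,
              // try to move the job's history into the intermediate dir, then delete
              // the flag so it is not reprocessed.
              void scanOnce(Configuration conf) throws Exception {
                for (FileStatus flag : fs.listStatus(failDir)) {
                  String jobId = flag.getPath().getName();  // assume flag named after the job
                  Path staged = stagedHistoryFor(jobId);
                  if (staged != null && fs.exists(staged)) {
                    Path target = new Path(intermediateDir, staged.getName());
                    FileUtil.copy(fs, staged, fs, target, true, conf);
                  }
                  fs.delete(flag.getPath(), false);
                }
              }

              private Path stagedHistoryFor(String jobId) {
                // Placeholder: resolve the job's staging directory and pick its
                // latest .jhist file (see the earlier scan sketch); omitted here.
                return null;
              }
            }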


            People

            • Assignee: Robert Kanter
            • Reporter: Robert Kanter
            • Votes: 0
            • Watchers: 13
