Hadoop YARN / YARN-3727

For better error recovery, check if the directory exists before using it for localization.

Details

    • Reviewed

    Description

      For better error recovery, check if the directory exists before using it for localization.
      We saw the following localization failure, which was caused by an existing cache directory:

      2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
      2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
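
The `Rename cannot overwrite non empty destination directory` error in the log above can be reproduced in miniature. The sketch below is only an analogue: it uses plain `java.nio.file` instead of Hadoop's `FileContext`, and the class name `RenameOntoNonEmptyDir`, the method `renameRejected`, and the file names are hypothetical. It shows that a rename with an overwrite option is still rejected when the destination directory already holds files.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class; an analogue of the failure, not NodeManager code.
public class RenameOntoNonEmptyDir {

  // Returns true if moving src over dst is rejected because dst is a
  // non-empty directory, even though REPLACE_EXISTING was requested.
  public static boolean renameRejected(Path src, Path dst) throws IOException {
    try {
      Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
      return false;
    } catch (DirectoryNotEmptyException e) {
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("nm");

    // Freshly downloaded resource in a working directory.
    Path work = Files.createDirectory(root.resolve("21637_tmp"));
    Files.createFile(work.resolve("1234.jar"));

    // Leftover cache directory that still contains a file.
    Path dest = Files.createDirectory(root.resolve("21637"));
    Files.createFile(dest.resolve("stale.jar"));

    System.out.println("rename rejected: " + renameRejected(work, dest));
  }
}
```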
      

      The root cause of this failure may be a disk failure, a LevelDB operation failure in startResourceLocalization/finishResourceLocalization, or something else.

      I wonder whether we can add error-recovery code that avoids the localization failure by not reusing an existing cache directory for localization.

      The exception was thrown at files.rename(dst_work, destDirPath, Rename.OVERWRITE) in FSDownload#call. Based on the following code, the existing cache directory used by the LocalizedResource is deleted after the exception.

      try {
        .........
        files.rename(dst_work, destDirPath, Rename.OVERWRITE);
      } catch (Exception e) {
        try {
          files.delete(destDirPath, true);
        } catch (IOException ignore) {
        }
        throw e;
      } finally {
      

      Since the conflicting local directory is deleted only after the localization has already failed, I think it would be better to check whether the directory exists before using it for localization, so that the failure is avoided in the first place.
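
The proposed check could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the code in the attached patches: it uses plain `java.nio.file` rather than Hadoop's `Path` and `FileContext`, and the class name `LocalizationPathPicker`, the method `nextFreeDir`, and the counter field are hypothetical. The idea is to keep advancing the unique ID until it names a directory that does not exist yet, so a leftover cache directory is skipped instead of becoming the rename destination.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; illustrates the idea only, not the NodeManager code.
public class LocalizationPathPicker {
  private final AtomicLong uniqueNumberGenerator;

  public LocalizationPathPicker(long start) {
    this.uniqueNumberGenerator = new AtomicLong(start);
  }

  // Returns the first <cacheRoot>/<id> path that does not already exist on disk.
  public Path nextFreeDir(Path cacheRoot) {
    while (true) {
      long id = uniqueNumberGenerator.incrementAndGet();
      Path candidate = cacheRoot.resolve(Long.toString(id));
      if (!Files.exists(candidate)) {
        return candidate;
      }
      // A stale directory is in the way; skip this ID and try the next one.
      System.err.println("Directory " + candidate + " already exists, trying the next one.");
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("filecache");
    Files.createDirectory(root.resolve("21637")); // simulate a leftover cache directory

    LocalizationPathPicker picker = new LocalizationPathPicker(21636);
    System.out.println(picker.nextFreeDir(root).getFileName());
  }
}
```

Run as written, `main` skips the existing `21637` directory and prints `21638`. The existence check is not atomic with the later rename, so this narrows the window for the failure rather than eliminating it. The cleanup in the catch block quoted above would still be needed as a fallback.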

      Attachments

        1. YARN-3727.000.patch
          14 kB
          Zhihai Xu
        2. YARN-3727.001.patch
          15 kB
          Zhihai Xu

          People

            zxu Zhihai Xu
            zxu Zhihai Xu