  Spark / SPARK-19812

YARN shuffle service fails to relocate recovery DB across NFS directories

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.2.0, 2.3.0
    • Component/s: YARN
    • Labels:
      None

      Description

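      The YARN shuffle service tries to switch from the YARN local directories to the real recovery directory, but it can fail to move the existing recovery DBs. The failure occurs because Files.move cannot move a directory that has contents when the move requires relocating the entries themselves (for example, across file systems); in that case it throws DirectoryNotEmptyException.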

      2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
      java.nio.file.DirectoryNotEmptyException: /yarn-local/sparkShuffleRecovery.ldb
      at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
      at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      at java.nio.file.Files.move(Files.java:1395)
      at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
      at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
      at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)

      This code used to call f.renameTo; we switched to Files.move in the PR in response to review comments, and it looks like we didn't do a final real test afterwards. The tests use files rather than directories, so they didn't catch the problem. We need to fix the tests as well.
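
      To illustrate the failure mode described above, here is a minimal sketch of a directory move that falls back to a recursive copy followed by a delete when Files.move throws DirectoryNotEmptyException. It is not the change that was merged for this issue; the class name RecoveryDirMover and its methods moveDirectory and copyThenDelete are hypothetical names used only for illustration, and only JDK calls are used.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Spark.
public class RecoveryDirMover {

  /** Tries a plain move first; falls back to copy-then-delete for non-empty directories. */
  public static void moveDirectory(Path src, Path dst) throws IOException {
    try {
      // A rename within one file system succeeds even if src has contents.
      Files.move(src, dst);
    } catch (DirectoryNotEmptyException e) {
      // The entries themselves would have to be moved (e.g. across file systems).
      copyThenDelete(src, dst);
    }
  }

  /** Copies the whole tree under src to dst, then removes src. */
  public static void copyThenDelete(Path src, Path dst) throws IOException {
    // Pass 1: recreate the directory structure and copy every file.
    Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
          throws IOException {
        Files.createDirectories(dst.resolve(src.relativize(dir).toString()));
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.copy(file, dst.resolve(src.relativize(file).toString()),
            StandardCopyOption.COPY_ATTRIBUTES);
        return FileVisitResult.CONTINUE;
      }
    });

    // Pass 2: delete the source only after the copy has fully succeeded.
    Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc)
          throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

      A test for this kind of helper would need a directory containing at least one file as its fixture, since a move of a plain file does not exercise the failing path. Note that on a single file system the fast path in moveDirectory succeeds, so the fallback has to be tested by calling copyThenDelete directly or by using two separate mounts.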

      history: https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440

            People

            • Assignee: Thomas Graves (tgraves)
            • Reporter: Thomas Graves (tgraves)