Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-19812

YARN shuffle service fails to relocate recovery DB across NFS directories

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.1
    • 2.2.0, 2.3.0
    • Spark Core, YARN
    • None

    Description

      The yarn shuffle service tries to switch from the yarn local directories to the real recovery directory but can fail to move the existing recovery db's. It fails due to Files.move not doing directories that have contents.

      2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
      java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
      at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
      at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
      at java.nio.file.Files.move(Files.java:1395)
      at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
      at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
      at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
      at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)

      This used to use f.renameTo and we switched it in the pr due to review comments and it looks like didn't do a final real test. The tests are using files rather then directories so it didn't catch. We need to fix the test also.

      history: https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440

      Attachments

        Activity

          People

            tgraves Thomas Graves
            tgraves Thomas Graves
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: