Livy / LIVY-586

When a batch fails on startup, Livy continues to report the batch as "starting", even though it has failed


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.7.0
    • Component/s: Batch
    • Labels: None
    • Environment: AWS EMR, Livy submits batches to YARN in cluster mode

    Description

      When starting a Livy batch, I accidentally pointed it at a jar location in S3 that did not exist. Livy then continued to report that the job was "starting", even though it had clearly failed.
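
      For context, the batch was submitted through Livy's REST API. Below is a minimal sketch of the submission, assuming a placeholder Livy URL and main class; only the jar path is taken from the logs that follow:

          import requests

          LIVY_URL = "http://localhost:8998"  # placeholder; the real server was the EMR master

          # Submit a batch whose jar does not exist in S3.
          resp = requests.post(
              LIVY_URL + "/batches",
              json={
                  "file": "s3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar",
                  "className": "com.example.Main",  # placeholder main class
              },
          )
          batch = resp.json()
          print(batch["id"], batch["state"])  # the batch is created and initially reports "starting"

      The batch is accepted and enters the "starting" state; the problem is that it keeps reporting that state after spark-submit fails.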

      stdout:

      2019-04-05 11:24:18,149 [main] WARN org.apache.hadoop.util.NativeCodeLoader [appName=] [jobId=] [clusterId=] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Warning: Skip remote jar s3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar.
      2019-04-05 11:24:19,152 [main] INFO org.apache.hadoop.yarn.client.RMProxy [appName=] [jobId=] [clusterId=] - Connecting to ResourceManager at ip-10-25-30-127.dev.cainc.internal/10.25.30.127:8032
      2019-04-05 11:24:19,453 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Requesting a new application from cluster with 6 NodeManagers
      2019-04-05 11:24:19,532 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Verifying our application has not requested more than the maximum memory capability of the cluster (54272 MB per container)
      2019-04-05 11:24:19,533 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Will allocate AM container, with 9011 MB memory including 819 MB overhead
      2019-04-05 11:24:19,534 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Setting up container launch context for our AM
      2019-04-05 11:24:19,537 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Setting up the launch environment for our AM container
      2019-04-05 11:24:19,549 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Preparing resources for our AM container
      2019-04-05 11:24:21,059 [main] WARN org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
      2019-04-05 11:24:23,790 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Uploading resource file:/mnt/tmp/spark-b4e4a760-77a3-4554-a3f3-c3f82675d865/__spark_libs__3639879082942366045.zip -> hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222/__spark_libs__3639879082942366045.zip
      2019-04-05 11:24:26,817 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Uploading resource s3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar -> hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar
      2019-04-05 11:24:26,940 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Deleted staging directory hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222
      Exception in thread "main" java.io.FileNotFoundException: No such file or directory 's3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar'
      	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:805)
      	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:536)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
      	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
      	at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:356)
      	at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:577)
      	at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:576)
      	at scala.Option.foreach(Option.scala:257)
      	at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:576)
      	at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:869)
      	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169)
      	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1152)
      	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      2019-04-05 11:24:26,964 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Shutdown hook called
      2019-04-05 11:24:26,965 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Deleting directory /mnt/tmp/spark-aa8e8eff-ca2c-4358-a24f-19eb3863ef8f
      2019-04-05 11:24:26,966 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Deleting directory /mnt/tmp/spark-b4e4a760-77a3-4554-a3f3-c3f82675d865
      

      stderr is empty

      YARN Diagnostics eventually warns that the tag for the batch can't be found after 900 seconds.
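
      In other words, GET /batches/{batchId}/state keeps returning "starting" long after spark-submit has exited with the FileNotFoundException above, instead of the batch transitioning to "dead". A sketch of the check, with the same placeholder URL and a hypothetical batch id:

          import time

          import requests

          LIVY_URL = "http://localhost:8998"  # placeholder
          batch_id = 0  # hypothetical id returned by the submission request

          for _ in range(10):
              state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
              print(state)  # still "starting", even though spark-submit has already failed
              time.sleep(60)

      The 900-second figure presumably corresponds to the timeout Livy applies when looking up the YARN application tagged with the batch id (livy.server.yarn.app-lookup-timeout in livy.conf), though that is an assumption about this EMR setup rather than something stated in the report.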


People

    Assignee: Jie Wang (runzhiwang)
    Reporter: Sam Brougher (sbrougher)
    Votes: 0
    Watchers: 3


Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 1h 20m