Apache Airflow / AIRFLOW-5385

SparkSubmit status check takes a long time

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 1.10.2
    • Fix Version/s: None
    • Component/s: contrib
    • Labels:
      None

      Description

      Hello,

      We have an issue with SparkSubmitOperator. Airflow DAGs show that some streaming applications are interrupted. I analyzed this behaviour: SparkSubmitHook is responsible for checking the driver status.

      We discovered some timeouts and tried to reproduce the status-check command. This is one execution, measured with `time`:

      time /opt/java/jdk1.8.0_181/jre/bin/java -cp /opt/shared/spark/client/conf/:/opt/shared/spark/client/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://spark-master.corp.com:6066 --status driver-20190901180337-2749 
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      19/09/02 17:05:53 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20190901180337-2749 in spark://lgmadbdtpspk01v.corp.logitravelgroup.com:6066.
      19/09/02 17:05:59 INFO RestSubmissionClient: Server responded with SubmissionStatusResponse:
      {
        "action" : "SubmissionStatusResponse",
        "driverState" : "RUNNING",
        "serverSparkVersion" : "2.2.1",
        "submissionId" : "driver-20190901180337-2749",
        "success" : true,
        "workerHostPort" : "172.25.10.194:45441",
        "workerId" : "worker-20190821201014-172.25.10.194-45441"
      }
      
      real 0m11.598s 
      user 0m2.092s 
      sys 0m0.222s

      We analyzed the Scala code and the Spark API. The spark-submit status command ends up issuing an HTTP GET request to a URL. Using curl, this is the time the Spark master takes to return the status:

       time curl "http://spark-master.corp.com:6066/v1/submissions/status/driver-20190901180337-2749"
      {
        "action" : "SubmissionStatusResponse",
        "driverState" : "RUNNING",
        "serverSparkVersion" : "2.2.1",
        "submissionId" : "driver-20190901180337-2749",
        "success" : true,
        "workerHostPort" : "172.25.10.194:45441",
        "workerId" : "worker-20190821201014-172.25.10.194-45441"
      }
      real	0m0.011s
      user	0m0.000s
      sys	0m0.006s
      

      The status check takes about 11.6 seconds with spark-submit versus 0.011 seconds with curl.

      How can this behaviour be explained?
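
For comparison, the curl request above could be issued directly from Python. This is only a minimal sketch, not how SparkSubmitHook works today: the function names `build_status_url`, `parse_driver_state` and `fetch_driver_state` are hypothetical, and it assumes the master's REST endpoint is reachable over plain HTTP at the same host and port as in the `--master` URL, as it was in the curl call.

```python
import json
import urllib.request


def build_status_url(master: str, submission_id: str) -> str:
    """Build the REST status URL from a 'spark://host:port' master URL.

    Assumption: the REST server listens on the host and port given in the
    master URL, as in the curl call in this report.
    """
    prefix = "spark://"
    host_port = master[len(prefix):] if master.startswith(prefix) else master
    return f"http://{host_port}/v1/submissions/status/{submission_id}"


def parse_driver_state(body: str) -> str:
    """Extract 'driverState' from a SubmissionStatusResponse JSON body."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError(f"status request was not successful: {payload}")
    return payload["driverState"]


def fetch_driver_state(master: str, submission_id: str, timeout: float = 5.0) -> str:
    """Issue the HTTP GET and return the driver state (hypothetical helper)."""
    url = build_status_url(master, submission_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_driver_state(response.read().decode("utf-8"))
```

Calling `fetch_driver_state("spark://spark-master.corp.com:6066", "driver-20190901180337-2749")` would perform the same GET as the curl command and, for the response shown above, return the `driverState` value. It avoids starting a JVM for every status check, which may account for part of the difference measured here.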

              People

              • Assignee:
                Unassigned
              • Reporter:
                sergio.soto Sergio Soto
              • Votes:
                1
              • Watchers:
                3
