Spark / SPARK-16826

java.util.Hashtable limits the throughput of PARSE_URL()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Hello!

      I'm using c4.8xlarge instances on EC2 (36 cores each) and running a large number of parse_url(url, "host") calls in Spark SQL.
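
      For reference, the workload looks roughly like this (the table name, column name and S3 path below are simplified placeholders, not the real dataset):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("parse-url-host").getOrCreate()

      // "logs", "url" and the S3 path are placeholders for the real dataset.
      spark.read.parquet("s3://bucket/logs").createOrReplaceTempView("logs")

      // parse_url in the filter means ParseUrl.eval runs once per scanned row,
      // and each call constructs a java.net.URL.
      spark.sql("SELECT count(*) FROM logs WHERE parse_url(url, 'HOST') = 'example.com'").show()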

      Unfortunately, java.net.URL seems to rely on an internal synchronized cache (a static java.util.Hashtable of protocol handlers), and under this workload the instances end up roughly 90% idle.

      When I look at the thread dump for my executors, most of the executor threads are BLOCKED, in the following state:

      java.util.Hashtable.get(Hashtable.java:362)
      java.net.URL.getURLStreamHandler(URL.java:1135)
      java.net.URL.<init>(URL.java:599)
      java.net.URL.<init>(URL.java:490)
      java.net.URL.<init>(URL.java:439)
      org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
      org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
      org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
      org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
      org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
      org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
      org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
      org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
      scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
      org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
      org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
      org.apache.spark.scheduler.Task.run(Task.scala:85)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      java.lang.Thread.run(Thread.java:745)
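
      The same contention can be reproduced outside Spark. Something like the following standalone snippet should show the same pattern (illustrative only; thread and iteration counts are arbitrary, and this is not the actual job code):

      import java.net.URL

      // Each thread builds URLs in a tight loop, like ParseUrl.getUrl does once per row.
      // URL's constructor calls getURLStreamHandler, which reads a static synchronized
      // java.util.Hashtable of protocol handlers, so with many threads most of them sit
      // BLOCKED on that monitor, just like the executor threads in the dump above.
      val threads = (1 to 36).map { _ =>
        new Thread(new Runnable {
          def run(): Unit = {
            var i = 0
            while (i < 1000000) {
              new URL("http://example.com/index.html")
              i += 1
            }
          }
        })
      }
      threads.foreach(_.start())
      threads.foreach(_.join())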
      

      However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores each, throughput is almost 10x higher and the CPUs are back at ~100% utilization.
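
      Splitting the executors only reduces how many threads contend on each JVM's Hashtable monitor. A code-level alternative might be to parse with java.net.URI instead of java.net.URL, since URI parsing does not go through the stream-handler lookup at all. Rough sketch of the idea (not a tested patch against ParseUrl):

      import java.net.{URI, URISyntaxException}

      // URI parsing happens entirely inside the URI class, with no shared Hashtable
      // lookup, so extracting the host scales with the number of cores.
      def hostOf(url: String): Option[String] =
        try Option(new URI(url).getHost)
        catch { case _: URISyntaxException => None }

      hostOf("https://issues.apache.org/jira/browse/SPARK-16826")  // Some("issues.apache.org")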

      Thanks!

People

    Assignee: sylvinus (Sylvain Zimmer)
    Reporter: sylvinus (Sylvain Zimmer)
    Votes: 1
    Watchers: 3
