Description
Hello!
I'm using c4.8xlarge instances on EC2 with 36 cores and making a large number of parse_url(url, "host") calls in Spark SQL.
Unfortunately, it seems that java.net.URL looks up its stream handler in an internal synchronized Hashtable, so the executor threads contend on that lock and the instances end up about 90% idle.
When I view the thread dump for my executors, most of the executor threads are "BLOCKED" with this stack trace:
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
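For illustration, here is a minimal sketch of one possible way to avoid the contended code path: extracting the host with java.net.URI, which parses the string without going through URL.getURLStreamHandler. The class name HostExtractor and the method host are hypothetical, not part of Spark, and this is not necessarily how ParseUrl should be changed. Note that java.net.URI is stricter than java.net.URL about which characters it accepts, so some inputs that URL parses may yield null here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only (not Spark code).
final class HostExtractor {

    // Returns the host component of the given URL string,
    // or null if the string is not a valid URI or has no host.
    static String host(String url) {
        try {
            // URI parsing involves no shared stream-handler lookup,
            // so concurrent callers do not block on a common lock here.
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(host("http://spark.apache.org/path?query=1"));
    }
}
```

Because the method touches no shared mutable state, many task threads in a single executor should be able to call it in parallel without blocking each other.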
Thanks!
Issue Links
- incorporates SPARK-23056 parse_url regression when switched to using java.net.URI instead of java.net.URL (Resolved)
- links to