Spark / SPARK-49288

to_date ... too slow


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.7, 3.5.2
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The to_date built-in function creates a new ParseException instance every time a string value can't be parsed. ParseException extends Exception, which in turn extends Throwable, whose constructor calls the fillInStackTrace method. This method is not only one of the most expensive methods in the JVM, it is also synchronized, so it can introduce significant overhead.
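
      For context, the standard JVM workaround when exceptions are used for control flow is to skip that stack capture. A minimal Scala sketch (FastParseException is an illustrative name, not part of Spark or the JDK):

      // Hypothetical lightweight exception: overriding fillInStackTrace
      // avoids the native, synchronized stack-walking work that Throwable's
      // constructor normally performs.
      class FastParseException(message: String) extends Exception(message) {
        override def fillInStackTrace(): Throwable = this
      }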

      Here's a stack trace as an example:

      java.lang.Throwable.fillInStackTrace(Native Method)
      java.lang.Throwable.fillInStackTrace(Throwable.java:783) => holding Monitor(java.text.ParseException@176695786)
      java.lang.Throwable.<init>(Throwable.java:265)
      java.lang.Exception.<init>(Exception.java:66)
      java.text.ParseException.<init>(ParseException.java:63)
      java.text.DateFormat.parse(DateFormat.java:366)
      org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.agg_doAggregateWithKeys_0$(Unknown Source)
      org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown Source)
      org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
      org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
      org.apache.spark.scheduler.Task.run(Task.scala:123)
      org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:413)
      org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:419)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      java.lang.Thread.run(Thread.java:748)


      Because empty strings also throw ParseException, parsing several date fields with many empty values in a large DataFrame can make a Spark task take far longer than expected (and needed).
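
      A rough micro-benchmark sketch of that effect outside Spark (timings are machine-dependent and not taken from this issue):

      import java.text.{ParseException, SimpleDateFormat}

      object EmptyStringParseCost {
        def main(args: Array[String]): Unit = {
          val fmt = new SimpleDateFormat("yyyy-MM-dd")
          val n = 1000000
          val start = System.nanoTime()
          var i = 0
          while (i < n) {
            // Each failed parse allocates a ParseException, whose construction
            // triggers the synchronized fillInStackTrace call shown in the trace above.
            try fmt.parse("") catch { case _: ParseException => () }
            i += 1
          }
          println(f"$n%d failed parses: ${(System.nanoTime() - start) / 1e6}%.0f ms")
        }
      }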


      The following improvements are strongly suggested:


      1. Add a warning to the to_date documentation page, clearly stating that large numbers of invalid string date values will introduce serious performance issues. Note: the same warning should also be included in the documentation for UDFs.
      2. Add a check on string values before parsing, to avoid parsing strings that cannot succeed and the unnecessary creation of large numbers of exceptions; at a minimum for empty values, if validating full date strings is considered too costly (see the sketch after this list).
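
      A hedged sketch of such a guard at the API-usage level (safeToDate is a hypothetical helper, not an existing Spark function; the proper fix would live in Spark's own parsing path):

      import org.apache.spark.sql.Column
      import org.apache.spark.sql.functions.{lit, to_date, trim, when}

      object DateParseGuard {
        // Null/blank inputs short-circuit to null before any parsing happens,
        // so no ParseException is ever constructed for them.
        def safeToDate(c: Column, fmt: String): Column =
          when(c.isNull || trim(c) === lit(""), lit(null))
            .otherwise(to_date(c, fmt))
      }

      // Usage: df.withColumn("d", DateParseGuard.safeToDate(col("raw_date"), "yyyy-MM-dd"))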


    Attachments

    Activity

    People

        Assignee: Unassigned
        Reporter: Ángel Álvarez Pascua
        Votes: 0
        Watchers: 1

    Dates

        Created:
        Updated: