[SPARK-47063] CAST long to timestamp has different behavior for codegen vs interpreted - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.2
Fix Version/s: 4.0.0, 3.5.2, 3.4.3
Component/s: SQL
Labels:
- pull-request-available

Description

It probably impacts a lot more versions of the code than this, but I verified it on 3.4.2. This also appears to be related to https://issues.apache.org/jira/browse/SPARK-39209

scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
+--------------------+-----------------------------+--------------------+
|v                   |ts                           |unix_micros(ts)     |
+--------------------+-----------------------------+--------------------+
|9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
|-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
|0                   |1970-01-01 00:00:00          |0                   |
|1990                |1970-01-01 00:33:10          |1990000000          |
+--------------------+-----------------------------+--------------------+
scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
+--------------------+-------------------+---------------+
|v                   |ts                 |unix_micros(ts)|
+--------------------+-------------------+---------------+
|9223372036854775807 |1969-12-31 23:59:59|-1000000       |
|-9223372036854775808|1970-01-01 00:00:00|0              |
|0                   |1970-01-01 00:00:00|0              |
|1990                |1970-01-01 00:33:10|1990000000     |
+--------------------+-------------------+---------------+

It looks like InMemoryTableScanExec is not doing code generation for the expressions, but the ProjectExec after the repartition is.

If I disable code gen I get the same answer in both cases.

scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
+--------------------+-----------------------------+--------------------+
|v                   |ts                           |unix_micros(ts)     |
+--------------------+-----------------------------+--------------------+
|9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
|-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
|0                   |1970-01-01 00:00:00          |0                   |
|1990                |1970-01-01 00:33:10          |1990000000          |
+--------------------+-----------------------------+--------------------+
scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false)
+--------------------+-----------------------------+--------------------+
|v                   |ts                           |unix_micros(ts)     |
+--------------------+-----------------------------+--------------------+
|9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 |
|-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808|
|0                   |1970-01-01 00:00:00          |0                   |
|1990                |1970-01-01 00:33:10          |1990000000          |
+--------------------+-----------------------------+--------------------+

https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627

Is the code used in codegen, but

https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687

is what is used outside of code gen.

Apparently `SECONDS.toMicros` truncates the value on an overflow, but the codegen does not.

scala> Long.MaxValue
res11: Long = 9223372036854775807
scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue)
res12: Long = 9223372036854775807
scala> Long.MaxValue * (1000L * 1000L)
res13: Long = -1000000

Ideally these would be consistent with each other. I personally would prefer the truncation as it feels more accurate, but I am fine either way.

Attachments

Issue Links

links to

GitHub Pull Request #45294

Activity

People

Assignee:: Pablo Langa Blanco

Reporter:: Robert Joseph Evans

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 15/Feb/24 16:30

Updated:: 17/Sep/24 14:19

Resolved:: 28/Feb/24 03:45