Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.4.2
Description
It probably impacts a lot more versions of the code than this, but I verified it on 3.4.2. This also appears to be related to https://issues.apache.org/jira/browse/SPARK-39209
scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-----------------------------+--------------------+ |v |ts |unix_micros(ts) | +--------------------+-----------------------------+--------------------+ |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| |0 |1970-01-01 00:00:00 |0 | |1990 |1970-01-01 00:33:10 |1990000000 | +--------------------+-----------------------------+--------------------+ scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-------------------+---------------+ |v |ts |unix_micros(ts)| +--------------------+-------------------+---------------+ |9223372036854775807 |1969-12-31 23:59:59|-1000000 | |-9223372036854775808|1970-01-01 00:00:00|0 | |0 |1970-01-01 00:00:00|0 | |1990 |1970-01-01 00:33:10|1990000000 | +--------------------+-------------------+---------------+
It looks like InMemoryTableScanExec is not doing code generation for the expressions, but the ProjectExec after the repartition is.
If I disable code gen I get the same answer in both cases.
scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-----------------------------+--------------------+ |v |ts |unix_micros(ts) | +--------------------+-----------------------------+--------------------+ |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| |0 |1970-01-01 00:00:00 |0 | |1990 |1970-01-01 00:33:10 |1990000000 | +--------------------+-----------------------------+--------------------+ scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-----------------------------+--------------------+ |v |ts |unix_micros(ts) | +--------------------+-----------------------------+--------------------+ |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| |0 |1970-01-01 00:00:00 |0 | |1990 |1970-01-01 00:33:10 |1990000000 | +--------------------+-----------------------------+--------------------+
Is the code used in codegen, but
is what is used outside of code gen.
Apparently `SECONDS.toMicros` truncates the value on an overflow, but the codegen does not.
scala> Long.MaxValue res11: Long = 9223372036854775807 scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue) res12: Long = 9223372036854775807 scala> Long.MaxValue * (1000L * 1000L) res13: Long = -1000000
Ideally these would be consistent with each other. I personally would prefer the truncation as it feels more accurate, but I am fine either way.
Attachments
Issue Links
- links to