https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of date vs. timestamp comparison consistent with SQL, which, unfortunately, isn't consistent with common sense.
When dates are compared with timestamps, they are promoted to timestamps at midnight of the date, in the server timezone, which is almost always UTC. This only works well if all timestamps in the data are logically time instants as opposed to dates + times, which only become instants with a known timezone.
The fundamental issue is that dates are a human time concept and instant are a machine time concept. While we can technically promote one to the other, logically, it only works 100% if midnight for all dates in the system is in the server timezone.
Every major modern platform offers a clear distinction between machine time (instants) and human time (an instant with a timezone, UTC offset, etc.), because we have learned the hard way that date & time handling is a never-ending source of confusion and bugs. SQL, being an ancient language (40+ years old), is well behind software engineering best practices; using it as a guiding light is necessary for Spark to win market share, but unfortunate in every other way.
For example, Java has:
I am not suggesting we add new data types to Spark. I am suggesting we go to the heart of the matter, which is that most date vs. time handling issues are the result of confusion or carelessness.
What about introducing a new setting that makes comparisons between dates and timestamps illegal, preferably with a helpful exception message?
If it existed, I would certainly make it the default for all our clusters. The minor coding convenience that comes from being able to compare dates & timestamps with an automatic type promotion pales in comparison with the risk of subtle bugs that remain undetected for a long time.