In Spark 2.0.0, the window(Column timeColumn, String windowDuration, String slideDuration, String startTime) function startTime parameter behaves as follows:
startTime - The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
Given a windowDuration of 1 day and a startTime of 0h, I'd expect to see events from each day fall into the correct day's bucket. This doesn't happen as expected in every case, however, due to the way that this feature and timestamp / timezone support interact.
Using a fixed UTC reference, there is an assumption that all days are the same length (1 day === 24 h}). This is not the case for most timezones where the offset from UTC changes by 1h for 6 months out of the year. In this case, on the days that clocks go forward/back, one day is 23h long, 1 day is 25h long.
The result of this is that, for daylight savings time, some rows within 1h of midnight are aggregated to the wrong day.
Options for a fix are:
- This is the expected behavior, and the window() function should not be used for this type of aggregation with a long window length and the window function documentation should be updated as such, or
- The window function should respect timezones and work on the assumption that 1 day !== 24 h. The startTime should be updated to snap to the local timezone, rather than UTC.
- Support for both absolute and relative window lengths may be added