Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
HIVE-4262 changed the partitioning behavior of the last_value function. For a specified last_value() OVER X. The ordering spec within X is used in addition to the partition spec for partitioning. i.e. last_value(a) OVER (PARTITION BY i ORDER BY j) operates last_value(a) on all rows within the unique combination of (i,j). The behavior I'd expect is for PARTITION BY i to define the partitioning, and ORDER BY to define the ordering within the PARTITION. i.e. last_value(a) OVER (PARTITION BY i ORDER BY j) should operate last_value(a) on all rows within the unique values of , ordered by j within the partition.
This was changed to be consistent with how SQLServer handled such queries. SQLServer Docs describe their example (which performs as Hive does):
The PARTITION BY clause partitions the employees by department and the LAST_VALUE function is applied to each partition independently. The ORDER BY clause specified in the OVER clause determines the logical order in which the LAST_VALUE function is applied to the rows in each partition.
To me, their behavior is inconsistent with their description. I've filled an upstream bug with Microsoft for the same.
Oracle and Redshift both exhibit the behavior I'd expect.
Considering Hive-4262 has been in core for 2+ years, I don't think we can change the behavior without potentially impacting clients. But I would like a way to enable the expected behavior at the least (behind a config flag maybe?). What do you think?