SPARK-11481: orderBy with multiple columns in WindowSpec does not work properly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.5.2, 1.6.0
    • Component/s: PySpark, SQL
    • Environment: All
    • Flags: Patch, Important

    Description

      When using multiple columns in the orderBy of a WindowSpec, the ordering seems to be applied only to the first column.

      A possible workaround is to sort the DataFrame first and then apply the window spec over the sorted DataFrame.

      e.g. (with the usual imports):

      import sys
      from pyspark.sql import Window, functions as func

      THIS DOES NOT WORK:
      window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)

      df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))

      THIS WORKS WELL:
      df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
      window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)

      df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))

      Also, can anybody confirm that this is a true workaround?
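      For reference, here is a self-contained sketch of the failing case, assuming Spark 1.5.x with an SQLContext; the column names follow the snippets above, while the sample rows and app name are invented for illustration:

      import sys
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Window
      from pyspark.sql import functions as func

      sc = SparkContext(appName='spark-11481-repro')  # hypothetical app name
      sqlContext = SQLContext(sc)

      # Invented sample data: one user whose rows tie on creation_date and
      # should therefore be ordered by the secondary keys mib_id and day.
      df = sqlContext.createDataFrame(
          [('u1', '2015-10-01', 2, 1, 1),
           ('u1', '2015-10-01', 1, 2, 1),
           ('u1', '2015-10-01', 1, 1, 1)],
          ['user_unique_id', 'creation_date', 'mib_id', 'day', 'group_counter'])

      # Cumulative sum per user over an unbounded preceding frame; the
      # running totals should accumulate in (creation_date, mib_id, day) order.
      window_sum = (Window.partitionBy('user_unique_id')
                          .orderBy('creation_date', 'mib_id', 'day')
                          .rowsBetween(-sys.maxsize, 0))

      df.withColumn('user_version',
                    func.sum(df.group_counter).over(window_sum)).show()

      On an affected build the running sum appears to accumulate in creation_date order only, whereas after the df.sort(...) workaround it follows all three keys.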


            People

              Assignee: Davies Liu (davies)
              Reporter: Jose Antonio (jamartinh)
