Details
Description
I'm getting incorrect results from several window functions:
- negative values from ROW_NUMBER()
- incorrect floats from SUM(...) OVER (...)
- huge integers from COUNT(...) OVER (...)
Here's an example PySpark session that shows the incorrect behaviors:
>>> logout_acc.printSchema()
root
 |-- account: long (nullable = true)
 |-- avatar: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- session_dur: double (nullable = true)

>>> session_numbers = sqlContext.sql("""
...     SELECT account, avatar, timestamp,
...         COUNT(*) OVER (
...             PARTITION BY account ORDER BY timestamp) AS session,
...         ROW_NUMBER() OVER (
...             PARTITION BY account ORDER BY timestamp) AS row_num,
...         SUM(session_dur) OVER (
...             PARTITION BY account ORDER BY timestamp) AS playtime
...     FROM logout_acc
... """)
>>> session_numbers.cache().registerTempTable('session_numbers')
>>> session_numbers.show()
+-------+------+--------------------+-------+-----------+--------------+
|account|avatar|           timestamp|session|    row_num|      playtime|
+-------+------+--------------------+-------+-----------+--------------+
|   ...1|  ...1|2015-09-01 00:04:...|      1| 1261342896| 4.991264E-317|
|   ...2|  ...2|2015-09-01 00:40:...|      1| -900190400| 4.991363E-317|
|   ...2|  ...2|2015-09-01 00:43:...|      2| -695150400| 4.991363E-317|
|   ...2|  ...2|2015-09-01 00:50:...|      3| -278701400| 4.991363E-317|
|   ...2|  ...2|2015-09-01 01:25:...|      4| 1824111600| 4.991363E-317|
|   ...2|  ...2|2015-09-01 01:28:...|      5| 1992217600| 4.991363E-317|
|   ...2|  ...2|2015-09-01 01:31:...|      6|-2127806696| 4.991363E-317|
|   ...2|  ...2|2015-09-03 00:47:...|      7|  574647760| 4.991363E-317|
|   ...3|  ...3|2015-09-01 01:16:...|      1| 1269805600|4.9914617E-317|
|   ...3|  ...3|2015-09-01 01:22:...|      2| 1621270600|4.9914617E-317|
|   ...3|  ...3|2015-09-01 14:52:...|      3|-1269231952|4.9914617E-317|
|   ...3|  ...3|2015-09-01 14:59:...|      4| -865949952|4.9914617E-317|
|   ...3|  ...3|2015-09-04 05:05:...|      5| -678665344|4.9914617E-317|
|   ...3|  ...3|2015-09-05 16:19:...|      6| 1645213072|4.9914617E-317|
|   ...3|  ...3|2015-09-05 17:09:...|      7|  339651776|4.9914617E-317|
|   ...3|  ...3|2015-09-05 20:27:...|      8| -669300112|4.9914617E-317|
|   ...3|  ...3|2015-09-05 21:12:...|      9| 2032531888|4.9914617E-317|
|   ...3|  ...3|2015-09-09 02:45:...|     10| 2046885648|4.9914617E-317|
|   ...3|  ...3|2015-09-09 03:47:...|     11| 1493942352|4.9914617E-317|
|   ...3|  ...3|2015-09-09 21:09:...|     12| -434263088|4.9914617E-317|
+-------+------+--------------------+-------+-----------+--------------+

>>> sqlContext.sql('select max(session) from session_numbers').show()
+-------------------+
|                _c0|
+-------------------+
|4687099421149855744|
+-------------------+
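For reference, the semantics these window functions should have (a cumulative frame per partition when ORDER BY is present) can be sketched outside Spark with Python's stdlib sqlite3, which supports window functions in SQLite 3.25+. The table and values below are illustrative, not the original data:

```python
# Sketch of the expected window-function semantics (not Spark itself).
# Requires a Python build linked against SQLite >= 3.25.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logout_acc (account INTEGER, ts TEXT, session_dur REAL)")
con.executemany(
    "INSERT INTO logout_acc VALUES (?, ?, ?)",
    [
        (1, "2015-09-01 00:04", 10.0),
        (2, "2015-09-01 00:40", 5.0),
        (2, "2015-09-01 00:43", 7.5),
    ],
)

# With ORDER BY in the OVER clause, COUNT(*) and SUM(...) use the default
# cumulative frame, so they give a running count and a running sum per account.
rows = con.execute("""
    SELECT account, ts,
        COUNT(*)         OVER (PARTITION BY account ORDER BY ts) AS session,
        ROW_NUMBER()     OVER (PARTITION BY account ORDER BY ts) AS row_num,
        SUM(session_dur) OVER (PARTITION BY account ORDER BY ts) AS playtime
    FROM logout_acc
    ORDER BY account, ts
""").fetchall()

for r in rows:
    print(r)
# (1, '2015-09-01 00:04', 1, 1, 10.0)
# (2, '2015-09-01 00:40', 1, 1, 5.0)
# (2, '2015-09-01 00:43', 2, 2, 12.5)
```

In particular, row_num is always a small positive integer and playtime is a running sum of the input durations, which is what makes the negative row numbers and denormal-looking floats above clearly wrong.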
In fact, the problem was transient: when I re-ran the commands above, the results came out correct. Unfortunately I also cannot share the data to help reproduce the problem, but I hope this report is useful nevertheless. Thanks.
Issue Links
- duplicates: SPARK-11009 "RowNumber in HiveContext returns negative values in cluster mode" (Resolved)