[SPARK-16686] Dataset.sample with seed: result seems to depend on downstream usage - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.2, 2.0.0
Fix Version/s: 2.0.1, 2.1.0
Component/s: SQL
Labels: None
Environment:

Spark 1.6.2 and Spark 2.0 - RC4
Standalone
Single-worker cluster

Description

Steps to reproduce the bug:

1. Create a DataFrame DF, and sample it with a fixed seed.
2. Collect the sampled DataFrame -> result1
3. Call a particular UDF on the same sampled DataFrame -> result2

You would expect result1 and result2 to be computed from the same rows of DF, but they appear not to be.
Note: result1 and result2 are each deterministic.

See the attached notebook for details. Cells in the notebook were executed in order.
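
The steps above could be sketched in PySpark roughly as follows. This is not the code from the attached notebook: the input range, the fraction, the seed, the identity UDF, and the helper names `reproduce` and `same_rows` are all assumptions made for illustration, and an identity UDF may not trigger the problem the way the notebook's UDF did.

```python
def reproduce(spark, fraction=0.5, seed=42):
    """Return (ids seen by collect(), ids seen via the UDF path).

    `spark` is an existing SparkSession; all values here are hypothetical.
    """
    # Imported lazily so the helpers below load without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    df = spark.range(0, 100)                    # assumed input: ids 0..99
    sampled = df.sample(False, fraction, seed)  # no replacement, fixed seed

    # Step 2: collect the sampled DataFrame directly.
    result1 = [row.id for row in sampled.collect()]

    # Step 3: apply a UDF to the same sampled DataFrame.
    # The identity UDF is a stand-in for the "particular UDF" in the report.
    identity = udf(lambda x: x, LongType())
    projected = sampled.select(identity("id").alias("out"))
    result2 = [row.out for row in projected.collect()]

    return result1, result2


def same_rows(result1, result2):
    """True when both results were computed from the same sampled rows."""
    return sorted(result1) == sorted(result2)
```

On a version with the fix, `same_rows(*reproduce(spark))` would be expected to hold; the report describes the two results disagreeing on the affected versions.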

Attachments

DataFrame.sample bug - 2.0.html (71 kB), attached 22/Jul/16 17:59 by Joseph K. Bradley

Issue Links

is duplicated by: SPARK-15382 monotonicallyIncreasingId doesn't work when data is upsampled (Closed)

links to: [Github] Pull Request #14327 (viirya)

People

Assignee: L. C. Hsieh

Reporter: Joseph K. Bradley

Votes: 0

Watchers: 4

Dates

Created: 22/Jul/16 17:59

Updated: 19/Aug/16 18:19

Resolved: 26/Jul/16 04:06