[SPARK-15382] monotonicallyIncreasingId doesn't work when data is upsampled - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.1
Fix Version/s: 2.0.1, 2.1.0
Component/s: SQL
Labels: None

Description

Assigned ids are not unique

from pyspark.sql import Row
from pyspark.sql.functions import monotonicallyIncreasingId

# Sample with replacement at fraction 10.0 (upsampling), then assign ids.
df = hiveContext.createDataFrame([Row(a=1), Row(a=2)])
df.sample(True, 10.0).withColumn('id', monotonicallyIncreasingId()).collect()

Output:

[Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=1, id=429496729600),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792),
 Row(a=2, id=867583393792)]
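
The duplicated ids above can be decoded to see which part of the id repeats. The following is a minimal sketch in plain Python (no Spark required), assuming the bit layout described in the Spark API documentation for this function: partition ID in the upper 31 bits, record number within the partition in the lower 33 bits. The helper name `decode_id` and the `rows` list are hypothetical and exist only for illustration; the id values are copied from the output above.

```python
from collections import Counter

# Assumption: layout taken from the Spark API docs for the
# monotonically increasing id function (upper 31 bits = partition ID,
# lower 33 bits = record number within the partition).
RECORD_BITS = 33


def decode_id(mid):
    """Split an id into (partition_id, record_number)."""
    return mid >> RECORD_BITS, mid & ((1 << RECORD_BITS) - 1)


# (a, id) pairs copied from the reported output:
# 7 rows for a=1 and 11 rows for a=2.
rows = [(1, 429496729600)] * 7 + [(2, 867583393792)] * 11

# Count how often each id occurs; a unique id would have a count of 1.
counts = Counter(mid for _, mid in rows)
for mid, n in sorted(counts.items()):
    partition_id, record_number = decode_id(mid)
    print(mid, partition_id, record_number, n)
```

Under that assumed layout, both ids decode to record number 0 (in partitions 50 and 101), so every sampled copy of a row carries the same partition ID and the same record number.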

Issue Links

duplicates

SPARK-16686 Dataset.sample with seed: result seems to depend on downstream usage

Resolved

links to

[Github] Pull Request #14181 (maropu)

[Github] Pull Request #14800 (maropu)

Activity

Hyukjin Kwon added a comment - 12/Jul/16 10:43 - edited

This also happens in the master branch (2.1.0).

Here is the shortened Scala version I tested.

spark.range(2).sample(true, 10.0).withColumn("mid", monotonically_increasing_id).show()

Takeshi Yamamuro added a comment - 13/Jul/16 00:29

hyukjin.kwon Will you take this?

Hyukjin Kwon added a comment - 13/Jul/16 01:34 - edited

I was just looking into this, but I don't mind if you open a PR.

Takeshi Yamamuro added a comment - 13/Jul/16 01:45

Okay, thanks. I'll check this.

Apache Spark added a comment - 13/Jul/16 15:21

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14181

Takeshi Yamamuro added a comment - 19/Aug/16 10:23 - edited

rxin viirya It seems this ticket has already been fixed in SPARK-16686.
Can we close this?

Takeshi Yamamuro added a comment - 25/Aug/16 02:52

Sorry, but master still has this bug.
I opened a PR, so could you check it?

Apache Spark added a comment - 25/Aug/16 02:53

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14800

People

Assignee: Takeshi Yamamuro

Reporter: Mateusz Buśkiewicz

Votes: 0

Watchers: 3

Dates

Created: 18/May/16 12:23

Updated: 12/Dec/22 18:10

Resolved: 19/Aug/16 17:12