Description
The ids assigned by monotonicallyIncreasingId() are not unique when the DataFrame is sampled with replacement:
from pyspark.sql import Row
from pyspark.sql.functions import monotonicallyIncreasingId
hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
Output:
[Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=1, id=429496729600), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792), Row(a=2, id=867583393792)]
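To make the duplication easier to read, here is a minimal sketch that decodes the two distinct ids from the output above. It assumes the bit layout given in the Spark documentation for this function (partition ID in the upper 31 bits, per-partition record number in the lower 33 bits). The helper name decode_id and the constant RECORD_BITS are made up for illustration and are not part of any Spark API.

```python
# Assumption: ids follow the documented layout of Spark's monotonically
# increasing id: (partition ID << 33) + record number within the partition.
RECORD_BITS = 33  # hypothetical constant name, used only in this sketch


def decode_id(value):
    """Split an id into (partition_id, record_number) under the assumed layout."""
    partition_id = value >> RECORD_BITS
    record_number = value & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The two distinct id values that appear in the reported output.
for observed in (429496729600, 867583393792):
    partition_id, record_number = decode_id(observed)
    print(observed, "-> partition", partition_id, "record", record_number)
```

Under that assumption, every sampled copy of a row decodes to record number 0 of its partition, so the copies differ only by partition and rows within one partition collide.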
Issue Links
- duplicates SPARK-16686 Dataset.sample with seed: result seems to depend on downstream usage (Resolved)
- links to