[SPARK-7158] collect and take return different results - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: SQL
Labels:
None

Target Version/s:

1.5.0
Sprint:
Spark 1.5 doc/QA sprint

Description

Reported by rams

import java.util.UUID
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(List(1,2,3), 2)
val schema = StructType(List(StructField("index",IntegerType,true)))
val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema)
def id:() => String = () => {UUID.randomUUID().toString()}
def square:Int => Int = (x: Int) => {x * x}
val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect the ID to have materialized at this point
dfWithId.collect()
//res0: Array[org.apache.spark.sql.Row] = Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], [2,efd061be-e8cc-43fa-956e-cfd6e7355982], [3,79b0baab-627c-4761-af0d-8995b8c5a125])

val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, IntegerType, col("index")))
dfWithIdAndSquare.collect()
//res1: Array[org.apache.spark.sql.Row] = Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], [3,209f269e-207a-4dfd-a186-738be5db2eff,9])
//why are the IDs in lines 11 and 15 different?

The randomly generated IDs are the same if show (which uses take under the hood) is used instead of collect.

Attachments

Issue Links

links to

[Github] Pull Request #5714 (chenghao-intel)

Activity

People

Assignee:: Cheng Hao

Reporter:: Reynold Xin

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 27/Apr/15 00:42

Updated:: 12/Jun/15 01:20

Resolved:: 12/Jun/15 01:20