Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Spark 1.5 doc/QA sprint

      Description

      Reported by Ram Sriharsha

      import java.util.UUID
      import org.apache.spark.sql._
      import org.apache.spark.sql.types._
      val rdd = sc.parallelize(List(1,2,3), 2)
      val schema = StructType(List(StructField("index",IntegerType,true)))
      val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema)
      def id:() => String = () => {UUID.randomUUID().toString()}
      def square:Int => Int = (x: Int) => {x * x}
      val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect the ID to have materialized at this point
      dfWithId.collect()
      //res0: Array[org.apache.spark.sql.Row] = Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], [2,efd061be-e8cc-43fa-956e-cfd6e7355982], [3,79b0baab-627c-4761-af0d-8995b8c5a125])
      
      val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, IntegerType, col("index")))
      dfWithIdAndSquare.collect()
      //res1: Array[org.apache.spark.sql.Row] = Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], [3,209f269e-207a-4dfd-a186-738be5db2eff,9])
      //why are the IDs in lines 11 and 15 different?
      

      The randomly generated IDs are the same if show (which uses take under the hood) is used instead of collect.

        Attachments

          Activity

            People

            • Assignee:
              chenghao Cheng Hao
              Reporter:
              rxin Reynold Xin
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: