Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6116 DataFrame API improvement umbrella ticket (Spark 1.5)
  3. SPARK-7276

withColumn is very slow on dataframe with large number of columns

    XMLWordPrintableJSON

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Spark 1.5 doc/QA sprint

      Description

      The code snippet demonstrates the problem.

          import org.apache.spark.sql._
          import org.apache.spark.sql.types._
      
          val sparkConf = new SparkConf().setAppName("Spark Test").setMaster(System.getProperty("spark.master", "local[4]"))
      
          val sc = new SparkContext(sparkConf)
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._
      
          val custs = Seq(
            Row(1, "Bob", 21, 80.5),
            Row(2, "Bobby", 21, 80.5),
            Row(3, "Jean", 21, 80.5),
            Row(4, "Fatime", 21, 80.5)
          )
      
          var fields = List(
            StructField("id", IntegerType, true),
            StructField("a", IntegerType, true),
            StructField("b", StringType, true),
            StructField("target", DoubleType, false))
          val schema = StructType(fields)
      
          var rdd = sc.parallelize(custs)
          var df = sqlContext.createDataFrame(rdd, schema)
      
          for (i <- 1 to 200) {
            val now = System.currentTimeMillis
            df = df.withColumn("a_new_col_" + i, df("a") + i)
            println(s"$i -> " + (System.currentTimeMillis - now))
          }
      
          df.show()
      

        Attachments

          Activity

            People

            • Assignee:
              cloud_fan Wenchen Fan
              Reporter:
              apclement Alexandre CLEMENT
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: