SPARK-6748

QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels: None

    Description

      Spotted this issue while running a simple micro-benchmark:

      // toDF requires the SQLContext implicits; import them if not already in scope:
      import sqlContext.implicits._

      sc.parallelize(1 to 10000000).
        map(i => (i, s"val_$i")).
        toDF("key", "value").
        saveAsParquetFile("file:///tmp/src.parquet")
      
      sqlContext.parquetFile("file:///tmp/src.parquet").collect()
      

      YJP profiling showed that 10 million StructType instances, 10 million StructField[] arrays, and 20 million StructField instances were allocated (one schema per row, each with two fields for the key and value columns).

      It turned out that DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:

      execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
      
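      Note that schema sits inside the placeholder closure, so it is re-evaluated for every row; the line above desugars to:

      // Equivalent expansion of the underscore shorthand: `schema` is a
      // method call on the plan, executed once per row.
      execute().map(r => ScalaReflection.convertRowToScala(r, schema)).collect()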

      The problem is that QueryPlan.schema is a plain method (a def), so every call constructs a fresh StructType, and as of 1.3.0 convertRowToScala returns a GenericRowWithSchema. Together, these two facts yield 10 million rows, each carrying its own schema object. Declaring schema as a lazy val computes the StructType once per plan instead; the sketch below contrasts the two.
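      A minimal, self-contained sketch of the before/after behavior, using simplified stand-ins for the real Catalyst classes (the StructType and StructField here are toys, not the org.apache.spark.sql.types ones):

      // Toy stand-ins for the Catalyst types, just to count allocations.
      case class StructField(name: String, dataType: String)
      case class StructType(fields: Array[StructField])

      object SchemaAllocation extends App {
        var allocations = 0

        // Builds the two-column (key, value) schema from the benchmark above.
        def buildSchema(): StructType = {
          allocations += 1
          StructType(Array(StructField("key", "int"), StructField("value", "string")))
        }

        trait DefPlan {
          // Before the fix: a def, so the StructType is rebuilt on every access.
          def schema: StructType = buildSchema()
        }

        trait LazyValPlan {
          // After the fix: a lazy val, built once and cached thereafter.
          lazy val schema: StructType = buildSchema()
        }

        val rows = 1 to 1000000

        val before = new DefPlan {}
        rows.foreach(_ => before.schema)  // one access per row, as in executeCollect
        println(s"def:      $allocations StructType allocations")   // 1000000

        allocations = 0
        val after = new LazyValPlan {}
        rows.foreach(_ => after.schema)
        println(s"lazy val: $allocations StructType allocations")   // 1
      }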

          People

            Assignee: Cheng Lian
            Reporter: Cheng Lian
            Votes: 0
            Watchers: 3
