Spark / SPARK-6748

QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels: None

      Description

      Spotted this issue while running a simple micro-benchmark:

      import sqlContext.implicits._  // required for toDF in 1.3.0 (auto-imported in spark-shell)

      // Write 10 million (key, value) rows to Parquet.
      sc.parallelize(1 to 10000000).
        map(i => (i, s"val_$i")).
        toDF("key", "value").
        saveAsParquetFile("file:///tmp/src.parquet")

      // Read them back and collect all rows to the driver.
      sqlContext.parquetFile("file:///tmp/src.parquet").collect()


      YJP profiling showed that 10 million StructType instances, 10 million StructField[] arrays, and 20 million StructField instances were allocated.

      It turned out that DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:

      execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
      

      The problem is a combination of two facts: QueryPlan.schema is a def rather than a lazy val, and since 1.3.0 convertRowToScala returns a GenericRowWithSchema. Because convertRowToScala(_, schema) expands to row => convertRowToScala(row, schema), the schema method is re-evaluated for every row, so each of the 10 million rows ends up carrying its own freshly allocated schema object.
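
      Here is a minimal, self-contained sketch of that re-evaluation (names like SchemaAllocationDemo, schemaAsDef, and convert are made up for illustration; this is not the actual QueryPlan source):

      import org.apache.spark.sql.types._

      object SchemaAllocationDemo extends App {
        var allocations = 0

        // Mimics QueryPlan.schema as a def (the 1.3.0 behavior):
        // every call builds a brand-new StructType.
        def schemaAsDef: StructType = {
          allocations += 1
          StructType(Seq(StructField("key", IntegerType), StructField("value", StringType)))
        }

        // The change the title suggests: a lazy val is evaluated once and cached.
        lazy val schemaAsLazyVal: StructType = {
          allocations += 1
          StructType(Seq(StructField("key", IntegerType), StructField("value", StringType)))
        }

        // Stand-in for ScalaReflection.convertRowToScala.
        def convert(row: Int, schema: StructType): Int = row

        val rows = 1 to 1000

        // convert(_, schemaAsDef) expands to row => convert(row, schemaAsDef),
        // so schemaAsDef is re-evaluated once per row: 1000 allocations.
        rows.map(convert(_, schemaAsDef))
        println(s"def: $allocations StructType allocations")       // 1000

        allocations = 0
        rows.map(convert(_, schemaAsLazyVal))
        println(s"lazy val: $allocations StructType allocations")  // 1
      }

      Declaring QueryPlan.schema as a lazy val, as the title suggests, means the StructType (and its StructFields) is built once and shared by all returned rows; this is the fix shipped in 1.4.0.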


    People

    • Assignee: Cheng Lian
    • Reporter: Cheng Lian
    • Votes: 0
    • Watchers: 3
