Description
Spotted this issue while running a simple micro-benchmark:
sc.parallelize(1 to 10000000)
  .map(i => (i, s"val_$i"))
  .toDF("key", "value")
  .saveAsParquetFile("file:///tmp/src.parquet")

sqlContext.parquetFile("file:///tmp/src.parquet").collect()
YJP profiling results showed that 10 million StructType instances, 10 million StructField[] arrays, and 20 million StructField instances were allocated (the schema has two fields, key and value, hence two StructFields per row).
It turned out that DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
The problem is that QueryPlan.schema is a method that builds a fresh StructType on every call, and since 1.3.0 convertRowToScala returns a GenericRowWithSchema. Because schema is referenced inside the per-row closure passed to map, these two facts combine to produce 10 million rows, each carrying its own separate schema object.
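A straightforward remedy is to evaluate the schema once, outside the closure. A minimal sketch (rowSchema is an illustrative name, and this is one possible fix, not necessarily the patch that shipped):

def executeCollect(): Array[Row] = {
  // Evaluate QueryPlan.schema once; all rows then share this single StructType.
  val rowSchema = schema
  execute().map(ScalaReflection.convertRowToScala(_, rowSchema)).collect()
}

With the lookup hoisted, the 10 million GenericRowWithSchema instances all reference the same StructType instead of allocating one apiece.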