Spotted this issue while running a simple micro-benchmark:
The YJP profiling result showed that 10 million StructType, 10 million StructField[], and 20 million StructField instances were allocated.
It turned out that DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:
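The snippet itself is not reproduced here. As a runnable sketch (with hypothetical stand-in types for Spark's internals, not the verbatim source), the shape of that line is roughly `execute().map(convertRowToScala(_, schema)).collect()`, where `schema` is evaluated inside the per-row closure:

```scala
// Hypothetical minimal stand-ins for Spark's StructType / Row / SparkPlan,
// just to make the shape of executeCollect() runnable in isolation.
object Sketch {
  case class StructType(fields: Seq[String])
  case class Row(values: Seq[Any], schema: StructType)

  // Like QueryPlan.schema: a def, so every call constructs a fresh StructType
  def schema: StructType = StructType(Seq("a", "b"))

  def execute(): Iterator[Seq[Any]] = (1 to 3).iterator.map(i => Seq[Any](i, i))

  def convertRowToScala(r: Seq[Any], s: StructType): Row = Row(r, s)

  // Roughly the shape of SparkPlan.executeCollect() in 1.3 (a sketch):
  // `schema` sits inside the closure, so it is re-evaluated once per row.
  def executeCollect(): Array[Row] =
    execute().map(convertRowToScala(_, schema)).toArray

  def main(args: Array[String]): Unit = {
    val rows = executeCollect()
    // Count how many distinct schema instances the rows carry
    println(rows.map(r => System.identityHashCode(r.schema)).distinct.length)
  }
}
```

Running this shows one distinct schema instance per collected row, which is exactly the allocation pattern YJP reported.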
The problem is that QueryPlan.schema is a method, re-evaluated on every call, and since 1.3.0 convertRowToScala returns a GenericRowWithSchema. Together these mean the 10 million collected rows each carry a separate schema object.
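A minimal self-contained illustration of the difference (hypothetical types, not Spark's actual classes): evaluating the schema once outside the per-row closure and sharing that single instance avoids the per-row allocations.

```scala
// Toy model: `schema` is a def, so each call allocates a new StructType.
object SchemaOnce {
  case class StructType(fields: Seq[String])
  case class Row(values: Seq[Any], schema: StructType)

  def schema: StructType = StructType(Seq("a", "b"))

  val rows: Seq[Seq[Any]] = (1 to 3).map(i => Seq[Any](i, i))

  // Problematic: `schema` is called inside the closure, once per row
  val perRow: Seq[Row] = rows.map(r => Row(r, schema))

  // Fixed: evaluate once, share the single instance across all rows
  val sharedSchema: StructType = schema
  val oneSchema: Seq[Row] = rows.map(r => Row(r, sharedSchema))

  def main(args: Array[String]): Unit = {
    // Distinct schema instances carried by the rows in each case
    println(perRow.map(r => System.identityHashCode(r.schema)).distinct.length)
    println(oneSchema.map(r => System.identityHashCode(r.schema)).distinct.length)
  }
}
```

The first count equals the row count; the second is exactly one, which is the behavior a fix would restore for collect().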