Spark / SPARK-6748

QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0
    • Component/s: None
    • Labels: None

      Description

      Spotted this issue while running a simple micro-benchmark:

      import sqlContext.implicits._  // required for toDF in 1.3.0 (auto-imported in spark-shell)

      // Write 10 million (key, value) rows to Parquet.
      sc.parallelize(1 to 10000000).
        map(i => (i, s"val_$i")).
        toDF("key", "value").
        saveAsParquetFile("file:///tmp/src.parquet")

      // Read them back and collect all rows to the driver.
      sqlContext.parquetFile("file:///tmp/src.parquet").collect()


      YJP profiling showed that 10 million StructType instances, 10 million StructField[] arrays, and 20 million StructField instances were allocated.

      It turned out that DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:

      execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
      

      The problem is a combination of two facts: QueryPlan.schema is a def rather than a lazy val, and since 1.3.0 convertRowToScala returns a GenericRowWithSchema. Because convertRowToScala(_, schema) expands to row => convertRowToScala(row, schema), the schema method is re-evaluated for every row, so each of the 10 million rows ends up carrying its own freshly allocated schema object.
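
      Here is a minimal, self-contained sketch of that re-evaluation (names like SchemaAllocationDemo, schemaAsDef, and convert are made up for illustration; this is not the actual QueryPlan source):

      import org.apache.spark.sql.types._

      object SchemaAllocationDemo extends App {
        var allocations = 0

        // Mimics QueryPlan.schema as a def (the 1.3.0 behavior):
        // every call builds a brand-new StructType.
        def schemaAsDef: StructType = {
          allocations += 1
          StructType(Seq(StructField("key", IntegerType), StructField("value", StringType)))
        }

        // The change the title suggests: a lazy val is evaluated once and cached.
        lazy val schemaAsLazyVal: StructType = {
          allocations += 1
          StructType(Seq(StructField("key", IntegerType), StructField("value", StringType)))
        }

        // Stand-in for ScalaReflection.convertRowToScala.
        def convert(row: Int, schema: StructType): Int = row

        val rows = 1 to 1000

        // convert(_, schemaAsDef) expands to row => convert(row, schemaAsDef),
        // so schemaAsDef is re-evaluated once per row: 1000 allocations.
        rows.map(convert(_, schemaAsDef))
        println(s"def: $allocations StructType allocations")       // 1000

        allocations = 0
        rows.map(convert(_, schemaAsLazyVal))
        println(s"lazy val: $allocations StructType allocations")  // 1
      }

      Declaring QueryPlan.schema as a lazy val, as the title suggests, means the StructType (and its StructFields) is built once and shared by all returned rows; this is the fix shipped in 1.4.0.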


    People

    • Assignee: Cheng Lian
    • Reporter: Cheng Lian
    • Votes: 0
    • Watchers: 3
