Description
Was doing some perf testing on reading parquet files and noticed that performance is 3x worse moving from Spark 1.1 to 1.2. In the profiler the culprit showed up as ScalaReflection.convertRowToScala.
Particularly this zip is the issue:
r.toSeq.zip(schema.fields.map(_.dataType))
I see there's currently a comment noting that this is slow, but it wasn't fixed. This alone produces a 3x degradation in parquet read performance, at least in my test case.
Edit: the map is part of the issue as well. This whole code block runs in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible fix is to use seq.view, which would allocate iterators instead of intermediate collections.
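To illustrate the allocation pattern (a minimal sketch; `DataType`, `StructField`, and the `convert*` helpers below are stand-ins, not Spark's actual catalyst classes): the eager version materializes a mapped Seq and a Seq of tuples on every row, while an iterator/view-based version only creates lightweight iterators per row.

```scala
// Hypothetical stand-ins for Spark's schema types, for illustration only.
case class DataType(name: String)
case class StructField(dataType: DataType)

object ConversionSketch {
  // Eager pattern as in the reported code:
  // r.toSeq.zip(schema.fields.map(_.dataType))
  // allocates two intermediate collections per row.
  def convertEager(row: Seq[Any], fields: Seq[StructField]): Seq[(Any, DataType)] =
    row.zip(fields.map(_.dataType))

  // Lazy alternative: iterators defer both the map and the zip,
  // so no intermediate collection is built per row.
  def convertLazy(row: Seq[Any], fields: Seq[StructField]): Iterator[(Any, DataType)] =
    row.iterator.zip(fields.iterator.map(_.dataType))
}
```

Both produce the same pairs; the difference is purely in per-row allocation, which is what dominates in a tight loop over millions of rows.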
Issue Links
- duplicates SPARK-6620 "Speed up toDF() and rdd() functions by constructing converters in ScalaReflection" (Resolved)