[SPARK-23333] SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 2.2.1
Fix Version/s: None
Component/s: ML, MLlib, SQL
Labels: None

Description

Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes oldDF.first() to establish some metadata/attributes: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88. When oldDF is sorted, this call to oldDF.first() can be very slow.

For the purpose of establishing metadata, an arbitrary row from oldDF is just as good as oldDF.first(). Is there a way to speed this up considerably by taking an arbitrary row instead of relying on oldDF.first()?
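
One possible way to sidestep the row lookup, sketched here under assumptions rather than taken from this ticket: if the length of each vector input column is already known, it can be attached to the column as ML attribute metadata before calling transform, so the assembler should not need to inspect a row to learn the size. The helper names `vector_size_metadata` and `with_vector_size` are hypothetical, and the metadata keys `ml_attr` and `num_attrs` are assumed to be the internal names Spark ML reads, not a documented public contract.

```python
from typing import Any, Dict


def vector_size_metadata(num_features: int) -> Dict[str, Any]:
    """Build metadata that records a vector column's length.

    Assumption: Spark ML reads the size from the "ml_attr" / "num_attrs"
    keys; these are internal names, not a documented public contract.
    """
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    return {"ml_attr": {"num_attrs": num_features}}


def with_vector_size(df, column: str, num_features: int):
    """Return df with size metadata attached to an existing vector column.

    pyspark is imported lazily so the helper above stays usable without it.
    """
    from pyspark.sql.functions import col

    metadata = vector_size_metadata(num_features)
    # Re-alias the column under its own name, carrying the new metadata.
    return df.withColumn(column, col(column).alias(column, metadata=metadata))
```

A hypothetical call would be `oldDF = with_vector_size(oldDF, "features", 10)` before `vectorAssembler.transform(oldDF)`, where the column name and the size 10 are placeholders. This only helps when the vector length is known up front and is the same for every row.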

People

Assignee: Unassigned

Reporter: V Luong

Votes: 1

Watchers: 6

Dates

Created: 05/Feb/18 08:09

Updated: 29/Mar/18 20:56

Resolved: 29/Mar/18 20:56