Spark / SPARK-23333

SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: ML, MLlib, SQL
    • Labels: None

    Description

      Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes oldDF.first() to establish metadata/attributes for the output column: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88. When oldDF is sorted, that call to oldDF.first() can be very slow, since the first row of a sorted DataFrame cannot be determined without examining all of its data.

      For the purpose of establishing metadata, an arbitrary row from oldDF is just as good as oldDF.first(). Is there a way to speed this up considerably by grabbing an arbitrary row instead of relying on oldDF.first()?
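
      One possible workaround, offered as a sketch and not as the project's recommended fix: if the length of each input vector is known ahead of time, it can be recorded in the column metadata before calling transform. This assumes that the linked code falls back to oldDF.first() only when an input vector column carries no size metadata. The column names ("key", "vec", "x"), the vector length of 2, and the sample rows below are hypothetical stand-ins for oldDF.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object AssemblerMetadataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assembler-metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for oldDF: a vector column with no size metadata.
    val oldDF = Seq(
      (3, Vectors.dense(1.0, 2.0), 0.5),
      (1, Vectors.dense(3.0, 4.0), 1.5)
    ).toDF("key", "vec", "x")

    // Assumption: every vector in "vec" has length 2 and we know this up front.
    // Record that size as ML attribute metadata on the column.
    val vecMeta = new AttributeGroup("vec", 2).toMetadata()

    val sortedDF = oldDF
      .withColumn("vec", $"vec".as("vec", vecMeta))
      .orderBy("key")

    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("vec", "x"))
      .setOutputCol("features")

    // With the size present in the metadata, the assembler should not need
    // to inspect a row of the sorted DataFrame to build its output attributes.
    val newDF = vectorAssembler.transform(sortedDF)
    newDF.show(false)

    spark.stop()
  }
}
```

      Where the pipeline allows it, running the assembler before the sort avoids the problem without touching metadata.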

          People

            Assignee: Unassigned
            Reporter: V Luong (MBALearnsToCode)
            Votes: 1
            Watchers: 6
