Description
Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes oldDF.first() in order to establish some metadata/attributes: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88. When oldDF is sorted, this call to oldDF.first() can be very slow, since the first row of a sorted DataFrame cannot be returned without evaluating the sort.
For the purpose of establishing metadata, an arbitrary row from oldDF is just as good as oldDF.first(). Is there a way to speed this up considerably by taking an arbitrary row instead of relying on oldDF.first()?
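
In the meantime, a possible workaround is sketched below. It is a minimal sketch and not a change to VectorAssembler itself. It rests on a reading of the linked code: the first row seems to be consulted only when an input vector column carries no attribute metadata giving its size. If that holds, attaching the size up front should remove the need to look at any row. The sketch assumes Spark 2.x or later (org.apache.spark.ml.linalg), a local SparkSession, and that the vector length is known in advance; the column names "x", "vec", "features" and the value vecSize are made up for illustration.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AssemblerMetadataSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assembler-metadata-sketch")
      .getOrCreate()

    // A small sorted DataFrame standing in for oldDF (hypothetical data).
    val oldDF = spark.createDataFrame(Seq(
      (3.0, Vectors.dense(1.0, 2.0)),
      (1.0, Vectors.dense(3.0, 4.0))
    )).toDF("x", "vec").sort("x")

    // Assumption: the length of every vector in "vec" is known in advance.
    val vecSize = 2

    // Attach an AttributeGroup that records only the number of attributes,
    // so the size can be read from the schema rather than from a row.
    val vecMeta = new AttributeGroup("vec", vecSize).toMetadata()
    val oldDFWithMeta = oldDF.withColumn("vec", col("vec").as("vec", vecMeta))

    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("x", "vec"))
      .setOutputCol("features")

    // With the size present in the column metadata, transform() should not
    // need to call first() on the sorted input (assumption from the linked code).
    val newDF = vectorAssembler.transform(oldDFWithMeta)
    newDF.printSchema()

    spark.stop()
  }
}
```

This only helps when the vector size is known without inspecting the data. When it is not, the question above still stands: the size has to come from some row, and an arbitrary one would do.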