Description
ml Word2Vec's findSynonyms methods depart from mllib in that they return distributed results (a DataFrame) rather than returning the results directly:
def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
}
What was the reason for this decision? I would think that most users would request a reasonably small number of results and want to use them directly on the driver, similar to the take method on DataFrames. Returning parallelized results creates a costly round trip for data that doesn't seem necessary.
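For illustration, here is a minimal sketch of the round trip described above, assuming a previously trained Word2VecModel saved at an illustrative path; the collect() at the end is the extra step a driver-side caller has to add just to get a handful of rows back:

import org.apache.spark.ml.feature.Word2VecModel

// Load a previously trained model (path is illustrative).
val model = Word2VecModel.load("/tmp/word2vec-model")

// findSynonyms returns a DataFrame("word", "similarity"), so even a small
// result set has to be collected back to the driver before it can be used.
val synonyms = model.findSynonyms("spark", 5).collect()

synonyms.foreach { row =>
  println(s"${row.getString(0)} -> ${row.getDouble(1)}")
}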
The original PR: https://github.com/apache/spark/pull/7263
MechCoder - do you perhaps recall the reason?
Issue Links
- relates to SPARK-19866: Add local version of Word2Vec findSynonyms for spark.ml: Python API (Resolved)