SPARK-17629

Add local version of Word2Vec findSynonyms for spark.ml


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels:
      None
    • Target Version/s:

      Description

      The spark.ml Word2Vec findSynonyms methods depart from their mllib counterparts in that they return distributed results (a DataFrame), rather than returning the results directly:

        def findSynonyms(word: String, num: Int): DataFrame = {
          val spark = SparkSession.builder().getOrCreate()
          spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
        }
      

      What was the reason for this decision? I would think that most users request a reasonably small number of results and want to use them directly on the driver, similar to the take method on DataFrames. Returning parallelized results forces a costly round trip for the data that doesn't seem necessary.
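
      To make the request concrete, here is a minimal sketch of what a "local" synonym lookup could look like: rank by cosine similarity on the driver and return a plain array, with no DataFrame involved. It uses only plain Scala collections. The names LocalSynonyms, cosine, and findSynonymsLocal are hypothetical and chosen for illustration; this is not Spark's implementation or API.

```scala
// Hypothetical sketch, not Spark's implementation: plain Scala collections only.
object LocalSynonyms {
  // Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns up to `num` words most similar to `word` as a local array,
  // sorted by descending similarity and excluding `word` itself.
  def findSynonymsLocal(
      vectors: Map[String, Array[Double]],
      word: String,
      num: Int): Array[(String, Double)] = {
    val query = vectors.getOrElse(
      word, throw new NoSuchElementException(s"$word not in vocabulary"))
    vectors.iterator
      .filter { case (w, _) => w != word }
      .map { case (w, v) => (w, cosine(query, v)) }
      .toArray
      .sortBy { case (_, sim) => -sim }
      .take(num)
  }
}

// Toy vocabulary (made-up vectors, for illustration only).
val vectors = Map(
  "king"  -> Array(1.0, 0.0),
  "queen" -> Array(0.9, 0.1),
  "apple" -> Array(0.0, 1.0))

// The result is already on the driver; no collect() is needed.
val synonyms: Array[(String, Double)] =
  LocalSynonyms.findSynonymsLocal(vectors, "king", 1)
```

      With a return type like this, a caller who wants a DataFrame can still build one from the array, while the common case of a handful of results avoids the round trip described above.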

      The original PR: https://github.com/apache/spark/pull/7263
      Manoj Kumar - do you perhaps recall the reason?


              People

              • Assignee: Asher Krim (akrim)
              • Reporter: Asher Krim (akrim)
              • Shepherd: Joseph K. Bradley
              • Votes: 0
              • Watchers: 3

                Dates

                • Created:
                  Updated:
                  Resolved: