Spark / SPARK-17629

Add local version of Word2Vec findSynonyms for spark.ml

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: ML
    • Labels: None

    Description

      The findSynonyms methods of the spark.ml Word2Vec model depart from their mllib counterparts in that they return distributed results (a DataFrame) rather than returning the results directly:

        def findSynonyms(word: String, num: Int): DataFrame = {
          val spark = SparkSession.builder().getOrCreate()
          spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
        }
      

      What was the reason for this decision? I would think that most users request a reasonably small number of results and want to use them directly on the driver, much like the take method on DataFrames. Returning parallelized results creates a costly round trip for the data that doesn't seem necessary.
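
      To illustrate what a driver-local lookup might look like, here is a minimal sketch in plain Scala that ranks words by cosine similarity over an in-memory map and returns an ordinary array. The names LocalSynonyms, Vectors, and findSynonymsLocal are hypothetical and chosen only for this example; this is not Spark's implementation, and it assumes the word vectors already fit in driver memory.

```scala
object LocalSynonyms {
  // Hypothetical in-memory model: word -> embedding vector (an assumption for this sketch).
  type Vectors = Map[String, Array[Double]]

  // Euclidean norm of a vector.
  private def norm(v: Array[Double]): Double =
    math.sqrt(v.map(x => x * x).sum)

  // Cosine similarity; returns 0.0 when either vector has zero length.
  private def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot / denom
  }

  /** Returns up to `num` words most similar to `word` as a local array, with no DataFrame involved. */
  def findSynonymsLocal(vectors: Vectors, word: String, num: Int): Array[(String, Double)] = {
    val query = vectors.getOrElse(
      word, throw new IllegalArgumentException(s"$word not in vocabulary"))
    vectors.iterator
      .filter { case (w, _) => w != word }           // exclude the query word itself
      .map { case (w, v) => (w, cosine(query, v)) }  // score every other word
      .toArray
      .sortBy { case (_, sim) => -sim }              // highest similarity first
      .take(num)
  }
}
```

      With a toy vocabulary such as Map("king" -> Array(1.0, 0.0), "queen" -> Array(0.9, 0.1), "apple" -> Array(0.0, 1.0)), calling LocalSynonyms.findSynonymsLocal(vectors, "king", 1) yields a one-element array whose word is "queen". The result is usable on the driver immediately, without parallelizing it and collecting it back.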

      The original PR: https://github.com/apache/spark/pull/7263
      MechCoder - do you perhaps recall the reason?

            People

              Assignee: Asher Krim (akrim)
              Reporter: Asher Krim (akrim)
              Shepherd: Joseph K. Bradley