SPARK-21207: ML/MLLIB Save Word2Vec Yarn Cluster

Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.0.1
    • Fix Version/s: None
    • Component/s: ML, MLlib, PySpark, Spark Core, YARN
    • Labels: None
    • Environment: OS: CentOS Linux release 7.3.1611 (Core)

      Clusters:

      • vendor_id : GenuineIntel
      • cpu family : 6
      • model : 79
      • model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

    Description

      Hello everyone,

      I have a question about the ML and MLlib Word2Vec libraries, because I have a problem saving a model on a YARN cluster.

      I already have working code with Word2Vec from MLlib:

      from pyspark import SparkContext
      from pyspark.mllib.feature import Word2Vec
      from pyspark.mllib.feature import Word2VecModel

      sc = SparkContext()
      # pathCorpus, pathModel, k (vector size) and itera (number of
      # iterations) are defined elsewhere in my script
      inp = sc.textFile(pathCorpus).map(lambda row: row.split(" "))
      word2vec = Word2Vec().setVectorSize(k).setNumIterations(itera)

      model = word2vec.fit(inp)
      model.save(sc, pathModel)

      This code works well on a YARN cluster when I use spark-submit like this:
      spark-submit --conf spark.driver.maxResultSize=2G --master yarn --deploy-mode cluster --driver-memory 16G --executor-memory 10G --num-executors 10 --executor-cores 4 MyCode.py
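
      For reference, loading the saved MLlib model back works through the companion Word2VecModel class (a minimal sketch; sc and pathModel are the same as above):

      from pyspark.mllib.feature import Word2VecModel

      sameModel = Word2VecModel.load(sc, pathModel)
      # in mllib, findSynonyms returns (word, cosine similarity) pairs
      for word, sim in sameModel.findSynonyms("Paris", 20):
          print(word, sim)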

      But I want to use the new ML library, so I do this:

      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.sql.functions import split
      from pyspark.ml.feature import Word2Vec
      from pyspark.ml.feature import Word2VecModel

      pathModel = "hdfs:///user/test/w2v.model"

      sc = SparkContext(appName = 'Test_App')
      sqlContext = SQLContext(sc)

      # corpusPath, k (vector size), minCount and itera (number of
      # iterations) are defined elsewhere in my script
      raw_text = sqlContext.read.text(corpusPath).select(split("value", " ")).toDF("words")

      numPart = raw_text.rdd.getNumPartitions() - 1

      word2Vec = Word2Vec(vectorSize=k, inputCol="words", outputCol="features", minCount=minCount, maxIter=itera).setNumPartitions(numPart)
      model = word2Vec.fit(raw_text)

      model.findSynonyms("Paris", 20).show()

      model.save(pathModel)
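
      For reference, reading the ML model back (when the save succeeds, e.g. in local mode) uses the Word2VecModel import already shown above (a minimal sketch, reusing pathModel):

      sameModel = Word2VecModel.load(pathModel)
      # in ml, findSynonyms returns a DataFrame of (word, similarity)
      sameModel.findSynonyms("Paris", 20).show()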

      This code works in local mode, but when I try to deploy it in cluster mode (as above) I have a problem: when one executor writes into the HDFS folder, the others cannot write inside it, so at the end I have an empty folder instead of a set of Parquet files as with MLlib. I don't understand why it works with MLlib but not with ML, with the same configuration, when I submit my code.

      Do you have an idea how I can solve this problem?
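
      For what it's worth, pyspark.ml also exposes an explicit writer with an overwrite option; whether this is related to the empty folder is only an assumption on my part, but it is a cheap variant to try:

      # assumption: save() may misbehave if the target path already
      # exists; the explicit writer can overwrite it
      model.write().overwrite().save(pathModel)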

      I hope I was clear enough.

      Thanks,

          People

            Assignee: Unassigned
            Reporter: offvolt
