SPARK-24666

Word2Vec generates infinity vectors when numIterations is large


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 2.4.4
    • Fix Version/s: 2.4.5, 3.1.0
    • Component/s: ML, MLlib
    • Labels:
      None
    • Environment:

       2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X

      Description

      We found that Word2Vec generates vectors with large absolute values when numIterations is large. If numIterations is large enough (>20), the vector values may be Infinity (or -Infinity), resulting in useless vectors.

      Under normal conditions, vector values mostly fall between -1.0 and 1.0 when numIterations = 1.

      The bug appears in Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, and 2.4.X.

      An existing issue already reports this bug: https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be missing.

      Other people's reports:

      https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec

      http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html

      =======================================================

      Here is the code to reproduce the issue. You can download title.akas.tsv from https://datasets.imdbws.com/ and upload it to HDFS.

       

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.ml.feature.Word2Vec
      
      case class Sentences(name: String, words: Array[String])
      
      import spark.implicits._
      
      // Raw IMDb data: title.akas.tsv, downloaded from https://datasets.imdbws.com/
      val dataset = spark.read
        .option("header", "true").option("sep", "\t")
        .option("quote", "").option("nullValue", "\\N")
        .csv("/tmp/word2vec/title.akas.tsv")
        .filter("region = 'US' or language = 'en'")
        .select("title")
        .as[String]
        .map(s => Sentences(s, s.split(' ')))
        .persist()
      
      println("Training model...")
      val word2Vec = new Word2Vec()
        .setInputCol("words")
        .setOutputCol("vector")
        .setVectorSize(64)
        .setWindowSize(4)
        .setNumPartitions(50)
        .setMinCount(5)
        .setMaxIter(20) // set to 30 to get the Infinity values shown below
      val model = word2Vec.fit(dataset)
      
      model.getVectors.show()
      

      When maxIter is set to 30, you get the following result.

      scala> model.getVectors.show()
      +-------------+--------------------+
      |         word|              vector|
      +-------------+--------------------+
      |     Unspoken|[-Infinity,-Infin...|
      |       Talent|[Infinity,-Infini...|
      |    Hourglass|[1.09657520526310...|
      |Nickelodeon's|[2.20436549446219...|
      |      Priests|[-1.9625896848389...|
      |    Religion:|[-3.8815759928213...|
      |           Bu|[-7.9722236466752...|
      |      Totoro:|[-4.1829056206528...|
      |     Trouble,|[2.51985378203136...|
      |       Hatter|[8.49108115961009...|
      |          '79|[-5.4560309784650...|
      |         Vile|[-1.2059769646379...|
      |         9/11|[Infinity,-Infini...|
      |      Santino|[6.30405421282099...|
      |      Motives|[1.96207712570869...|
      |          '13|[-1.7641987324084...|
      |       Fierce|[-Infinity,Infini...|
      |       Stover|[5.10057474120744...|
      |          'It|[1.08629989605664...|
      |        Butts|[Infinity,Infinit...|
      +-------------+--------------------+
      only showing top 20 rows
      

      In this case, setting maxIter to 20 may not produce Infinity, but it still produces very large absolute values. The outcome depends on the training data sample and other configuration.

      scala> model.getVectors.show(2,false)
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |word    |vector                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |Unspoken|[-8.345756381631837E26,-4.521902763541592E26,-2.3382486258889084E27,-1.0244081299466769E27,-2.0078509112460803E27,-1.6760533100889865E27,-2.582670788770659E27,-3.38100521565687E26,1.7553847873565714E27,-1.170131062449021E27,-1.6565472801835883E27,-1.5594244347657445E27,-2.5150639513558596E26,1.949539129915606E27,-7.580918216717454E26,1.2361994783015613E27,-3.152053008864166E27,-8.185652662597534E26,-5.4443628225426E25,2.245579525466733E26,-1.97655047590181E27,2.8597275293150673E26,-1.1006336920210832E27,1.6166580407985987E27,1.5272882143409825E26,-1.0115330404529906E27,-1.8895683222101184E27,2.6156506156954E27,-1.698058504881491E27,-1.5132098806248563E27,3.7327358519511804E27,1.3356636582642166E27,2.3614379909704805E26,8.96912646624494E26,1.5518857669716535E27,-3.05221863964144E27,4.399680909202177E26,-2.607914789100649E27,-1.4080384994067242E27,2.7666078487221474E27,6.946950108699123E26,-1.1122679059344192E27,-2.3621557537823886E27,9.433206702172274E26,-2.3704690372536228E27,2.5086034219659006E27,2.0173186657484236E27,-1.8448836672357273E27,-1.5081404202054957E27,2.641836064055936E26,-5.613083015733733E26,-2.1296579720982533E26,-1.6550184140347592E27,-1.9152898718506886E27,1.25699596863538E27,-2.0774912070471012E27,-1.5454685136432914E27,-2.479843324641509E27,1.5560216745669318E27,-2.2176656540799786E27,-9.628781296451031E26,1.3663974096305426E27,1.6326327735924786E27,-1.9533865304335714E27]|
      |Talent  |[1.3996313289146157E31,-2.216329024373106E31,1.0729251707928603E31,-4.007120754159977E31,-7.217488429248302E30,3.579654497535965E31,2.7979270365837212E31,4.333613174196825E31,3.2947832174019738E31,-1.770444782887265E31,-1.1996572271408077E31,1.9686960444755403E31,-5.211369239778517E31,4.559579301984929E31,8.789691017490939E30,-3.3896103915518896E31,-2.842517781869879E31,3.653230690058367E31,1.6690004323711066E31,-1.1803405268246773E31,4.577673536512265E31,3.9686553942166427E31,-2.0779652882517364E31,9.553626958941078E29,-1.1967228014988571E31,2.667234660143298E31,-5.082234231802067E29,-5.053934698852727E31,2.911363689445293E31,4.57440169967406E31,2.296044625777839E31,3.4719839372636273E31,-4.753091634806606E30,-2.2139650908254315E31,5.747913246328898E31,-4.027332301367786E31,-3.3981312029599884E30,-3.235915541756495E31,-3.690297564613571E31,3.6645060993927487E31,2.32138854666024E31,-4.79833731565554E31,2.4538652976104142E31,4.91394707312416E30,2.2888500664401483E31,8.433142525511996E30,-2.3447174299865074E31,-3.9894235308718024E31,1.6571656530599892E31,3.743449438983912E31,5.619889452742693E31,2.0932366809902723E31,-2.2306515916821173E30,-4.2788883664425833E30,-8.754273117753689E30,-3.8767150140313846E30,-3.7649840346087072E31,-3.604430948638639E31,5.083292737026576E31,2.92915351645125E31,5.971055806972711E31,1.4773152095869043E31,5.12252479772471E31,3.035571146004139E31]                     |
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      only showing top 2 rows
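
To quantify the problem instead of eyeballing the `show()` output, the vectors can be scanned for non-finite or very large components. The following is a minimal sketch, not part of the original report: `VectorHealth`, `hasNonFinite`, and `maxAbs` are hypothetical helper names, and the helpers work on plain `Array[Double]` values so they can be tried without a cluster.

```scala
// Hypothetical helpers (not part of Spark) for checking Word2Vec output.
object VectorHealth {
  /** True if any component is NaN, Infinity, or -Infinity. */
  def hasNonFinite(v: Array[Double]): Boolean =
    v.exists(x => x.isNaN || x.isInfinity)

  /** Largest absolute component, ignoring NaN entries; 0.0 for an empty vector. */
  def maxAbs(v: Array[Double]): Double =
    v.iterator
      .filterNot(_.isNaN)
      .map(x => math.abs(x))
      .foldLeft(0.0)((m, x) => math.max(m, x))
}

// Usage in spark-shell, assuming `model` is the fitted Word2VecModel from the
// reproduction code (getVectors has the columns "word" and "vector"):
//
//   import org.apache.spark.ml.linalg.Vector
//   val rows = model.getVectors.collect()
//     .map(r => (r.getString(0), r.getAs[Vector](1).toArray))
//   val broken = rows.count { case (_, v) => VectorHealth.hasNonFinite(v) }
//   val worst  = rows.map { case (_, v) => VectorHealth.maxAbs(v) }.max
//   println(s"non-finite vectors: $broken, largest |component|: $worst")
```

Collecting to the driver is only reasonable for vocabularies that fit in driver memory; for larger models the same predicates could be applied inside a UDF or a `map` over the DataFrame.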
      
      

       

              People

              • Assignee: L. C. Hsieh (viirya)
              • Reporter: Yu Zhong (zhongyu09)
