Spark / SPARK-24666

Word2Vec generates infinity vectors when numIterations is large


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.3.1, 2.4.4
    • Fix Version/s: 2.4.5, 3.1.0
    • Component/s: ML, MLlib
    • Labels: None
    • Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X

    Description

      We found that Word2Vec generates vectors with large absolute values when numIterations is large, and if numIterations is large enough (>20), the vector values may be Infinity (or -Infinity), resulting in useless vectors.

      In normal situations, vector values are mainly in the range -1.0 to 1.0 when numIterations = 1.

      The bug appears in Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, and 2.4.X.

      This bug has already been reported in https://issues.apache.org/jira/browse/SPARK-5261 , but the fix appears to be missing.

      Other people's reports:

      https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec

      http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html

      =======================================================

      Here is the code to reproduce the issue. You can download title.akas.tsv from https://datasets.imdbws.com/ and upload it to HDFS.

       

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.ml.feature.Word2Vec
      
      case class Sentences(name: String, words: Array[String])
      
      import spark.implicits._
      
      // IMDB raw data: title.akas.tsv, downloaded from https://datasets.imdbws.com/
      val dataset = spark.read
        .option("header", "true").option("sep", "\t")
        .option("quote", "").option("nullValue", "\\N")
        .csv("/tmp/word2vec/title.akas.tsv")
        .filter("region = 'US' or language = 'en'")
        .select("title")
        .as[String]
        .map(s => Sentences(s, s.split(' ')))
        .persist()
      
      println("Training model...")
      val word2Vec = new Word2Vec()
        .setInputCol("words")
        .setOutputCol("vector")
        .setVectorSize(64)
        .setWindowSize(4)
        .setNumPartitions(50)
        .setMinCount(5)
        .setMaxIter(20)
      val model = word2Vec.fit(dataset)
      
      model.getVectors.show()
      

      When maxIter is set to 30, you get the following result.

      scala> model.getVectors.show()
      +-------------+--------------------+
      |         word|              vector|
      +-------------+--------------------+
      |     Unspoken|[-Infinity,-Infin...|
      |       Talent|[Infinity,-Infini...|
      |    Hourglass|[1.09657520526310...|
      |Nickelodeon's|[2.20436549446219...|
      |      Priests|[-1.9625896848389...|
      |    Religion:|[-3.8815759928213...|
      |           Bu|[-7.9722236466752...|
      |      Totoro:|[-4.1829056206528...|
      |     Trouble,|[2.51985378203136...|
      |       Hatter|[8.49108115961009...|
      |          '79|[-5.4560309784650...|
      |         Vile|[-1.2059769646379...|
      |         9/11|[Infinity,-Infini...|
      |      Santino|[6.30405421282099...|
      |      Motives|[1.96207712570869...|
      |          '13|[-1.7641987324084...|
      |       Fierce|[-Infinity,Infini...|
      |       Stover|[5.10057474120744...|
      |          'It|[1.08629989605664...|
      |        Butts|[Infinity,Infinit...|
      +-------------+--------------------+
      only showing top 20 rows
      

      In this case, setting maxIter to 20 may not generate Infinity, but it still produces very large absolute values. The outcome depends on the training data sample and other configuration.

      scala> model.getVectors.show(2,false)
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |word    |vector                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |Unspoken|[-8.345756381631837E26,-4.521902763541592E26,-2.3382486258889084E27,-1.0244081299466769E27,-2.0078509112460803E27,-1.6760533100889865E27,-2.582670788770659E27,-3.38100521565687E26,1.7553847873565714E27,-1.170131062449021E27,-1.6565472801835883E27,-1.5594244347657445E27,-2.5150639513558596E26,1.949539129915606E27,-7.580918216717454E26,1.2361994783015613E27,-3.152053008864166E27,-8.185652662597534E26,-5.4443628225426E25,2.245579525466733E26,-1.97655047590181E27,2.8597275293150673E26,-1.1006336920210832E27,1.6166580407985987E27,1.5272882143409825E26,-1.0115330404529906E27,-1.8895683222101184E27,2.6156506156954E27,-1.698058504881491E27,-1.5132098806248563E27,3.7327358519511804E27,1.3356636582642166E27,2.3614379909704805E26,8.96912646624494E26,1.5518857669716535E27,-3.05221863964144E27,4.399680909202177E26,-2.607914789100649E27,-1.4080384994067242E27,2.7666078487221474E27,6.946950108699123E26,-1.1122679059344192E27,-2.3621557537823886E27,9.433206702172274E26,-2.3704690372536228E27,2.5086034219659006E27,2.0173186657484236E27,-1.8448836672357273E27,-1.5081404202054957E27,2.641836064055936E26,-5.613083015733733E26,-2.1296579720982533E26,-1.6550184140347592E27,-1.9152898718506886E27,1.25699596863538E27,-2.0774912070471012E27,-1.5454685136432914E27,-2.479843324641509E27,1.5560216745669318E27,-2.2176656540799786E27,-9.628781296451031E26,1.3663974096305426E27,1.6326327735924786E27,-1.9533865304335714E27]|
      |Talent  |[1.3996313289146157E31,-2.216329024373106E31,1.0729251707928603E31,-4.007120754159977E31,-7.217488429248302E30,3.579654497535965E31,2.7979270365837212E31,4.333613174196825E31,3.2947832174019738E31,-1.770444782887265E31,-1.1996572271408077E31,1.9686960444755403E31,-5.211369239778517E31,4.559579301984929E31,8.789691017490939E30,-3.3896103915518896E31,-2.842517781869879E31,3.653230690058367E31,1.6690004323711066E31,-1.1803405268246773E31,4.577673536512265E31,3.9686553942166427E31,-2.0779652882517364E31,9.553626958941078E29,-1.1967228014988571E31,2.667234660143298E31,-5.082234231802067E29,-5.053934698852727E31,2.911363689445293E31,4.57440169967406E31,2.296044625777839E31,3.4719839372636273E31,-4.753091634806606E30,-2.2139650908254315E31,5.747913246328898E31,-4.027332301367786E31,-3.3981312029599884E30,-3.235915541756495E31,-3.690297564613571E31,3.6645060993927487E31,2.32138854666024E31,-4.79833731565554E31,2.4538652976104142E31,4.91394707312416E30,2.2888500664401483E31,8.433142525511996E30,-2.3447174299865074E31,-3.9894235308718024E31,1.6571656530599892E31,3.743449438983912E31,5.619889452742693E31,2.0932366809902723E31,-2.2306515916821173E30,-4.2788883664425833E30,-8.754273117753689E30,-3.8767150140313846E30,-3.7649840346087072E31,-3.604430948638639E31,5.083292737026576E31,2.92915351645125E31,5.971055806972711E31,1.4773152095869043E31,5.12252479772471E31,3.035571146004139E31]                     |
      +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      only showing top 2 rows
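
      Because show() truncates the output, it can be easier to summarise the vectors numerically. The following is a minimal sketch, not part of Spark: VectorHealth, Summary and summarize are hypothetical names introduced here for illustration. It counts how many vectors contain NaN or Infinity entries and reports the largest finite absolute value, so both the maxIter = 30 (Infinity) and maxIter = 20 (very large values) symptoms show up in one result.

```scala
// Hypothetical helper (not a Spark API): summarises the numeric health of word vectors.
object VectorHealth {
  // total: number of vectors inspected
  // nonFinite: number of vectors with at least one NaN or +/-Infinity entry
  // maxFiniteAbs: largest absolute value among all finite entries (0.0 if there are none)
  final case class Summary(total: Int, nonFinite: Int, maxFiniteAbs: Double)

  private def isFinite(x: Double): Boolean = !x.isNaN && !x.isInfinity

  def summarize(vectors: Seq[Array[Double]]): Summary = {
    // A vector is considered broken if any component is NaN or infinite.
    val nonFinite = vectors.count(v => v.exists(x => !isFinite(x)))
    // Collect absolute values of the finite components only.
    val finiteAbs = vectors.flatMap(v => v.filter(isFinite).map(x => math.abs(x)))
    val maxAbs = if (finiteAbs.isEmpty) 0.0 else finiteAbs.max
    Summary(vectors.size, nonFinite, maxAbs)
  }
}
```

      In spark-shell, assuming the vocabulary is small enough to collect to the driver, the vectors from the reproduction above could be passed in as model.getVectors.collect().map(_.getAs[org.apache.spark.ml.linalg.Vector]("vector").toArray).toSeq. With values around -1.0 to 1.0 being normal, a maxFiniteAbs many orders of magnitude above 1.0 or a non-zero nonFinite count would indicate the problem described here.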
      
      

       


            People

              Assignee: L. C. Hsieh (viirya)
              Reporter: Yu Zhong (zhongyu09)