Description
We found that Word2Vec generates vectors with large absolute values when numIterations is large, and if numIterations is large enough (>20), the vector values may be Infinity (or -Infinity), resulting in useless vectors.
In normal situations, vector values are mainly in the range -1.0 to 1.0 when numIterations = 1.
The bug appears in Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, and 2.4.X.
An existing issue already reports this bug: https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be missing.
Other people's reports:
https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec
=======================================================
Here is the code to reproduce the issue. You can download title.akas.tsv from https://datasets.imdbws.com/ and upload it to HDFS.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Word2Vec

case class Sentences(name: String, words: Array[String])

import spark.implicits._

// IMDB raw data title.akas.tsv download from https://datasets.imdbws.com/
val dataset = spark.read
  .option("header", "true").option("sep", "\t")
  .option("quote", "").option("nullValue", "\\N")
  .csv("/tmp/word2vec/title.akas.tsv")
  .filter("region = 'US' or language = 'en'")
  .select("title")
  .as[String]
  .map(s => Sentences(s, s.split(' ')))
  .persist()

println("Training model...")
val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vector")
  .setVectorSize(64)
  .setWindowSize(4)
  .setNumPartitions(50)
  .setMinCount(5)
  .setMaxIter(20)
val model = word2Vec.fit(dataset)
model.getVectors.show()
When maxIter is set to 30, you get the following result:
scala> model.getVectors.show()
+-------------+--------------------+
|         word|              vector|
+-------------+--------------------+
|     Unspoken|[-Infinity,-Infin...|
|       Talent|[Infinity,-Infini...|
|    Hourglass|[1.09657520526310...|
|Nickelodeon's|[2.20436549446219...|
|      Priests|[-1.9625896848389...|
|    Religion:|[-3.8815759928213...|
|           Bu|[-7.9722236466752...|
|      Totoro:|[-4.1829056206528...|
|     Trouble,|[2.51985378203136...|
|       Hatter|[8.49108115961009...|
|          '79|[-5.4560309784650...|
|         Vile|[-1.2059769646379...|
|         9/11|[Infinity,-Infini...|
|      Santino|[6.30405421282099...|
|      Motives|[1.96207712570869...|
|          '13|[-1.7641987324084...|
|       Fierce|[-Infinity,Infini...|
|       Stover|[5.10057474120744...|
|          'It|[1.08629989605664...|
|        Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
In this case, setting maxIter to 20 may not generate Infinity, but it still produces very large absolute values. The outcome depends on the training data sample and other configurations.
scala> model.getVectors.show(2,false)
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|word |vector |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Unspoken|[-8.345756381631837E26,-4.521902763541592E26,-2.3382486258889084E27,-1.0244081299466769E27,-2.0078509112460803E27,-1.6760533100889865E27,-2.582670788770659E27,-3.38100521565687E26,1.7553847873565714E27,-1.170131062449021E27,-1.6565472801835883E27,-1.5594244347657445E27,-2.5150639513558596E26,1.949539129915606E27,-7.580918216717454E26,1.2361994783015613E27,-3.152053008864166E27,-8.185652662597534E26,-5.4443628225426E25,2.245579525466733E26,-1.97655047590181E27,2.8597275293150673E26,-1.1006336920210832E27,1.6166580407985987E27,1.5272882143409825E26,-1.0115330404529906E27,-1.8895683222101184E27,2.6156506156954E27,-1.698058504881491E27,-1.5132098806248563E27,3.7327358519511804E27,1.3356636582642166E27,2.3614379909704805E26,8.96912646624494E26,1.5518857669716535E27,-3.05221863964144E27,4.399680909202177E26,-2.607914789100649E27,-1.4080384994067242E27,2.7666078487221474E27,6.946950108699123E26,-1.1122679059344192E27,-2.3621557537823886E27,9.433206702172274E26,-2.3704690372536228E27,2.5086034219659006E27,2.0173186657484236E27,-1.8448836672357273E27,-1.5081404202054957E27,2.641836064055936E26,-5.613083015733733E26,-2.1296579720982533E26,-1.6550184140347592E27,-1.9152898718506886E27,1.25699596863538E27,-2.0774912070471012E27,-1.5454685136432914E27,-2.479843324641509E27,1.5560216745669318E27,-2.2176656540799786E27,-9.628781296451031E26,1.3663974096305426E27,1.6326327735924786E27,-1.9533865304335714E27]|
|Talent |[1.3996313289146157E31,-2.216329024373106E31,1.0729251707928603E31,-4.007120754159977E31,-7.217488429248302E30,3.579654497535965E31,2.7979270365837212E31,4.333613174196825E31,3.2947832174019738E31,-1.770444782887265E31,-1.1996572271408077E31,1.9686960444755403E31,-5.211369239778517E31,4.559579301984929E31,8.789691017490939E30,-3.3896103915518896E31,-2.842517781869879E31,3.653230690058367E31,1.6690004323711066E31,-1.1803405268246773E31,4.577673536512265E31,3.9686553942166427E31,-2.0779652882517364E31,9.553626958941078E29,-1.1967228014988571E31,2.667234660143298E31,-5.082234231802067E29,-5.053934698852727E31,2.911363689445293E31,4.57440169967406E31,2.296044625777839E31,3.4719839372636273E31,-4.753091634806606E30,-2.2139650908254315E31,5.747913246328898E31,-4.027332301367786E31,-3.3981312029599884E30,-3.235915541756495E31,-3.690297564613571E31,3.6645060993927487E31,2.32138854666024E31,-4.79833731565554E31,2.4538652976104142E31,4.91394707312416E30,2.2888500664401483E31,8.433142525511996E30,-2.3447174299865074E31,-3.9894235308718024E31,1.6571656530599892E31,3.743449438983912E31,5.619889452742693E31,2.0932366809902723E31,-2.2306515916821173E30,-4.2788883664425833E30,-8.754273117753689E30,-3.8767150140313846E30,-3.7649840346087072E31,-3.604430948638639E31,5.083292737026576E31,2.92915351645125E31,5.971055806972711E31,1.4773152095869043E31,5.12252479772471E31,3.035571146004139E31] |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows
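To quantify how far a trained model has diverged without reading the raw output, a small check along these lines may help. It is only a sketch in plain Scala: `VectorHealth`, `Report`, and `summarize` are hypothetical names, not part of Spark, and the sample values are made up for illustration apart from one magnitude copied from the output above.

```scala
// Hypothetical helper, not part of Spark: summarizes how badly a set of
// word vectors has diverged.
object VectorHealth {
  // total: number of words; nonFinite: words whose vector contains NaN or
  // +/-Infinity; maxFiniteAbs: largest absolute value among finite components.
  final case class Report(total: Int, nonFinite: Int, maxFiniteAbs: Double)

  def summarize(vectors: Seq[(String, Array[Double])]): Report = {
    def isBad(x: Double): Boolean = x.isNaN || x.isInfinity

    val nonFinite = vectors.count { case (_, v) => v.exists(isBad) }

    // Fold over every finite component, keeping the largest magnitude seen.
    val maxFiniteAbs = vectors.iterator
      .flatMap { case (_, v) => v.iterator }
      .filterNot(isBad)
      .foldLeft(0.0)((m, x) => math.max(m, math.abs(x)))

    Report(vectors.size, nonFinite, maxFiniteAbs)
  }
}

// Illustrative input only; real input would come from the trained model.
val sample: Seq[(String, Array[Double])] = Seq(
  "Unspoken"  -> Array(Double.NegativeInfinity, 1.0),
  "Hourglass" -> Array(1.09, -0.5),
  "Talent"    -> Array(1.3996313289146157E31, -2.0)
)
val report = VectorHealth.summarize(sample)
println(report)
```

Against a real model, the input could presumably be built by collecting `model.getVectors` and converting each row's `vector` column with `toArray`. A healthy model should report `nonFinite = 0` and a `maxFiniteAbs` of roughly 1.0, per the normal range described above.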