Spark / SPARK-12488

LDA describeTopics() Generates Invalid Term IDs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2
    • Fix Version/s: 1.4.2, 1.5.3, 1.6.1, 2.0.0
    • Component/s: MLlib
    • Labels: None

      Description

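      Calling describeTopics() on a trained LDA model returns term ID lists that contain invalid values.

      The example below generates 10 topics on a data set with a vocabulary of 685 terms, so every valid term ID should lie between 0 and 684. The output instead includes negative values and values far larger than the vocabulary size.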

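          import org.apache.spark.mllib.clustering.LDA

          // Set LDA parameters
          val numTopics = 10
          val lda = new LDA().setK(numTopics).setMaxIterations(10)

          // docTermVector is expected to be an RDD[(Long, Vector)] of (document ID, term-count vector)
          val ldaModel = lda.run(docTermVector)
          val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]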
      
      scala> ldaModel.describeTopics()(0)._1.sorted.reverse
      res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
      
      scala> ldaModel.describeTopics()(0)._1.sorted
      res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4...
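
The symptom above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming only that describeTopics() returns an Array[(Array[Int], Array[Double])] of (term IDs, term weights) per topic; the object TermIdCheck and the method invalidTermIds are hypothetical names introduced for illustration and are not part of MLlib.

```scala
// Hypothetical helper for illustration; not part of the MLlib API.
object TermIdCheck {
  // One topic as returned by describeTopics(): (term IDs, term weights).
  type TopicDescription = (Array[Int], Array[Double])

  /** For each topic, returns the term IDs that fall outside [0, vocabSize). */
  def invalidTermIds(topics: Array[TopicDescription], vocabSize: Int): Array[Array[Int]] =
    topics.map { case (termIds, _) =>
      termIds.filter(id => id < 0 || id >= vocabSize)
    }
}
```

With the model from the example, a call such as TermIdCheck.invalidTermIds(ldaModel.describeTopics(), 685) would be expected to return an empty array for every topic once the bug is fixed, and non-empty arrays on an affected version.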
      

              People

              • Assignee: Xiangrui Meng (mengxr)
              • Reporter: Ilya Ganelin (ilganeli)
              • Votes: 0
              • Watchers: 7
