Spark / SPARK-5571

LDA should handle text as well

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: MLlib

    Description

      Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings).

      This plan is sketched in the original LDA design doc.

      There should be:

      • A runWithText() method that takes an RDD of collections of Strings (bags of words). It would also index the terms and compute a dictionary.
      • A dictionary parameter for when LDA is run with word count vectors
      • Prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is currently commented out in LDA)
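
To make the three items above concrete, here is a minimal sketch in plain Python (no Spark) of the term-indexing step and of mapping topic terms back to Strings. All function names here (build_dictionary, to_count_vector, describe_topic_as_strings) are hypothetical illustrations, not part of MLlib, and the topic weights in the usage note are made up for the example.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index (sorted order, an assumption)."""
    vocab = sorted({term for doc in docs for term in doc})
    return {term: i for i, term in enumerate(vocab)}


def to_count_vector(doc, dictionary):
    """Turn one bag of words into a dense word-count vector."""
    vec = [0] * len(dictionary)
    for term, n in Counter(doc).items():
        vec[dictionary[term]] = n
    return vec


def describe_topic_as_strings(term_weights, dictionary, k):
    """Map the k highest-weighted term indices of one topic back to Strings."""
    index_to_term = {i: term for term, i in dictionary.items()}
    top = sorted(range(len(term_weights)), key=lambda i: -term_weights[i])[:k]
    return [(index_to_term[i], term_weights[i]) for i in top]


# Hypothetical input: each document is already tokenized into a bag of words.
docs = [["spark", "lda", "topic"], ["lda", "text", "lda"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(doc, dictionary) for doc in docs]
```

With a dictionary like this, a runWithText()-style entry point could build the count vectors that LDA already accepts, and the same dictionary could be passed in separately when the caller supplies count vectors directly. For example, describe_topic_as_strings([0.5, 0.1, 0.3, 0.1], dictionary, 2) returns the two highest-weighted terms of a topic as Strings rather than indices.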

            People

              Assignee: Unassigned
              Reporter: Joseph K. Bradley (josephkb)
              Votes: 0
              Watchers: 12