Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings).
This plan is sketched in the original LDA design doc.
There should be:
- runWithText() method that takes an RDD of documents, each given as a collection of Strings (a bag of words). This method will also index the terms and compute a dictionary.
- dictionary parameter for when LDA is run with word count vectors
- prediction/feedback methods that return Strings (such as describeTopicsAsStrings, which is currently commented out in LDA)
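
To make the pieces above concrete, here is a minimal sketch in plain Python of the three steps involved: indexing terms into a dictionary, turning bags of words into word count vectors, and mapping term indices back to Strings. It is only an illustration, not the Spark API. Plain lists stand in for the RDD, and the names build_dictionary, to_count_vector and describe_topic_as_strings are hypothetical.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index, in order of first appearance."""
    term_to_index = {}
    for doc in docs:
        for term in doc:
            if term not in term_to_index:
                term_to_index[term] = len(term_to_index)
    return term_to_index


def to_count_vector(doc, term_to_index):
    """Turn one bag of words into a dense vector of term counts."""
    counts = [0] * len(term_to_index)
    for term, n in Counter(doc).items():
        counts[term_to_index[term]] = n
    return counts


def describe_topic_as_strings(term_weights, term_to_index, max_terms):
    """Map (term index, weight) pairs to (term, weight) pairs, heaviest terms first."""
    index_to_term = {i: t for t, i in term_to_index.items()}
    top = sorted(term_weights, key=lambda pair: pair[1], reverse=True)[:max_terms]
    return [(index_to_term[i], w) for i, w in top]


# Hypothetical input: two documents, each a bag of words.
docs = [["spark", "lda", "spark"], ["topic", "lda"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(d, dictionary) for d in docs]

# Hypothetical topic given as (term index, weight) pairs, as a trained model might report it.
topic = [(0, 0.5), (2, 0.3), (1, 0.2)]
described = describe_topic_as_strings(topic, dictionary, max_terms=2)
```

The same dictionary serves both directions: it is what a text-based entry point would compute before training, and what a caller would have to supply when training directly on word count vectors if String-valued output is wanted.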