  1. Spark
  2. SPARK-2199

Distributed probabilistic latent semantic analysis in MLlib

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:

      Description

      Probabilistic latent semantic analysis (PLSA) is a topic model that extracts topics from a text corpus. Historically, PLSA was a predecessor of LDA. However, recent research shows that modifications of PLSA sometimes perform better than LDA [1]. Furthermore, the most recent paper by the same authors shows that there is a clear way to extend PLSA to LDA and beyond [2].

      We should implement a distributed version of PLSA. In addition, it should be easy to add user-defined regularizers or combinations of them. We will implement regularizers that make it possible to:

      • extract sparse topics
      • extract human interpretable topics
      • perform semi-supervised training
      • filter out terms that are not topic-specific.
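
      To make the proposal concrete, here is a minimal single-machine sketch of PLSA trained with EM, with an optional regularizer hook in the M-step. It is not the proposed MLlib implementation. The names plsa, normalize and sparsing_regularizer are hypothetical, and the regularized update (add phi * dR/dphi to the expected counts, clip at zero, renormalize) is an assumption based on the additive regularization scheme in [2]. The per-document E-step is independent across documents, so a distributed version could plausibly run it as a map over partitions and sum the word-topic counts in a reduce; that mapping is also an assumption.

```python
import random


def normalize(values):
    """Clip negatives to zero and rescale so the values sum to 1.

    Falls back to a uniform distribution if everything was clipped
    (an assumption of this sketch, not something the issue specifies).
    """
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in clipped]


def sparsing_regularizer(beta):
    """Hypothetical hook for R = -beta * sum(ln phi): phi * dR/dphi = -beta."""
    return lambda phi_wt: -beta


def plsa(docs, n_words, n_topics, n_iters=50, regularizer=None, seed=0):
    """docs is a list of {word_id: count} dicts. Returns (phi, theta).

    phi[t][w] estimates p(w | t); theta[d][t] estimates p(t | d).
    """
    rng = random.Random(seed)
    phi = [normalize([rng.random() + 0.1 for _ in range(n_words)])
           for _ in range(n_topics)]
    theta = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in docs]

    for _ in range(n_iters):
        n_wt = [[0.0] * n_words for _ in range(n_topics)]
        n_td = [[0.0] * n_topics for _ in docs]
        # E-step: split each word count over topics in proportion to p(t | d, w).
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                joint = [phi[t][w] * theta[d][t] for t in range(n_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue  # word has zero probability under every topic
                for t in range(n_topics):
                    share = count * joint[t] / z
                    n_wt[t][w] += share
                    n_td[d][t] += share
        # M-step: renormalize expected counts, plus the optional regularizer term.
        for t in range(n_topics):
            if regularizer is not None:
                n_wt[t] = [n + regularizer(phi[t][w])
                           for w, n in enumerate(n_wt[t])]
            phi[t] = normalize(n_wt[t])
        theta = [normalize(row) for row in n_td]
    return phi, theta


if __name__ == "__main__":
    # Toy corpus: two documents over words 0-1, two over words 2-3.
    corpus = [{0: 5, 1: 5}, {0: 4, 1: 6}, {2: 5, 3: 5}, {2: 6, 3: 4}]
    phi, theta = plsa(corpus, n_words=4, n_topics=2,
                      regularizer=sparsing_regularizer(0.5))
    for t, row in enumerate(phi):
        print("topic", t, [round(p, 3) for p in row])
```

      Passing regularizer=None gives plain PLSA. Combining several regularizers would amount to summing their phi * dR/dphi terms before normalization, which is why the hook is a plain function here.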

      [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13.
      [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

              People

              • Assignee: Valeriy Avanesov (acopich)
              • Reporter: Denis Turdakov (turdakov)
              • Votes: 4
              • Watchers: 13

                Dates

                • Created:
                  Updated:
                  Resolved: