[SPARK-5560] LDA EM should scale to more iterations


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0
    • Component/s: MLlib
    • Labels: None

    Description

      Latent Dirichlet Allocation (LDA) sometimes fails to run for many iterations on large datasets, even when it is able to run for a few iterations. It should be able to run for as many iterations as the user likes, with proper persistence and checkpointing.

      Here is an example from a test on 16 workers (EC2 r3.2xlarge) on a big Wikipedia dataset:

      • 100 topics
      • Training set size: 4072243 documents
      • Vocabulary size: 9869422 terms
      • Training set size: 1041734290 tokens

      It runs for about 10-15 iterations before failing, even with a variety of checkpointInterval values and longer timeout settings (up to 5 minutes). The failures vary from lost connections between the driver and workers to workers running out of disk space. Based on rough calculations, I would not expect workers to run out of memory or disk space. There was some job imbalance, but not a significant amount.
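
      For reference, a run like the one above is driven through the MLlib LDA API. The sketch below (Scala) shows roughly how such a job might be configured; the checkpoint directory, corpus path, and the specific iteration and checkpoint-interval values are illustrative placeholders, not the settings from the actual test.

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      object LDAScaleTest {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("LDA EM scale test"))

          // Checkpointing needs a fault-tolerant directory (e.g. HDFS); this path is a placeholder.
          sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")

          // Corpus: (document id, term-count vector) pairs; loading is elided and the path is a placeholder.
          val corpus: RDD[(Long, Vector)] = sc.objectFile[(Long, Vector)]("hdfs:///path/to/corpus")

          val lda = new LDA()
            .setK(100)                  // 100 topics, as in the test above
            .setMaxIterations(100)      // far more than the ~10-15 iterations that currently succeed
            .setCheckpointInterval(10)  // one of several values tried in the test

          val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]
          println(s"Training log-likelihood: ${model.logLikelihood}")

          sc.stop()
        }
      }

      Periodic checkpointing is what truncates the lineage that EM accumulates across iterations, so getting persistence and checkpointing right inside the EM implementation is the main lever for keeping long runs stable.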


People

    Assignee: Joseph K. Bradley (josephkb)
    Reporter: Joseph K. Bradley (josephkb)
    Votes: 1
    Watchers: 5


Time Tracking

    Original Estimate: 336h
    Remaining Estimate: 336h
    Time Spent: Not Specified