Latent Dirichlet Allocation (LDA) sometimes fails partway through long runs on large datasets, even when the first few iterations succeed. It should be able to run for as many iterations as the user requests, given proper persistence and checkpointing.
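For context, here is a minimal sketch of the periodic persist-and-checkpoint pattern that long-running iterative Spark jobs generally need so that lineage and shuffle state do not grow without bound. This is illustrative only; the `iterate` helper, the storage level, and the checkpoint cadence are assumptions, not LDA's actual internals:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Assumes sc.setCheckpointDir(...) has already been called; all names are illustrative.
def iterate[T](initial: RDD[T], numIterations: Int, checkpointInterval: Int)
              (step: RDD[T] => RDD[T]): RDD[T] = {
  var current = initial
  for (i <- 1 to numIterations) {
    val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
    if (i % checkpointInterval == 0) {
      next.checkpoint()  // truncate lineage so old shuffle/cache state can be cleaned up
    }
    next.count()                          // materialize before dropping the old RDD
    current.unpersist(blocking = false)   // release the previous iteration's cache
    current = next
  }
  current
}
```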
Here is an example from a test on 16 workers (EC2 r3.2xlarge) on a big Wikipedia dataset:
- 100 topics
- Training set size: 4072243 documents
- Vocabulary size: 9869422 terms
- Total token count: 1041734290 tokens
It runs for about 10-15 iterations before failing, regardless of the checkpointInterval value and even with longer timeout settings (up to 5 minutes). The failure mode varies: sometimes the driver and workers lose their connections, and sometimes workers run out of disk space. Based on rough calculations, workers should not be running out of memory or disk space. There was some job imbalance, but not a significant amount.
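For reference, a minimal sketch of how such a run would be configured against the MLlib LDA API; the checkpoint directory path, max iteration count, and names here are assumptions, not the exact test harness:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainLda(sc: SparkContext, corpus: RDD[(Long, Vector)]): LDAModel = {
  // Checkpointing needs a reliable directory (e.g. on HDFS); this path is assumed.
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")

  new LDA()
    .setK(100)                  // 100 topics, matching the test above
    .setMaxIterations(100)      // runs fail after roughly 10-15 iterations
    .setCheckpointInterval(10)  // varied across runs; did not prevent the failures
    .run(corpus)                // corpus: (docId, termCountVector) pairs
}
```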