Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.1
None
Spark 2.3.1
JDK8
SBT 1.1.6
Ubuntu 18.04
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Description
This looks like a bug. A reproducible example can be found at: https://github.com/purijatin/spark-retrain-bug
More information is also available at: http://apache-spark-developers-list.1001551.n3.nabble.com/Retraining-with-each-document-as-separate-file-creates-OOME-td24335.html
Issue:
The program takes as input a set of individual JSON files. Each file contains a single document, and all documents share the same JSON schema.
The Spark program computes the tf-idf of the documents (Tokenizer -> Stopword remover -> stemming -> tf -> tfidf).
Once the training is complete, it unpersists and starts the whole operation again.
When run with `-Xmx1500M`, it fails with an OutOfMemoryError (OOME) after about 10 iterations.
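For readers unfamiliar with the computation itself, the tf-idf step can be sketched in plain Python. This is only an illustration of what the pipeline computes, not the Spark code from the linked repository: the stop-word list and the function names are hypothetical, stemming is left out, and the smoothed idf `log((N + 1) / (df + 1))` is assumed to match the formula given in the Spark ML `IDF` documentation.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per input document.

    tf is the raw term count in the document; idf is the smoothed
    log((N + 1) / (df + 1)), where N is the number of documents and
    df is the number of documents containing the term.
    """
    # Tokenize and drop stop words (stemming is omitted in this sketch).
    token_lists = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    tfs = [Counter(tokens) for tokens in token_lists]

    # Document frequency: count each term once per document.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    n = len(docs)
    idf = {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]
```

A term that appears in every document gets an idf of zero, so only terms that distinguish documents carry weight.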
Temporary Fix:
When all the input documents are merged into a single file, the issue no longer occurs.
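One possible way to do the merge is sketched below, assuming each input file holds exactly one JSON object. It writes the documents to a single JSON Lines file (one object per line), which is the layout Spark's JSON reader expects by default. The function name and the paths in the usage note are hypothetical and do not come from the linked repository.

```python
import json
from pathlib import Path


def merge_json_documents(input_dir, output_file):
    """Write every *.json file in input_dir as one line of output_file.

    Returns the number of documents written.
    """
    out_path = Path(output_file)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(input_dir).glob("*.json")):
            # Skip the output file itself if it lives in the input directory.
            if path.resolve() == out_path.resolve():
                continue
            with open(path, encoding="utf-8") as f:
                doc = json.load(f)
            # json.dumps emits no raw newlines, so each document stays on one line.
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

For example, `merge_json_documents("docs", "merged/all-docs.json")` would produce one file (the `merged` directory must already exist) that can then be used as the program's input instead of the directory of individual files.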