[SPARK-24784] Retraining (each document as separate file) creates OOME - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.1
Fix Version/s: None
Component/s: Spark Core
Labels:
- bulk-closed
Environment:

Spark 2.3.1
JDK8
SBT 1.1.6
Ubuntu 18.04
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

Description

This looks like a bug. A reproducible example can be found at: https://github.com/purijatin/spark-retrain-bug

Further information is available at: http://apache-spark-developers-list.1001551.n3.nabble.com/Retraining-with-each-document-as-separate-file-creates-OOME-td24335.html

Issue:
The program takes as input a set of individual JSON files. Each file contains a single document, and all documents share the same JSON schema.
The Spark program computes the TF-IDF of the documents (tokenizer -> stop-word remover -> stemming -> TF -> TF-IDF).
Once training is complete, it unpersists the data and starts the whole operation again.
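
For readers who do not want to open the linked repository, the loop described above might look roughly like the following. This is a minimal sketch, not the reporter's code: the input directory `data/docs`, the column name `text`, the iteration count, and the `multiLine` option are all assumptions, and the stemming stage is left out because Spark ML 2.3 does not ship a stemmer. Whether this exact sketch reproduces the OOME has not been verified.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

object RetrainSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical input directory: one JSON file per document.
    val inputDir = args.headOption.getOrElse("data/docs")

    val spark = SparkSession.builder()
      .appName("retrain-sketch")
      .master("local[*]")
      .getOrCreate()

    // Tokenizer -> stop-word remover -> TF -> TF-IDF.
    // The stemming stage from the report is omitted (no built-in stemmer).
    // The "text" column name is an assumption about the JSON schema.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("tokens")
    val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
    val tf = new HashingTF().setInputCol("filtered").setOutputCol("tf")
    val idf = new IDF().setInputCol("tf").setOutputCol("tfidf")
    val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))

    for (i <- 1 to 20) {
      // multiLine lets each file hold one (possibly pretty-printed) JSON document.
      val docs = spark.read.option("multiLine", "true").json(inputDir).cache()
      val model = pipeline.fit(docs)
      val count = model.transform(docs).count()
      println(s"iteration $i: $count documents")
      // Release the cached input before the next retraining round.
      docs.unpersist()
    }

    spark.stop()
  }
}
```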

When run with a maximum heap size of 1500 MB (`-Xmx1500M`), it fails with an OutOfMemoryError (OOME) after about 10 iterations.

Temporary Fix:
When all the input documents are merged into a single file, the issue no longer occurs.
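
One way to apply this workaround is to concatenate the per-document files into a single JSON Lines file (one document per line), which `spark.read.json` reads without the `multiLine` option. The sketch below is illustrative only: the object name `MergeDocs` and the `.json` file extension are assumptions, and it relies on the fact that raw line breaks cannot occur inside JSON strings, so collapsing them to spaces keeps each document valid.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object MergeDocs {
  /** Writes every *.json file in `inputDir` as one line of `output`.
    * Returns the number of documents merged. */
  def merge(inputDir: Path, output: Path): Int = {
    val stream = Files.list(inputDir)
    val files = try {
      stream.iterator().asScala
        .filter(_.getFileName.toString.endsWith(".json"))
        .toList
        .sortBy(_.toString)
    } finally stream.close()

    // Collapse line breaks so that each document occupies exactly one line.
    val lines = files.map { f =>
      new String(Files.readAllBytes(f), UTF_8).replaceAll("[\\r\\n]+", " ").trim
    }
    Files.write(output, lines.asJava, UTF_8)
    lines.size
  }

  def main(args: Array[String]): Unit = {
    // args(0): directory of per-document files, args(1): merged output file.
    val n = merge(Paths.get(args(0)), Paths.get(args(1)))
    println(s"merged $n documents into ${args(1)}")
  }
}
```

The output file should live outside the input directory (or use a different extension such as `.jsonl`) so that a second run does not pick it up as a document.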

People

Assignee: Unassigned

Reporter: Jatin Puri

Votes: 1

Watchers: 2

Dates

Created: 11/Jul/18 12:24

Updated: 08/Oct/19 05:43

Resolved: 08/Oct/19 05:43