Affects Version/s: 2.3.0
Fix Version/s: None
Java 8, Scala 2.11.8, Spark 2.3.0, sbt 0.13.16
I am trying to run a few (union + reduceByKey) operations on a hierarchical dataset in an iterative fashion using RDDs. The first few loops run fine, but on subsequent loops the operations end up consuming all of the scratch space provided to them.
I have set the scratch directory, i.e. SPARK_LOCAL_DIRS, to one with 100 GB of space.
The hierarchical dataset is small (< 400 kB) and remains constant throughout the iterations.
I have also tried enabling worker cleanup, i.e. "spark.worker.cleanup.enabled=true", but it has no effect.
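For reference, the iterative pattern looks roughly like the sketch below. Since the scratch-space growth only shows up on a real Spark runtime, this is a minimal plain-Scala simulation (union as ++, reduceByKey as groupBy + reduce); the names and values are placeholders, not the actual job code.

```scala
// Minimal simulation of the iterative (union + reduceByKey) pattern.
// In the real job these are operations on RDD[(String, Double)]; plain
// Scala collections stand in here, since the shuffle-file accumulation
// under SPARK_LOCAL_DIRS only appears on a Spark runtime.
object IterativeUnionReduce {
  type Pairs = Seq[(String, Double)]

  // Plain-Scala equivalent of rdd.reduceByKey(_ + _)
  def reduceByKey(pairs: Pairs): Pairs =
    pairs.groupBy(_._1).mapValues(_.map(_._2).sum).toSeq

  def main(args: Array[String]): Unit = {
    val base: Pairs = Seq(("a", 1.0), ("b", 2.0))
    var aggregated: Pairs = base
    // Each loop unions the constant dataset back in and re-aggregates.
    // On Spark, every iteration adds another shuffle stage whose files
    // pile up in the scratch directory.
    for (_ <- 1 to 3) {
      aggregated = reduceByKey(aggregated ++ base)
    }
    println(aggregated.sortBy(_._1))
  }
}
```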
What I am trying to do (High Level):
I have a dataset of 5 different CSVs (Parent, Child1, Child2, Child21, Child22), which are related hierarchically as shown below.
Parent-> Child1 -> Child2 -> Child21
Parent-> Child1 -> Child2 -> Child22
Each element in the tree has 14 columns (elementid, parentelement_id, cat1, cat2, num1, num2, ..., num10).
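As a sketch, each row can be modelled like the hypothetical case class below (field names follow the column list above; the concrete types are assumptions, not taken from the actual code).

```scala
// Hypothetical row model for one tree element. The column names match
// the description above; the types (String/Double) are assumptions.
case class TreeElement(
  elementid: String,
  parentelement_id: String,
  cat1: String,
  cat2: String,
  num1: Double, num2: Double, num3: Double, num4: Double, num5: Double,
  num6: Double, num7: Double, num8: Double, num9: Double, num10: Double
)
```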
I am trying to aggregate the values of one column of Child21 into Child1 (i.e. two levels up), and doing the same for another column of Child22. I then merge these aggregated values at the Child1 level.
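That roll-up can be sketched in plain Scala as follows. In the real job these would be reduceByKey/join steps on RDDs; the parent-link maps, keys, and values below are illustrative assumptions, not the actual data.

```scala
// Illustrative two-level roll-up: aggregate one Child21 column and one
// Child22 column up to their Child1 ancestor, then merge per Child1 key.
object RollUpSketch {
  // (childId -> parentId) links; in the real data these come from the
  // parentelement_id column of each CSV. Values here are made up.
  val child2Parent  = Map("c2"  -> "c1")   // Child2  -> Child1
  val child21Parent = Map("c21" -> "c2")   // Child21 -> Child2
  val child22Parent = Map("c22" -> "c2")   // Child22 -> Child2

  // Re-key each leaf row to its Child1 ancestor (two levels up) and sum,
  // the plain-Scala analogue of map + reduceByKey(_ + _).
  def toChild1(leafParent: Map[String, String],
               rows: Seq[(String, Double)]): Map[String, Double] =
    rows
      .map { case (id, v) => (child2Parent(leafParent(id)), v) }
      .groupBy(_._1).mapValues(_.map(_._2).sum).toMap

  def main(args: Array[String]): Unit = {
    val child21Rows = Seq(("c21", 5.0), ("c21", 7.0))  // one num column
    val child22Rows = Seq(("c22", 3.0))                // another num column
    val agg21 = toChild1(child21Parent, child21Rows)
    val agg22 = toChild1(child22Parent, child22Rows)
    // Merge the two aggregates at the Child1 level (like an RDD join).
    val merged = (agg21.keySet ++ agg22.keySet).map { k =>
      k -> (agg21.getOrElse(k, 0.0), agg22.getOrElse(k, 0.0))
    }.toMap
    println(merged)
  }
}
```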
This is present in the code at location:
Code which replicates the issue:
Steps to reproduce the issue:
1] Clone the above repository.
2] Put the CSVs from the "issue-data" folder of the above repository at the Hadoop location "hdfs:///tree/dummy/data/"
3] Set the Spark scratch directory (SPARK_LOCAL_DIRS) to a folder with large free space (> 100 GB)
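For example (the path below is a placeholder, not from the original report; SPARK_LOCAL_DIRS can also be set in conf/spark-env.sh):

```shell
# Point Spark's scratch space at a directory with plenty of room.
# /tmp/spark-scratch is a placeholder; in practice use a large disk.
export SPARK_LOCAL_DIRS=/tmp/spark-scratch
mkdir -p "$SPARK_LOCAL_DIRS"
```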
4] Run "sbt assembly"
5] Run the following command from the project directory:
spark-submit \
  --class spark.rddexample.dummyrdd.FunctionExecutor \
  --master local \
  --deploy-mode client \
  --executor-memory 2G \
  --driver-memory 2G \
  <path-to-assembly-jar>