Running the following code to store the data for each (year, pos) combination in a separate folder is taking an enormous amount of time for a very large DataFrame (more than 37 hours for about 60% of the work).
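For context, since the snippet itself isn't shown here, the write produces a Hive-style directory per (year, pos) pair. Below is a pure-Python, stdlib-only sketch of that target layout (sample rows and file names are hypothetical, not the actual job):

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in rows; the real job processes a ~67 GB DataFrame.
rows = [
    {"year": 2015, "pos": "A", "value": 1},
    {"year": 2015, "pos": "B", "value": 2},
    {"year": 2016, "pos": "A", "value": 3},
]

out = Path(tempfile.mkdtemp())

# Group rows by (year, pos), mirroring what partitionBy("year", "pos") does.
groups = {}
for r in rows:
    groups.setdefault((r["year"], r["pos"]), []).append(r)

# One directory per group, named key=value as in Spark's partitioned output.
for (year, pos), grp in groups.items():
    d = out / f"year={year}" / f"pos={pos}"
    d.mkdir(parents=True, exist_ok=True)
    with open(d / "part-00000.csv", "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["value"])
        w.writeheader()
        for r in grp:
            w.writerow({"value": r["value"]})

written = sorted(str(p.relative_to(out)) for p in out.rglob("part-*.csv"))
```

In PySpark itself this layout would come from something like `df.write.partitionBy("year", "pos").format(...).save(path)`.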
The job is currently at:
[Stage 12:==============================> (1367 + 30) / 2290]
and it has already been running for more than 37 hours. By contrast, a single sweep over this data for a filter-by-value takes less than 6.5 minutes.
The Spark web UI shows the following rows for the two stages of the job (the Output, Shuffle Read, and Shuffle Write columns are empty):
Stage  Description                                Submitted            Duration  Tasks: succeeded/total  Input
11     load at NativeMethodAccessorImpl.java:-2   2016/02/27 23:07:04  6.5 min   2290/2290               66.8 GB
12     save at NativeMethodAccessorImpl.java:-2   2016/02/27 23:15:59  37.1 h    1370/2290               40.9 GB