Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Duplicate
-
1.4.0, 1.4.1
-
None
Description
After upgrading to spark 1.4.x, locality seems to be entirely broken for NewHadoopRDD with a spark cluster that is co-located with an HDFS cluster. Whereas an identical job run in spark 1.2.x or 1.3.x for us would run all partitions with locality level NODE_LOCAL, after upgrading to 1.4.x the locality level switched to ANY for all partitions.
Furthermore it appears to be somehow launching the tasks in order of their locations or something to that effect because there are hotspots of 1 node at a time with completely maxed resources during the read. To test this theory i wrote a job that scans for all the files in the driver, parallelizes the list and then loads the files back through the hadoop API in a mapPartitions function (which correct me if i'm wrong but this should be identical to using ANY locality?) and the result was that my hack was 4x faster than letting spark parse the files itself!
As for performance effect, this has caused a 12x slowdown for us from 1.3.1 to 1.4.1. Needless to say we have downgraded back for now and everything appears to work normally again now.
We were able to reproduce this behavior on multiple clusters and also on both hadoop 2.4 and hadoop 2.6 (I saw that there were 2 different code paths depending on the existence of of hadoop 2.6 for figuring out preferred locations). The only thing that has fixed the problem for us is to downgrade back to 1.3.1.
Not sure how helpful it will be but through reflection i checked the results of calling on the RDD the getPreferredLocations method and it returned me an empty List on both 1.3.1 where it works and on 1.4.1 where it doesn't. I also tried called the function getPreferredLocs on the spark context with the RDD and that actually properly gave me back the 3 locations of the partition i passed it in both 1.3.1 and 1.4.1. So as far as i can tell the logic for getPreferredLocs and getPreferredLocations seems to match across versions and it appears to be that the use of this information in the scheduler is what must have changed. However I could not find many references to either of these 2 functions so I was not able to debug much further.
Attachments
Issue Links
- duplicates
-
SPARK-10149 Locality Level is ANY on "Details for Stage" WebUI page
- Resolved