Details
Bug
Status: Open
Major
Resolution: Unresolved
Description
From the user list:
I was trying to determine the effect of changing the JoinStrategy on a Spark pipeline. I noticed that my pipeline works fine with DefaultJoinStrategy, but I could not get it working with MapSideJoinStrategy or BloomFilterJoinStrategy. For MapSideJoinStrategy I get an exception [1] on the driver itself, and for BloomFilterJoinStrategy I get exceptions [2] in one of the stages. I have not tried any configuration changes, but I did run tests with datasets of different sizes to ensure that my PCollection is small enough to fit in memory. I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.
[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff
The bug is in the SparkRuntime.distributeFiles method, which needs to include a scheme for the URI it's handing to Spark.
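To illustrate what "include a scheme" means here, the following is a minimal sketch using only java.net.URI. It is not the Crunch code or the actual patch: the class QualifyUri, the helper qualify, the host namenode:8020, and the sample path are all hypothetical names chosen for the example. It only shows how a bare path has no scheme, and how one could be filled in from a default filesystem URI before the URI is handed to another system.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class QualifyUri {

    // Hypothetical helper: if the URI already has a scheme, return it unchanged;
    // otherwise borrow the scheme and authority from the default filesystem URI.
    static URI qualify(URI uri, URI defaultFs) {
        if (uri.getScheme() != null) {
            return uri;
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                           uri.getPath(), uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify " + uri, e);
        }
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        URI defaultFs = URI.create("hdfs://namenode:8020");
        URI bare = URI.create("/tmp/crunch-123/p1/part-00000");

        System.out.println(bare.getScheme());          // no scheme on a bare path
        System.out.println(qualify(bare, defaultFs));  // scheme and authority added
    }
}
```

In a Hadoop-based code base, the same effect would more likely come from qualifying the Path against its FileSystem (for example with FileSystem.makeQualified) rather than from hand-built URIs; whether that is what the eventual fix does is not stated in this issue.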