Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
Description
Use branch code (119f313) to test the following Pig script in Spark mode:
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name);
store D into './testFRJoin.out';
cat bin/SkewedJoinInput1.txt
100 apple1 aaa
200 orange1 bbb
300 strawberry ccc
cat bin/SkewedJoinInput2.txt
100 apple1
100 apple2
100 apple2
200 orange1
200 orange2
300 strawberry
400 pear
The following exception is found in the log:
[dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
java.lang.NullPointerException
	at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
	at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
	at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
	at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but PigSplit does not override that method: it inherits the base InputSplit#getLocationInfo, which returns null:
@Evolving
public SplitLocationInfo[] getLocationInfo() throws IOException {
  return null;
}
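One possible fix (a sketch only, not a committed patch) is for PigSplit to override getLocationInfo() and build the SplitLocationInfo array from getLocations(), so Spark's convertSplitLocationInfo never iterates over null. The classes below are hypothetical stand-ins for the Hadoop types, included so the pattern is self-contained:

```java
import java.io.IOException;

// Minimal stand-in for org.apache.hadoop.mapred.SplitLocationInfo
// (hypothetical, for illustration only).
class SplitLocationInfo {
    private final String location;
    private final boolean inMemory;

    SplitLocationInfo(String location, boolean inMemory) {
        this.location = location;
        this.inMemory = inMemory;
    }

    String getLocation() { return location; }
}

// Minimal stand-in for org.apache.hadoop.mapreduce.InputSplit.
abstract class InputSplit {
    abstract String[] getLocations() throws IOException;

    // The inherited default returns null -- this is what trips Spark's
    // NullPointerException when it iterates over the result.
    public SplitLocationInfo[] getLocationInfo() throws IOException {
        return null;
    }
}

// Sketch of a PigSplit that overrides getLocationInfo() instead of
// inheriting the null-returning default.
class PigSplitSketch extends InputSplit {
    private final String[] hosts;

    PigSplitSketch(String[] hosts) { this.hosts = hosts; }

    @Override
    String[] getLocations() { return hosts; }

    @Override
    public SplitLocationInfo[] getLocationInfo() throws IOException {
        String[] locations = getLocations();
        SplitLocationInfo[] info = new SplitLocationInfo[locations.length];
        for (int i = 0; i < locations.length; i++) {
            // Splits are read from disk, not cached in memory.
            info[i] = new SplitLocationInfo(locations[i], false);
        }
        return info;
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        InputSplit split = new PigSplitSketch(new String[] {"host1", "host2"});
        SplitLocationInfo[] info = split.getLocationInfo();
        System.out.println(info.length);           // non-null array, no NPE
        System.out.println(info[0].getLocation());
    }
}
```

With this override in place, getPreferredLocations would receive a real (possibly empty) array rather than null, avoiding the NPE in the DAG scheduler's event loop.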
Attachments
Issue Links
- links to