Pig / PIG-4059 (Pig on Spark) / PIG-4886

Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None

      Description

      Use branch code (commit 119f313) to test the following Pig script in spark mode:

      A = load './SkewedJoinInput1.txt' as (id,name,n);
      B = load './SkewedJoinInput2.txt' as (id,name);
      D = join A by (id,name), B by (id,name);
      store D into './testFRJoin.out';
      

      cat bin/SkewedJoinInput1.txt

      100	apple1	aaa
      200	orange1	bbb
      300	strawberry	ccc
      

      cat bin/SkewedJoinInput2.txt

      100	apple1
      100	apple2
      100	apple2
      200	orange1
      200	orange2
      300	strawberry
      400	pear
      

      The following exception is found in the log:

      [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
      java.lang.NullPointerException
              at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
              at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
              at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
              at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
              at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
              at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
              at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
              at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
              at scala.Option.getOrElse(Option.scala:120)
              at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
      

      org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but currently PigSplit inherits the method from InputSplit, and InputSplit#getLocationInfo returns null:

        @Evolving
        public SplitLocationInfo[] getLocationInfo() throws IOException {
          return null;
        }
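A minimal sketch of the fix direction: override getLocationInfo in PigSplit so Spark's convertSplitLocationInfo receives a real array instead of null. The classes below are simplified stand-ins for the Hadoop types (the real ones live in org.apache.hadoop.mapreduce and the real methods throw IOException); the names PigSplitSketch and hosts are illustrative, not the actual patch.

```java
// Stand-in for org.apache.hadoop.mapred.SplitLocationInfo (assumption:
// simplified, no IOException, to keep the sketch self-contained).
class SplitLocationInfo {
    private final String location;
    private final boolean inMemory;
    SplitLocationInfo(String location, boolean inMemory) {
        this.location = location;
        this.inMemory = inMemory;
    }
    String getLocation() { return location; }
    boolean isInMemory() { return inMemory; }
}

// Stand-in for org.apache.hadoop.mapreduce.InputSplit: the base
// getLocationInfo returns null, which is what triggers the NPE when
// Spark iterates over the result.
abstract class InputSplit {
    abstract String[] getLocations();
    SplitLocationInfo[] getLocationInfo() { return null; }
}

// Sketch of the PIG-4886 idea: derive location info from the split's
// own host list so the caller never sees null.
class PigSplitSketch extends InputSplit {
    private final String[] hosts;
    PigSplitSketch(String[] hosts) { this.hosts = hosts; }

    @Override
    String[] getLocations() { return hosts; }

    @Override
    SplitLocationInfo[] getLocationInfo() {
        String[] locs = getLocations();
        SplitLocationInfo[] info = new SplitLocationInfo[locs.length];
        for (int i = 0; i < locs.length; i++) {
            // Plain on-disk HDFS replicas, so inMemory is false.
            info[i] = new SplitLocationInfo(locs[i], false);
        }
        return info;
    }
}

public class Main {
    public static void main(String[] args) {
        PigSplitSketch split = new PigSplitSketch(new String[]{"host1", "host2"});
        // Spark-style iteration over the location info is now NPE-safe.
        for (SplitLocationInfo li : split.getLocationInfo()) {
            System.out.println(li.getLocation());
        }
    }
}
```

With the override in place, HadoopRDD.convertSplitLocationInfo can iterate the returned array safely, and getPreferredLocations recovers data locality instead of logging the NPE above.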
      

        Attachments

        1. PIG-4886.patch
          3 kB
          liyunzhang


              People

              • Assignee:
                liyunzhang (kellyzly)
              • Reporter:
                liyunzhang (kellyzly)
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: