  Pig / PIG-4059 Pig on Spark / PIG-4886

Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: spark-branch
    • Component/s: spark

    Description

      Use the branch code (commit 119f313) to test the following Pig script in Spark mode:

      A = load './SkewedJoinInput1.txt' as (id,name,n);
      B = load './SkewedJoinInput2.txt' as (id,name);
      D = join A by (id,name), B by (id,name);
      store D into './testFRJoin.out';
      

      cat bin/SkewedJoinInput1.txt

      100	apple1	aaa
      200	orange1	bbb
      300	strawberry	ccc
      

      cat bin/SkewedJoinInput2.txt

      100	apple1
      100	apple2
      100	apple2
      200	orange1
      200	orange2
      300	strawberry
      400	pear
      

      The following exception is found in the log:

      [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
      java.lang.NullPointerException
              at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
              at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
              at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
              at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
              at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
              at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
              at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
              at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
              at scala.Option.getOrElse(Option.scala:120)
              at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
              at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
              at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
      

      org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations calls PigSplit#getLocationInfo, but PigSplit currently extends InputSplit without overriding that method, and the base InputSplit#getLocationInfo returns null:

        @Evolving
        public SplitLocationInfo[] getLocationInfo() throws IOException {
          return null;
        }
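
The stack trace shows the crash inside HadoopRDD.convertSplitLocationInfo, which iterates the returned array without a null check. The following is a minimal, self-contained Java sketch of that failure mode and of why an override returning an empty array (rather than null) avoids the NPE. Plain String arrays stand in for SplitLocationInfo, and the method names here are illustrative only, not Pig's or Spark's actual API:

```java
public class LocationInfoDemo {
    // Mimics the default InputSplit#getLocationInfo, which returns null.
    static String[] defaultLocationInfo() {
        return null;
    }

    // Mimics a hypothetical PigSplit override returning an empty array.
    static String[] overriddenLocationInfo() {
        return new String[0];
    }

    // Mimics Spark iterating the location-info array without a null
    // check -- the source of the NullPointerException in the log above.
    static int countLocations(String[] info) {
        int n = 0;
        for (String host : info) { // throws NPE when info is null
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        try {
            countLocations(defaultLocationInfo());
            System.out.println("no NPE");
        } catch (NullPointerException e) {
            System.out.println("NPE on null location info");
        }
        System.out.println("locations=" + countLocations(overriddenLocationInfo()));
    }
}
```

With the null default the iteration throws; with the empty-array override it simply reports zero locations, which Spark treats as "no locality preference" instead of crashing.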
      

      Attachments

        1. PIG-4886.patch (3 kB, liyunzhang)

        People

          Assignee: kellyzly (liyunzhang)
          Reporter: kellyzly (liyunzhang)
          Votes: 0
          Watchers: 3
