Spark / SPARK-8707

RDD#toDebugString fails if any cached RDD has invalid partitions


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.4.1
    • Fix Version/s: 1.6.0
    • Component/s: Spark Core

    Description

      Repro:

      sc.textFile("/ThisFileDoesNotExist").cache()
      sc.parallelize(0 until 100).toDebugString
      

      Output:

      java.io.IOException: Not a file: /ThisFileDoesNotExist
      	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
      	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
      	at scala.Option.getOrElse(Option.scala:120)
      	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
      	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
      	at scala.Option.getOrElse(Option.scala:120)
      	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
      	at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
      	at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
      	at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
      	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
      	at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
      	at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
      	at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
      	at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637)
      

      This happens because toDebugString calls SparkContext#getRDDStorageInfo, which computes the partitions of every cached RDD; if any cached RDD is invalid (here, a HadoopRDD over a nonexistent path), that computation throws and the whole call fails. This pathway should definitely be resilient to other RDDs being invalid, and getRDDStorageInfo itself probably should be as well.
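      The resilient behavior described above can be sketched in plain Scala without Spark. Here `CachedRdd`, `computePartitions`, and `storageSummary` are hypothetical stand-ins for `RDD`, `RDD#partitions`, and `SparkContext#getRDDStorageInfo`, not the actual Spark API: the idea is simply to wrap each RDD's partition computation in `Try` so one invalid RDD no longer aborts the whole traversal.

      ```scala
      import scala.util.{Try, Success, Failure}

      // Hypothetical stand-in for a cached RDD whose partition computation may
      // throw, as HadoopRDD#getPartitions does for a nonexistent path.
      final case class CachedRdd(id: Int, computePartitions: () => Int)

      // Resilient analogue of the getRDDStorageInfo traversal: wrap each RDD's
      // partition computation in Try, reporting invalid RDDs instead of
      // propagating their exception to the caller.
      def storageSummary(rdds: Seq[CachedRdd]): Seq[String] =
        rdds.map { rdd =>
          Try(rdd.computePartitions()) match {
            case Success(n) => s"RDD ${rdd.id}: $n partitions"
            case Failure(e) => s"RDD ${rdd.id}: <invalid> (${e.getMessage})"
          }
        }

      val good = CachedRdd(0, () => 4)
      val bad  = CachedRdd(1, () =>
        throw new java.io.IOException("Not a file: /ThisFileDoesNotExist"))

      // With this guard, the valid RDD is still summarized and the invalid one
      // is merely flagged, mirroring the fix the description asks for.
      storageSummary(Seq(good, bad)).foreach(println)
      ```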


          People

            Assignee: Navis Ryu (navis)
            Reporter: Aaron Davidson (ilikerps)
            Votes: 1
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: