Spark / SPARK-19623

Take rows from DataFrame with empty first partition


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.6.2, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      I use Spark 1.6.2 with 1 master and 6 workers. When the first partitions of a DataFrame are empty, the DataFrame and its underlying RDD behave differently when taking rows. Taking only 1000 rows from the DataFrame causes an OutOfMemoryError (OOME), while taking them from the RDD works fine.
      In detail:
      DataFrame without an empty first partition => OK
      DataFrame with an empty first partition => OOME
      RDD of DataFrame with an empty first partition => OK
      The code below reproduces this error.

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._

      val rdd = sc.parallelize(1 to 100000000, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
      val schema = StructType(for (i <- 1 to 100) yield {
        StructField("COL" + i, IntegerType, true)
      })
      // Empty out the first two partitions
      val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

      val df1 = sqlContext.createDataFrame(rdd, schema)
      df1.take(1000)     // OK
      val df2 = sqlContext.createDataFrame(rdd2, schema)
      df2.rdd.take(1000) // OK
      df2.take(1000)     // OOME

      I tested it on Spark 1.6.2 with 2 GB of driver memory and 5 GB of executor memory.
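
      For context (not part of the original report): RDD.take(n) scans partitions incrementally and grows the batch of partitions it scans when earlier ones come back empty, which is why an empty first partition changes how many partitions are touched. The takeFromPartitions function below is a hypothetical, simplified sketch of that scan loop, assuming a scale-up factor of 4 similar to the heuristic in RDD.take; it is not Spark source code.

      ```scala
      // Hypothetical sketch of an incremental partition scan in the style of
      // RDD.take(n). Assumption: when the partitions scanned so far yielded no
      // rows, the next batch of partitions is grown by a factor of 4.
      def takeFromPartitions[T](partitions: Seq[Seq[T]], n: Int): Seq[T] = {
        val buf = scala.collection.mutable.ArrayBuffer.empty[T]
        var scanned = 0
        while (buf.size < n && scanned < partitions.size) {
          // Start with one partition; if everything so far was empty, scan 4x
          // as many partitions as have already been scanned.
          val numToScan = if (scanned == 0) 1 else if (buf.isEmpty) scanned * 4 else 1
          val batch = partitions.slice(scanned, scanned + numToScan)
          batch.foreach(p => buf ++= p.take(n - buf.size))
          scanned += batch.size
        }
        buf.toSeq.take(n)
      }

      // An empty first partition forces a second, larger scan pass:
      // takeFromPartitions(Seq(Seq.empty, Seq(1, 2), Seq(3, 4, 5)), 3) == Seq(1, 2, 3)
      ```

      The sketch only illustrates the scan-growth behavior; it does not model the driver-side row buffering where the reported OOME would actually occur.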

    Attachments

    Activity
    People

      Assignee: Unassigned
      Reporter: Jaeboo Jung
      Votes: 0
      Watchers: 2

    Dates

      Created:
      Updated:
      Resolved: