  1. Spark
  2. SPARK-5314

java.lang.OutOfMemoryError in SparkSQL with GROUP BY

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      I am running a SparkSQL GROUP BY query on a largish Parquet table (a few hundred million rows), weighing in at about 50 GB. My cluster has 1.7 TB of RAM, so it should have more than enough resources to cope with this query. Instead, tasks fail with the following error:

      WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
      at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
      at scala.collection.AbstractSeq.distinct(Seq.scala:40)
      at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33)
      at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33)
      at org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37)
      at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100)
      at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101)
      at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
      at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81)
      at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)
      at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
      at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
      at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
      at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      at org.apache.spark.scheduler.Task.run(Task.scala:56)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

          People

            Assignee: marmbrus Michael Armbrust
            Reporter: alexbaretta Alex Baretta