Spark / SPARK-13909

DataFrames DISK_ONLY persistence leads to OOME


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: debian:jessie, java 1.8, hadoop 2.6.0, current zeppelin snapshot

    Description

      Hey, I migrated to 1.6.0, and suddenly `persist` behaves as if it were `MEMORY_ONLY` instead of `DISK_ONLY`, so it eventually ends with an OOME. However, if I remove the `persist` call it works fine. I'm calling this snippet from a Zeppelin notebook:

      import org.apache.spark.sql.Row
      import org.apache.spark.storage.StorageLevel

      val coreRdd = sc.textFile("s3n://gwiq-views-p/external/core/tsv/*.tsv").map(_.split("\t")).map( fields => Row(fields:_*) )
      val coreDataFrame = sqlContext.createDataFrame(coreRdd, schema)
      coreDataFrame.registerTempTable("core")
      coreDataFrame.persist(StorageLevel.DISK_ONLY)
      
      SELECT COUNT(*) FROM core
      
      ------ Create new SparkContext spark://master:7077 -------
      Exception in thread "pool-1-thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
      	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:66)
      	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:69)
      	at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
      	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:188)
      	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:146)
      	at com.google.gson.Gson.fromJson(Gson.java:791)
      	at com.google.gson.Gson.fromJson(Gson.java:757)
      	at com.google.gson.Gson.fromJson(Gson.java:706)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.convert(RemoteInterpreterServer.java:417)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.getProgress(RemoteInterpreterServer.java:384)
      	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1376)
      	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1361)
      	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
      	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
      	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
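
      A side note on the snippet above (unrelated to the OOME itself, but worth checking whenever rows come out shorter than the schema declares): `java.lang.String#split` with no limit argument silently drops trailing empty columns, so TSV lines whose last fields are empty yield `Row`s with fewer values than the schema expects. A minimal plain-Scala sketch, no Spark required:

```scala
// Plain Scala sketch, no Spark needed: String.split with no limit argument
// drops trailing empty columns from a TSV row.
object SplitDemo {
  def main(args: Array[String]): Unit = {
    val line = "a\tb\t\t" // TSV row ending in two empty columns

    // No limit: trailing empty strings are removed
    println(line.split("\t").toList)      // List(a, b)

    // Limit -1: every column is kept, including the empty ones
    println(line.split("\t", -1).toList)  // List(a, b, , )
  }
}
```

      If the source TSV can contain empty trailing fields, `_.split("\t", -1)` keeps the column count stable.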
      

      I'm using this https://github.com/gettyimages/docker-spark setup with a Zeppelin Docker container...

          ZEPPELIN_JAVA_OPTS: -Dspark.executor.memory=16g -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.app.id=zeppelin
          SPARK_SUBMIT_OPTIONS: --driver-memory 1g --repositories https://oss.sonatype.org/content/repositories/snapshots --packages com.viagraphs:spark-extensions_2.10:1.04-SNAPSHOT --jars=file:/usr/spark-1.6.0-bin-hadoop2.6/lib/aws-java-sdk-1.7.14.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/hadoop-aws-2.6.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/google-collections-1.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/joda-time-2.8.2.jar
          SPARK_WORKER_CORES: 8
          SPARK_WORKER_MEMORY: 16g
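
      For what it's worth, the stack trace above is thrown inside Zeppelin's RemoteInterpreterServer process rather than in a Spark executor, and SPARK_SUBMIT_OPTIONS only gives the driver 1g. One quick sanity check (an assumption, not a confirmed fix) would be to raise that value while leaving every other option unchanged:

```
# hypothetical tweak: only --driver-memory changed, all other options kept as above
SPARK_SUBMIT_OPTIONS: --driver-memory 4g --repositories https://oss.sonatype.org/content/repositories/snapshots --packages com.viagraphs:spark-extensions_2.10:1.04-SNAPSHOT --jars=file:/usr/spark-1.6.0-bin-hadoop2.6/lib/aws-java-sdk-1.7.14.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/hadoop-aws-2.6.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/google-collections-1.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/joda-time-2.8.2.jar
```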
      


          People

            Assignee: Unassigned
            Reporter: Jakub Liska (l154k)
            Votes: 0
            Watchers: 2
