SPARK-27107

Spark SQL Job failing because of Kryo buffer overflow with ORC


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      The issue occurs while reading ORC data, when the pushed-down filter (the SearchArgument) is serialized with Kryo into the Hadoop configuration.

       Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9
      Serialization trace:
      literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
      leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
      	at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
      	at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
      	at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
      	at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
      	at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
      	at scala.Option.foreach(Option.scala:257)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
      	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
      	... 52 more
      

      This happens only with the new Apache ORC-based implementation (spark.sql.orc.impl=native) and does not happen with the Hive-based implementation (spark.sql.orc.impl=hive).
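
      A minimal reproduction sketch, assuming the native ORC reader with filter pushdown enabled; the path, column name, and list size below are placeholders, not from the original report:

       import org.apache.spark.sql.SparkSession

       val spark = SparkSession.builder().appName("sarg-overflow-repro").getOrCreate()

       // A very large IN list is pushed down as a single PredicateLeaf whose
       // literalList alone can exceed 100K once Kryo-serialized (matching the
       // "literalList" entry in the serialization trace above).
       val bigInList = (1L to 200000L).mkString(", ")
       spark.read.orc("/tmp/some_orc_table")
         .where(s"id IN ($bigInList)")
         .count() // fails in OrcInputFormat.setSearchArgument on affected versions

      Until a fixed ORC version is picked up, possible workarounds are disabling pushdown (spark.sql.orc.filterPushdown=false) or falling back to the Hive reader (spark.sql.orc.impl=hive).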

       

      Reason:

      The Hive implementation (1.2) sets the default buffer size to 4K and the max buffer size to 10M, so the buffer can grow as needed.

      https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L998

      The ORC implementation, on the other hand, allocates a fixed 100K buffer with no separate max size, so the buffer cannot grow and Kryo overflows once the serialized SearchArgument exceeds 100K.
      https://github.com/apache/orc/blob/master/java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java#L93
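
      In Kryo terms, this difference is exactly the failure mode above (a sketch, assuming Kryo's standard Output constructors): with two arguments the buffer grows on demand up to the max, while the one-argument constructor fixes maxBufferSize to the initial size.

       import com.esotericsoftware.kryo.io.Output

       // Hive 1.2 (SearchArgumentImpl): growable buffer, 4K initial, 10M max.
       val hiveOut = new Output(4 * 1024, 10 * 1024 * 1024)

       // ORC (OrcInputFormat.setSearchArgument): the one-argument constructor
       // sets maxBufferSize == bufferSize, so the 100K buffer can never grow
       // and Kryo throws "Buffer overflow" once the SearchArgument outgrows it.
       val orcOut = new Output(100000)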
       

      We need to fix this in the ORC library and then upgrade the ORC version in Spark to resolve the issue.
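
      A minimal sketch of the direction such a fix could take (illustrative only, not the actual ORC patch): keep the 100K initial allocation but give the Output a max size so it can grow on demand.

       // Inside OrcInputFormat.setSearchArgument (hypothetical):
       val out = new Output(100 * 1024, 10 * 1024 * 1024) // 100K initial, 10M max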

       


            People

              Assignee: Dongjoon Hyun
              Reporter: Dhruve Ashar
