SPARK-27107: Spark SQL job fails because of Kryo buffer overflow with ORC


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels: None

      Description

      The issue occurs while reading ORC data, when Spark sets the SearchArgument (the pushed-down filter) on the ORC input format.

       Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9
      Serialization trace:
      literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
      leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
      	at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
      	at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
      	at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
      	at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
      	at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
      	at scala.Option.foreach(Option.scala:257)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
      	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
      	... 52 more
      

      This happens only with the new Apache ORC-based implementation and does not happen with the Hive-based implementation.
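
      A minimal reproduction is a query whose pushed-down filter carries a very large literal list. The sketch below is hypothetical (the path, the column name, and the size of the IN list are assumptions), but any predicate whose Kryo-serialized SearchArgument exceeds 100K should hit the overflow with the native reader:

       import org.apache.spark.sql.SparkSession

       object SargOverflowRepro {
         def main(args: Array[String]): Unit = {
           val spark = SparkSession.builder()
             .appName("sarg-overflow-repro")
             .getOrCreate()

           // Ensure predicate pushdown is enabled, so Spark builds a
           // SearchArgument and Kryo-serializes it into the Hadoop conf.
           spark.conf.set("spark.sql.orc.filterPushdown", "true")

           val df = spark.read.orc("/path/to/orc/table") // hypothetical path

           // A huge IN list becomes the literalList of a PredicateLeaf (see
           // the serialization trace above); enough literals push the
           // serialized SearchArgument past the 100K buffer.
           val inList: Seq[Long] = 1L to 100000L
           df.filter(df("id").isin(inList: _*)).count() // "id" is an assumed column

           spark.stop()
         }
       }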

       

      Reason:

      The Hive 1.2 implementation sets the default Kryo buffer size to 4K and the maximum buffer size to 10M:

      https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L998

      The ORC implementation, on the other hand, creates the Kryo Output with a fixed 100K buffer and no separate maximum, so a sufficiently large SearchArgument overflows it:
      https://github.com/apache/orc/blob/master/java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java#L93
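
      The difference matters because of how Kryo sizes its Output buffer: the single-argument Output constructor fixes the maximum capacity at the initial size, so the buffer can never grow, while the two-argument form allows growth up to a limit (-1 for unlimited). A standalone sketch of just that Kryo behavior (not Spark or ORC code; the payload size is an arbitrary assumption chosen to exceed 100K):

       import com.esotericsoftware.kryo.{Kryo, KryoException}
       import com.esotericsoftware.kryo.io.Output

       object KryoBufferDemo {
         def main(args: Array[String]): Unit = {
           val kryo = new Kryo()
           // 20,000 random longs serialize to well over 100,000 bytes.
           val payload = Array.fill(20000)(scala.util.Random.nextLong())

           // new Output(n): maximum capacity == initial capacity, so the
           // buffer cannot grow -- this mirrors OrcInputFormat's usage.
           val fixed = new Output(100000)
           try kryo.writeObject(fixed, payload)
           catch { case e: KryoException => println(s"fixed buffer: ${e.getMessage}") }

           // new Output(n, -1): same initial size, but the buffer may grow
           // without bound, so the same payload serializes fine.
           val growable = new Output(100000, -1)
           kryo.writeObject(growable, payload)
           println(s"growable buffer wrote ${growable.position()} bytes")
         }
       }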
       

      This needs to be fixed in the ORC library, and the ORC version in Spark updated, to resolve the issue.
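
      Until a Spark release picks up the patched ORC version, two session-level workarounds are possible (both configurations exist in the affected releases; this is a sketch, not the fix itself):

       import org.apache.spark.sql.SparkSession

       object OrcSargWorkarounds {
         def main(args: Array[String]): Unit = {
           val spark = SparkSession.builder()
             .appName("orc-sarg-workarounds")
             .getOrCreate()

           // Option 1: fall back to the Hive 1.2 based reader, which uses
           // the growable 4K/10M Kryo buffer described above.
           spark.conf.set("spark.sql.orc.impl", "hive")

           // Option 2: keep the native reader but disable filter pushdown,
           // so no SearchArgument is built or serialized at all (at the
           // cost of scanning more data).
           spark.conf.set("spark.sql.orc.filterPushdown", "false")

           spark.stop()
         }
       }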

       

              People

              • Assignee: Dongjoon Hyun
              • Reporter: Dhruve Ashar
              • Votes: 0
              • Watchers: 5
