Spark / SPARK-27107

Spark SQL Job failing because of Kryo buffer overflow with ORC


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0
    • Fix Version/s: 2.4.1, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      The issue occurs while reading ORC data, when Spark sets the SearchArgument on the ORC input format.

       Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9
      Serialization trace:
      literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
      leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
      	at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
      	at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
      	at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
      	at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
      	at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
      	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
      	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
      	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
      	at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
      	at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
      	at scala.Option.foreach(Option.scala:257)
      	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
      	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
      	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
      	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
      	... 52 more
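
      The literalList / leaves entries in the trace show that it is the serialized predicate literals that
      blow past the buffer. Below is a minimal reproduction sketch; the path, column name, and literal
      count are illustrative only, but any pushed-down filter whose Kryo-serialized SearchArgument
      exceeds roughly 100K bytes (for example a very large IN list) triggers the same failure with the
      native ORC reader.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.col

        val spark = SparkSession.builder().appName("orc-sarg-overflow-repro").getOrCreate()

        // A large IN list is translated into a SearchArgument leaf with a long literalList,
        // which the native ORC reader Kryo-serializes into the job configuration.
        val ids = (1L to 200000L).toSeq

        spark.read.orc("/path/to/data.orc")      // illustrative path
          .where(col("id").isin(ids: _*))        // illustrative column and values
          .count()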
      

      This happens only with the new Apache ORC-based implementation and doesn't happen with the Hive-based implementation.
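
      On affected versions the ORC reader implementation is selected by spark.sql.orc.impl, so switching
      back to the Hive-based reader is one way to sidestep the overflow. A workaround sketch (not part of
      the fix; the config key exists in Spark 2.3+ and accepts "native" or "hive"):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("orc-hive-reader-workaround")
          // Fall back to the Hive 1.2-based ORC reader, which serializes the
          // SearchArgument into a Kryo buffer that can grow up to 10M.
          .config("spark.sql.orc.impl", "hive")
          .getOrCreate()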

       

      Reason:

      The Hive implementation (1.2) sets the default buffer size to 4K and the max buffer size to 10M.

      https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L998

      The ORC implementation, on the other hand, sets the buffer size to a fixed 100K.
      https://github.com/apache/orc/blob/master/java/mapreduce/src/java/org/apache/orc/mapred/OrcInputFormat.java#L93
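
      Concretely, the difference comes down to how the Kryo Output buffer is constructed on each path.
      A sketch of the two constructions, shown in Scala for illustration (both referenced classes are
      Java; the constructor arguments are the ones linked above):

        import com.esotericsoftware.kryo.io.Output

        // Hive 1.2, SearchArgumentImpl: the buffer starts at 4K and may grow
        // up to 10M, so even a large literalList fits.
        val hiveOut = new Output(4 * 1024, 10 * 1024 * 1024)

        // ORC, OrcInputFormat.setSearchArgument: a single-argument Output is
        // capped at its initial size, so anything past 100K bytes fails with
        // the "Buffer overflow. Available: 0" error shown above.
        val orcOut = new Output(100000)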
       

      We need to fix this in the ORC library and update the ORC version in Spark to resolve the issue.
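
      One plausible direction for the ORC-side fix is simply to let the buffer grow (as Hive 1.2 does)
      rather than capping it at 100K, though the actual patch lives in the ORC project and may differ.
      Until Spark picks up a patched ORC release, disabling ORC filter pushdown also avoids the problem,
      because the SearchArgument is only built and Kryo-serialized when pushdown is enabled. A workaround
      sketch (spark.sql.orc.filterPushdown is an existing Spark SQL config; this is not part of the fix):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .appName("orc-pushdown-workaround")
          // Workaround only: skip pushing filters into ORC, so no SearchArgument is serialized.
          // Filters are still applied by Spark after the scan, at the cost of reading more data.
          .config("spark.sql.orc.filterPushdown", "false")
          .getOrCreate()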

       


          People

            Assignee: Dongjoon Hyun
            Reporter: Dhruve Ashar
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
