  1. Spark
  2. SPARK-40253

Data read exception in orc format

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: CentOS 7, Spark 2.4.3, Hive 1.2.1, Hadoop 2.7.2
    • Flags: Patch, Important

    Description

      Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0

      When running batch jobs with spark-sql using the create table xxx as select syntax, the select part uses a constant as a column's default value (0.00 as column_name) without specifying the data type of that value. Because the data type is not explicitly specified, the metadata for this field is missing from the written ORC file, even though the write itself succeeds. On read, any query whose columns include this field fails to parse the ORC file with the following error:

       

      create table testgg as select 0.00 as gg;
      select * from testgg;

      Caused by: java.io.IOException: Error reading file: viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-00000-e7df51a1-98b9-4472-9899-3c132b97885b-c000
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)
          at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
          at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
          at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
          at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
          at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
          at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
          at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
          at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
          at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
          at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
          at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
          at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
          at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
          at org.apache.spark.scheduler.Task.run(Task.scala:121)
          at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
          at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0
          at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
          at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
          at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
          at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
          at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1279)
          at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2012)
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1284)
          ... 25 more
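
Since the report attributes the failure to the constant's data type not being specified, one way to test that diagnosis is to rerun the repro with an explicit cast. The following is only a sketch and has not been verified against Spark 2.4.3: the table name testgg_typed and the type DECIMAL(10, 2) are illustrative assumptions that do not come from the report, and the cast is not confirmed to avoid the error.

```sql
-- Hypothetical variant of the reporter's repro. The table name testgg_typed
-- and the type DECIMAL(10, 2) are assumptions chosen for illustration.
CREATE TABLE testgg_typed AS
SELECT CAST(0.00 AS DECIMAL(10, 2)) AS gg;

-- Show the column type recorded for each table, to compare the untyped
-- constant (testgg, from the original repro) with the explicitly typed one.
DESCRIBE testgg;
DESCRIBE testgg_typed;

-- Read the typed table back; the original failure appeared at this step.
SELECT * FROM testgg_typed;
```

If the typed table reads back cleanly while testgg still fails, that would support the reporter's explanation; if both fail, the cause likely lies elsewhere in the ORC write path.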
       

       

       


          People

            Assignee: Unassigned
            Reporter: yihangqiao (yo8237233)
            Votes: 0
            Watchers: 2

              Time Tracking

                Original Estimate: 168h
                Remaining Estimate: 168h
                Time Spent: Not Specified