[SPARK-40253] Data read exception in orc format - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 2.4.3
Fix Version/s: None
Component/s: SQL
Labels:
None
Environment:

os centos7

spark 2.4.3

hive 1.2.1

hadoop 2.7.2

Flags:

Patch, Important

Description

Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0

When running batches using spark-sql and using the create table xxx as select syntax, the select query part uses a static value as the default value (0.00 as column_name) and does not specify the data type of the default value. In this usage scenario, because the data type is not explicitly specified, the metadata information of the field in the written ORC file is missing (the writing is successful), but when reading, as long as the query column contains this field, it will not be able to Parsing the ORC file, the following error occurs：

create table testgg as select 0.00 as gg;select * from testgg;Caused by: java.io.IOException: Error reading file: viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-00000-e7df51a1-98b9-4472-9899-3c132b97885b-c000       at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)       at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)       at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)       at org.apache.spark.scheduler.Task.run(Task.scala:121)       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)       at java.lang.Thread.run(Thread.java:748)Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0       at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)       at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)       at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)       at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)       at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1279)       at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2012)       at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1284)       ... 25 more

Attachments

Issue Links

Blocked

SPARK-26437 Decimal data becomes bigint to query, unable to query

Resolved

links to

[Github] Pull Request #37726 (SelfImpr001)

[Github] Pull Request #37732 (SelfImpr001)

Data read exception in orc format

Details

Description

Attachments

Issue Links

Activity

People

Dates

Time Tracking