Description
I'm seeing the following exception when reading old ORC data with Iceberg:

0.0 in stage 0.0 (TID 0, executor 1): java.lang.IllegalArgumentException: No conversion of type INT to self needed
    at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1659)
    at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2112)
    at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2327)
    at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
    at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
    at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
    at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
    at org.apache.iceberg.shaded.org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:230)
    at org.apache.iceberg.shaded.org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:741)
    at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:87)
    at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:72)
    at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:470)
    at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:422)
    at org.apache.iceberg.spark.source.Reader$TaskDataReader.<init>(Reader.java:356)
    at org.apache.iceberg.spark.source.Reader$ReadTask.createPartitionReader(Reader.java:305)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
I think the problem lies in the following snippet in the method org.apache.orc.impl.TreeReaderFactory#createTreeReader:
if (!fileType.equals(readerType) && ... // elided)) { ... }
We are doing an equals comparison on the TypeDescription class. This equals comparison can now fail for at least two reasons:
- The reader schema has annotations (properties) while the old file schema does not.
- A reader schema field name differs from the corresponding file schema field name only in case. I suspect this is because the old data was written by Hive.
At least the first issue can be fixed if we change
fileType.equals(readerType) => fileType.getCategory().equals(readerType.getCategory())
I'm currently unsure of the repercussions of this change, so I haven't made it myself.
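To illustrate the first failure mode, here is a minimal, self-contained sketch. TypeDesc is a hypothetical stand-in for org.apache.orc.TypeDescription (the real class carries much more state), and the "iceberg.id" attribute key is only an example of a reader-side annotation; the point is that full structural equals diverges from category-only comparison when one side carries attributes the other lacks:

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

public class Main {
    // Hypothetical stand-in for ORC's TypeDescription, reduced to a
    // category plus schema annotations (attributes/properties).
    static class TypeDesc {
        enum Category { INT, LONG, STRUCT }

        final Category category;
        final Map<String, String> attributes = new TreeMap<>();

        TypeDesc(Category category) {
            this.category = category;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TypeDesc)) {
                return false;
            }
            TypeDesc other = (TypeDesc) o;
            // Full structural equality: category AND attributes must match.
            return category == other.category && attributes.equals(other.attributes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(category, attributes);
        }
    }

    public static void main(String[] args) {
        TypeDesc fileType = new TypeDesc(TypeDesc.Category.INT);   // old file: no annotations
        TypeDesc readerType = new TypeDesc(TypeDesc.Category.INT); // same category...
        readerType.attributes.put("iceberg.id", "1");              // ...but annotated (example key)

        // Strict equals fails, so the factory would (wrongly) try to build
        // an INT -> INT ConvertTreeReader and throw.
        System.out.println("equals: " + fileType.equals(readerType));
        // Category comparison succeeds, so no conversion would be attempted.
        System.out.println("category: " + (fileType.category == readerType.category));
    }
}
```

Running this prints `equals: false` but `category: true`, which is the divergence behind the reported exception. Note that category-only comparison alone would not address the case-sensitivity of field names, since field names live on the enclosing struct rather than in the category.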