ORC-556

ConvertTreeReader can incorrectly be applied on columns of the same primitive type

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1
    • Fix Version/s: 1.6.3, 1.7.0
    • Component/s: None
    • Labels: None

    Description

      I'm seeing the following exception when reading old ORC data with Iceberg:

      0.0 in stage 0.0 (TID 0, executor 1): java.lang.IllegalArgumentException: No conversion of type INT to self needed
      	at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1659)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2112)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2327)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:230)
      	at org.apache.iceberg.shaded.org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:741)
      	at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:87)
      	at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:72)
      	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:470)
      	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:422)
      	at org.apache.iceberg.spark.source.Reader$TaskDataReader.<init>(Reader.java:356)
      	at org.apache.iceberg.spark.source.Reader$ReadTask.createPartitionReader(Reader.java:305)
      	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
      

      I think the problem lies in the following snippet in the method org.apache.orc.impl.TreeReaderFactory#createTreeReader:

      if (!fileType.equals(readerType) &&
          ... // elided)) {
            ...
      }
      

      This condition compares the two TypeDescription objects with equals. That comparison can now return false even when both columns have the same primitive type, for at least two reasons:

      1. The reader schema has annotations (properties) and the old file schema does not.
      2. The reader schema's field names differ from the file schema's in letter case. I suspect this is because the old data was written by Hive.
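
      The two causes above can be illustrated with a small, self-contained sketch. It does not use ORC: SimpleType, Category, sameCategory, and the attribute key "example.key" are hypothetical stand-ins, and the strict equals is only assumed to behave like the comparison described in this report.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class TypeEqualsDemo {
    enum Category { INT, STRUCT }

    // Hypothetical, simplified stand-in for a schema node; NOT ORC's TypeDescription.
    // Struct children are omitted to keep the sketch short.
    static final class SimpleType {
        final Category category;
        final List<String> fieldNames;         // non-empty only for STRUCT
        final Map<String, String> attributes;  // annotations/properties on the type

        SimpleType(Category category, List<String> fieldNames, Map<String, String> attributes) {
            this.category = category;
            this.fieldNames = fieldNames;
            this.attributes = attributes;
        }

        // Strict equality: category, case-sensitive field names, and attributes must all match.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SimpleType)) return false;
            SimpleType other = (SimpleType) o;
            return category == other.category
                && fieldNames.equals(other.fieldNames)
                && attributes.equals(other.attributes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(category, fieldNames, attributes);
        }
    }

    // Looser check: only the category is compared.
    static boolean sameCategory(SimpleType a, SimpleType b) {
        return a.category == b.category;
    }

    public static void main(String[] args) {
        // Cause 1: the reader type carries an attribute, the file type does not.
        SimpleType fileInt = new SimpleType(Category.INT, List.of(), Map.of());
        SimpleType readerInt = new SimpleType(Category.INT, List.of(), Map.of("example.key", "1"));

        // Cause 2: field names differ only in letter case.
        SimpleType fileStruct = new SimpleType(Category.STRUCT, List.of("id"), Map.of());
        SimpleType readerStruct = new SimpleType(Category.STRUCT, List.of("Id"), Map.of());

        System.out.println("attributes differ: equals=" + fileInt.equals(readerInt)
            + ", sameCategory=" + sameCategory(fileInt, readerInt));
        System.out.println("name case differs: equals=" + fileStruct.equals(readerStruct)
            + ", sameCategory=" + sameCategory(fileStruct, readerStruct));
    }
}
```

      In both cases the strict comparison reports a mismatch although the category is unchanged, which is the situation in which a conversion reader would be requested for a type that needs no conversion.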

      At least the first cause can be fixed if we change:

      fileType.equals(readerType) => fileType.getCategory().equals(readerType.getCategory()) 
      

      I'm currently unsure of the repercussions of this change, so I haven't made it myself.
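
      One possible repercussion, offered only as an assumption and not as a finding from this issue: a category-only check also ignores type parameters such as decimal precision and scale. The sketch below uses a hypothetical PrimitiveType class (not ORC's TypeDescription) to contrast the category-only check with a hypothetical middle ground, sameIgnoringAttributes, that ignores attributes but still compares parameters.

```java
import java.util.Map;

public class CategoryCheckDemo {
    enum Category { INT, DECIMAL }

    // Hypothetical stand-in for a primitive schema type; NOT ORC's TypeDescription.
    static final class PrimitiveType {
        final Category category;
        final int precision;                   // only meaningful for DECIMAL in this sketch
        final int scale;                       // only meaningful for DECIMAL in this sketch
        final Map<String, String> attributes;  // annotations/properties

        PrimitiveType(Category category, int precision, int scale, Map<String, String> attributes) {
            this.category = category;
            this.precision = precision;
            this.scale = scale;
            this.attributes = attributes;
        }
    }

    // The check proposed in the report: compare categories only.
    static boolean sameCategory(PrimitiveType a, PrimitiveType b) {
        return a.category == b.category;
    }

    // Hypothetical alternative: ignore attributes, but still compare type parameters.
    static boolean sameIgnoringAttributes(PrimitiveType a, PrimitiveType b) {
        return a.category == b.category
            && a.precision == b.precision
            && a.scale == b.scale;
    }

    public static void main(String[] args) {
        PrimitiveType fileDecimal = new PrimitiveType(Category.DECIMAL, 10, 2, Map.of());
        PrimitiveType readerDecimal = new PrimitiveType(Category.DECIMAL, 20, 4, Map.of("example.key", "1"));

        // Category-only treats these as identical even though precision and scale differ.
        System.out.println("sameCategory=" + sameCategory(fileDecimal, readerDecimal));
        // The parameter-aware check still reports a difference.
        System.out.println("sameIgnoringAttributes=" + sameIgnoringAttributes(fileDecimal, readerDecimal));
    }
}
```

      Whether such a difference matters for the reader depends on how the surrounding elided condition is written, so this is a question to check rather than a confirmed problem.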

            People

              Assignee: Shardul Mahadik (shardulm)
              Reporter: Ratandeep Ratti (rdsr)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m