Spark / SPARK-19430

Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.3, 2.0.2, 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      Spark throws an exception when reading external tables with VARCHAR columns if they are backed by ORC files written by Hive 1.2.1 (and possibly other versions of Hive).

      Steps to reproduce (credits to Cheng Lian):

      1. Write an ORC table using Hive 1.2.1 with
        CREATE TABLE orc_varchar_test STORED AS ORC
        AS SELECT CAST('a' AS VARCHAR(10)) AS c0
      2. Get the raw path of the written ORC file
      3. Create an external table pointing to this file and read the table using Spark
        val path = "/tmp/orc_varchar_test"
        sql(s"create external table if not exists test (c0 varchar(10)) stored as orc location '$path'")
        spark.table("test").show()

      The problem here is that the metadata in ORC files written by Hive differs from the metadata in ORC files written by Spark. We can inspect the ORC file written above:

      $ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
      Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
      File Version: 0.12 with HIVE_8732
      Rows: 1
      Compression: ZLIB
      Compression size: 262144
      Type: struct<_col0:varchar(10)>       <----
      ...
      

      On the other hand, if you create an ORC table with Spark using the same DDL and inspect the written ORC file, you'll see:

      ...
      Type: struct<c0:string>
      ...
      

      Note that all tests are done with spark.sql.hive.convertMetastoreOrc set to false, which is the default case.
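
      The setting mentioned above can be pinned and read back before running the reproduction. This is a minimal sketch, assuming a Spark 2.x spark-shell where `spark` (a SparkSession) is predefined; on 1.6.x the corresponding calls live on `sqlContext`. The "false" fallback passed to `get` is an assumption added only so the call does not throw if the key is unset.

```scala
// Intended for spark-shell on Spark 2.x, where `spark` is predefined.
val key = "spark.sql.hive.convertMetastoreOrc"

// Pin the flag to the value used for the tests in this report.
spark.conf.set(key, "false")

// Read it back; the second argument is a fallback, not a claim about the default.
println(key + " = " + spark.conf.get(key, "false"))
```

      This only records the configuration the failure was observed under; it is not a workaround.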

      I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of the following error:

      java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
          at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
          at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
          at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
          at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
          at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
          at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
          at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
          at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
          at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
          at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
          at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
          at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
          at org.apache.spark.scheduler.Task.run(Task.scala:99)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
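
      The top frames of the trace show a string object inspector receiving a HiveVarcharWritable and casting it to Text. The sketch below models only that shape of failure. `FakeText`, `FakeVarcharWritable`, `StringInspector` and `CastMismatchDemo` are hypothetical stand-ins invented for illustration; they are not the Hive or Spark classes, and the sketch makes no claim about why Spark ends up with a string inspector.

```scala
// Hypothetical stand-ins for org.apache.hadoop.io.Text and
// org.apache.hadoop.hive.serde2.io.HiveVarcharWritable (NOT the real classes).
final class FakeText(val value: String)
final class FakeVarcharWritable(val value: String, val maxLength: Int)

// Models an inspector that assumes every value it is handed is a FakeText,
// the way the string inspector in the trace assumes Text.
object StringInspector {
  def getPrimitive(o: Any): FakeText = o.asInstanceOf[FakeText]
}

object CastMismatchDemo {
  // Returns the exception message if the unchecked cast fails, None otherwise.
  def tryRead(o: Any): Option[String] =
    try {
      StringInspector.getPrimitive(o)
      None
    } catch {
      case e: ClassCastException => Some(e.getMessage)
    }

  def main(args: Array[String]): Unit = {
    // Value type matches what the inspector expects: the read succeeds.
    println(tryRead(new FakeText("a")))
    // Value type is the varchar wrapper: the cast fails at read time.
    println(tryRead(new FakeVarcharWritable("a", 10)))
  }
}
```

      The point of the model is that the mismatch is only detected when a row is read, which is consistent with the exception surfacing inside a task rather than at table creation.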
      

              People

              • Assignee: Unassigned
              • Reporter: Sameer Agarwal (sameerag)
              • Votes: 0
              • Watchers: 3
