[SPARK-19430] Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.3, 2.0.2, 2.1.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Spark throws an exception when reading external tables with VARCHAR columns if the tables are backed by ORC files written by Hive 1.2.1 (and possibly other versions of Hive).

Steps to reproduce (credit to Lian Cheng):

Write an ORC table using Hive 1.2.1 with

CREATE TABLE orc_varchar_test STORED AS ORC
AS SELECT CAST('a' AS VARCHAR(10)) AS c0

Get the raw path of the written ORC file

Create an external table pointing to this file and read the table using Spark

val path = "/tmp/orc_varchar_test"
sql(s"create external table if not exists test (c0 varchar(10)) stored as orc location '$path'")
spark.table("test").show()

The problem is that the metadata in an ORC file written by Hive differs from the metadata in one written by Spark. We can inspect the ORC file written above:

$ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:varchar(10)>       <----
...

On the other hand, if you create an ORC table in Spark using the same DDL and inspect the written ORC file, you'll see:

...
Type: struct<c0:string>
...
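
The two `Type:` lines differ in both the column name (`_col0` vs `c0`) and the column type (`varchar(10)` vs `string`). As a rough illustration, the comparison can be sketched in plain Scala. `OrcTypeDiff`, `parseStruct` and `diff` are hypothetical helper names introduced here; they are not part of Spark, Hive or ORC, and the parser only handles flat struct strings like the ones shown in the dumps.

```scala
// Hypothetical helper, not part of Spark, Hive or ORC.
object OrcTypeDiff {
  // Parses a flat ORC struct type string such as "struct<c0:string>"
  // into (fieldName, typeName) pairs.
  def parseStruct(typeString: String): Vector[(String, String)] = {
    val prefix = "struct<"
    require(typeString.startsWith(prefix) && typeString.endsWith(">"),
      s"not a struct type: $typeString")
    val body = typeString.substring(prefix.length, typeString.length - 1)

    // Split on commas that are not nested inside (...) or <...>,
    // so that a type like decimal(10,2) stays in one piece.
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var depth = 0
    body.foreach {
      case c @ ('(' | '<') => depth += 1; current += c
      case c @ (')' | '>') => depth -= 1; current += c
      case ',' if depth == 0 => fields += current.toString; current.clear()
      case c => current += c
    }
    if (current.nonEmpty) fields += current.toString

    fields.result().map { field =>
      val idx = field.indexOf(':')
      require(idx > 0, s"malformed field: $field")
      (field.substring(0, idx).trim, field.substring(idx + 1).trim)
    }
  }

  // Pairs fields by position and describes every name or type difference.
  def diff(hiveType: String, sparkType: String): Vector[String] =
    parseStruct(hiveType).zip(parseStruct(sparkType)).zipWithIndex.collect {
      case (((hName, hType), (sName, sType)), i)
          if hName != sName || hType != sType =>
        s"column $i: Hive wrote $hName:$hType, Spark wrote $sName:$sType"
    }
}
```

Calling `OrcTypeDiff.diff("struct<_col0:varchar(10)>", "struct<c0:string>")` on the two dumps above reports a single difference for column 0, covering both the name and the type.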

Note that all tests were run with spark.sql.hive.convertMetastoreOrc set to false, which is the default.

I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with the following error:

java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
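
The top frames of the trace suggest that a string object inspector received a `HiveVarcharWritable` and cast it unconditionally to `Text`. The sketch below models only that type relationship in plain Scala. Every class and object in it is a simplified, hypothetical stand-in that borrows names from the trace for readability; none of it is the real Hadoop, Hive or Spark code.

```scala
import scala.util.Try

// Simplified, hypothetical stand-ins for the classes named in the stack trace.
// They model only the type relationship, not the real Hadoop/Hive behaviour.
sealed trait Writable
final case class Text(value: String) extends Writable
final case class HiveVarcharWritable(value: String, maxLength: Int) extends Writable

object StringInspectorSketch {
  // Models an inspector that assumes the underlying value is always Text
  // and casts without checking the runtime type.
  def getPrimitiveWritableObject(o: Any): Text = o.asInstanceOf[Text]
}

object CastDemo {
  // What a string-typed reader expects to find in the file.
  val expected: Writable = Text("a")
  // What the Hive-written file yields for a varchar(10) column.
  val actual: Writable = HiveVarcharWritable("a", 10)

  // Succeeds: the value really is a Text.
  val ok = Try(StringInspectorSketch.getPrimitiveWritableObject(expected))
  // Fails with ClassCastException, mirroring the first frame of the trace.
  val failed = Try(StringInspectorSketch.getPrimitiveWritableObject(actual))
}
```

In this model the failure depends only on the runtime class of the value, which is consistent with the error appearing for Hive-written files whose metadata declares `varchar(10)`.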

Issue Links

blocks: SPARK-20901 Feature parity for ORC with Parquet (Open)
duplicates: SPARK-19459 ORC tables cannot be read when they contain char/varchar columns (Resolved)

People

Assignee: Unassigned
Reporter: Sameer Agarwal
Votes: 0
Watchers: 3

Dates

Created: 01/Feb/17 20:40
Updated: 09/Oct/17 20:14
Resolved: 09/Oct/17 20:14