Spark / SPARK-16628

OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When spark.sql.hive.convertMetastoreOrc is enabled, we convert an ORC table represented by a MetastoreRelation to a HadoopFsRelation that uses Spark's OrcFileFormat internally. This conversion aims to speed up table scanning, since the runtime code path for scanning a HadoopFsRelation performs better. However, OrcFileFormat's implementation assumes that ORC files store their schema with correct column names, and before Hive 2.0, an ORC table created by Hive does not store column names correctly in the ORC files (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the code path.
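To illustrate the mismatch (a hypothetical sketch; the table name and columns are made up, but the placeholder-name behavior is the one HIVE-4243 describes): a pre-2.0 Hive writes positional placeholder names into the ORC file footer instead of the metastore column names, so Spark's OrcFileFormat cannot resolve columns by name.

```sql
-- Table created and populated by Hive 1.x or 0.x
CREATE TABLE t (id INT, name STRING) STORED AS ORC;
INSERT INTO t VALUES (1, 'a');

-- The ORC file footer records the physical schema as
--   struct<_col0:int,_col1:string>
-- rather than
--   struct<id:int,name:string>
-- so name-based column resolution against the metastore schema fails.
```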

      Right now, if ORC tables are created by Hive 1.x or 0.x, enabling spark.sql.hive.convertMetastoreOrc introduces a runtime exception for non-partitioned ORC tables and drops the metastore schema for partitioned ORC tables.
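Until the conversion handles this case, one session-level workaround (a suggestion based on the behavior described above, not part of this issue's fix) is to disable the conversion so Spark falls back to Hive's ORC SerDe and honors the metastore schema:

```sql
-- Disable the MetastoreRelation -> HadoopFsRelation conversion for ORC tables;
-- Spark then reads through Hive's ORC SerDe using the metastore schema.
SET spark.sql.hive.convertMetastoreOrc=false;
```

The trade-off is losing the faster HadoopFsRelation scan path for all ORC tables in the session.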


            People

             • Assignee:
               Dongjoon Hyun (dongjoon)
             • Reporter:
               Yin Huai (yhuai)
