SPARK-14387

Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      On the master branch, I tried to run TPC-DS queries (e.g. Query 27) at 200 GB scale. Initially I got the following exception (FileScanRDD has been made the default on the master branch):

      16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0..... java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
      at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
      at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
      at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
      at scala.collection.AbstractMap.getOrElse(Map.scala:59)
      at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
      at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
      at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
      at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
      at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
      at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
      at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
      at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
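
      For reference, a minimal way to hit this (a sketch, not verified; the table and column come from the TPC-DS schema, and the database name appears in the file path quoted later in this report):

      // Hypothetical reproduction: a column-pruning query against a
      // Hive-1.x-written ORC table resolves logical column names against a
      // physical file schema that only contains _col0, _col1, ...
      spark.sql("SELECT s_store_sk FROM tpcds_bin_partitioned_orc_200.store LIMIT 1").show()
      // => java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.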
      

      When running with "spark.sql.sources.fileScan=false", the following exception is thrown:

      16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
      java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
              at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
              at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
              at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
              at scala.collection.AbstractMap.getOrElse(Map.scala:59)
              at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
              at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
              at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.Iterator$class.foreach(Iterator.scala:893)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
              at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
              at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
              at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
              at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
              at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
              at org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
              at org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:229)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:228)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:537)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:536)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:625)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:532)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:224)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
              at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
              at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:147)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
              at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
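
      Both stacks fail at the same point: OrcRelation.setRequiredColumns resolves the requested logical column names against the ORC file's physical schema via StructType.fieldIndex. A minimal sketch of why that lookup throws for Hive-1.x-written files (the schema below is illustrative):

      import org.apache.spark.sql.types._

      // Physical schema as recorded in a Hive-1.x ORC file footer (illustrative):
      val physical = StructType(Seq(
        StructField("_col0", IntegerType),
        StructField("_col1", IntegerType)))

      // setRequiredColumns looks requested columns up by name, so a logical
      // metastore name is never found in the physical schema:
      physical.fieldIndex("cd_demo_sk")
      // => java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.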
      

      The TPC-DS dataset generator writes ORC files whose physical schema uses positional column names (_col0, _col1, ...), and the mapping to the logical column names is maintained in the Hive metastore. This mapping is somehow broken in master, causing these exceptions. For example:

      Structure for /apps/hive/warehouse/tpcds_bin_partitioned_orc_200.db/catalog_returns/cr_returned_date_sk=2451916/000019_0
      Type: struct<_col0:int,_col1:int,_col2:int,_col3:int,_col4:int,_col5:int,_col6:int,_col7:int,_col8:int,_col9:int,_col10:int,_col11:int,_col12:int,_col13:int,_col14:int,_col15:bigint,_col16:int,_col17:float,_col18:float,_col19:float,_col20:float,_col21:float,_col22:float,_col23:float,_col24:float,_col25:float>
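
      The same physical schema is visible from Spark when the file is read directly, bypassing the metastore (a sketch; output abbreviated):

      // Reading the ORC file directly (no metastore) exposes the positional names:
      spark.read.orc("/apps/hive/warehouse/tpcds_bin_partitioned_orc_200.db/" +
        "catalog_returns/cr_returned_date_sk=2451916/000019_0").printSchema()
      // root
      //  |-- _col0: integer (nullable = true)
      //  |-- _col1: integer (nullable = true)
      //  ...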
      

      Creating this ticket because this used to work in earlier branches.
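
      Until this is fixed, one likely workaround (an assumption, not verified in this ticket) is to disable the metastore ORC conversion so reads go through the Hive SerDe, which maps columns positionally via the metastore schema:

      // Possible workaround (unverified): fall back to the Hive SerDe read path.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")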


            People

    Assignee: Dongjoon Hyun (dongjoon)
    Reporter: Rajesh Balamohan (rajesh.balamohan)
    Votes: 0
    Watchers: 5
