Spark / SPARK-14387

Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.1, 2.2.0
    • Fix Version/s: 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      In the master branch, I tried to run TPC-DS queries (e.g. Query27) at 200 GB scale. Initially I got the following exception (as FileScanRDD has been made the default in the master branch):

      16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0..... java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
      at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
      at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
      at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
      at scala.collection.AbstractMap.getOrElse(Map.scala:59)
      at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
      at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
      at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
      at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
      at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
      at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
      at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
      at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
      at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
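
      The lookup that fails is StructType.fieldIndex inside OrcRelation.setRequiredColumns: the ORC files written by Hive 1.x appear to carry only positional column names (_col0, _col1, ...), so resolving a logical column name against the file's physical schema cannot succeed. A minimal sketch of the mismatch (the schema below is illustrative, not the actual store table):

      import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

      // Physical schema as recorded in a Hive-1.x ORC file: positional names only.
      val physicalSchema = StructType(Seq(
        StructField("_col0", IntegerType),
        StructField("_col1", IntegerType)))

      // Looking the logical column up by name throws, exactly as in the trace above.
      physicalSchema.fieldIndex("s_store_sk")
      // => java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.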
      

      When running with "spark.sql.sources.fileScan=false", the following exception is thrown:

      16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
      java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
              at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
              at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
              at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
              at scala.collection.AbstractMap.getOrElse(Map.scala:59)
              at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
              at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
              at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
              at scala.collection.Iterator$class.foreach(Iterator.scala:893)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
              at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
              at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
              at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
              at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
              at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
              at org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
              at org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:229)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:228)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:537)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:536)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:625)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:532)
              at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:224)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
              at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
              at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:147)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
              at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
              at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      

      The TPC-DS dataset generator (writing through Hive) produces ORC files whose physical schema uses positional column names and keeps the mapping to the logical column names in the Hive metastore. This mapping is somehow broken in master, causing these exceptions.
      For example:

      Structure for /apps/hive/warehouse/tpcds_bin_partitioned_orc_200.db/catalog_returns/cr_returned_date_sk=2451916/000019_0
      Type: struct<_col0:int,_col1:int,_col2:int,_col3:int,_col4:int,_col5:int,_col6:int,_col7:int,_col8:int,_col9:int,_col10:int,_col11:int,_col12:int,_col13:int,_col14:int,_col15:bigint,_col16:int,_col17:float,_col18:float,_col19:float,_col20:float,_col21:float,_col22:float,_col23:float,_col24:float,_col25:float>
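
      Reading such a file therefore means reconciling the metastore's logical schema with the file's positional one by ordinal position rather than by name. A minimal sketch of that idea, assuming both schemas list columns in the same order (the logical names below are an illustrative subset, not the real catalog_returns mapping):

      import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

      // Logical schema as the Hive metastore reports it (illustrative subset).
      val metastoreSchema = StructType(Seq(
        StructField("cr_returned_time_sk", IntegerType),
        StructField("cr_item_sk", IntegerType),
        StructField("cr_net_loss", FloatType)))

      // Physical schema inside the ORC file: positional names, same order.
      val fileSchema = StructType(Seq(
        StructField("_col0", IntegerType),
        StructField("_col1", IntegerType),
        StructField("_col2", FloatType)))

      // Resolve a requested logical column to its physical field by ordinal.
      def physicalField(logicalName: String): StructField =
        fileSchema(metastoreSchema.fieldIndex(logicalName))

      physicalField("cr_net_loss")  // StructField(_col2, FloatType, true)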
      

      Creating this ticket as this used to work in earlier branches.
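
      For reference, the configuration in the title controls whether Spark converts metastore ORC tables to its own data source read path at all. As a hedged workaround sketch (the behavior per release is an assumption, not something verified in this report), disabling the conversion on affected versions falls back to the Hive SerDe reader, which still honors the metastore mapping:

      // Assumes a running SparkSession (`spark`, as in spark-shell).
      // Hedged workaround: read metastore ORC tables through the Hive SerDe
      // path instead of the converted data source path.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      spark.sql("SELECT count(*) FROM tpcds_bin_partitioned_orc_200.catalog_returns").show()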

      Attachments

        Issue Links

        Activity


          People

            Assignee: Dongjoon Hyun (dongjoon)
            Reporter: Rajesh Balamohan (rajesh.balamohan)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
