Description
In the master branch, I tried to run TPC-DS queries (e.g. Query 27) at 200 GB scale. Initially I got the following exception (FileScanRDD has been made the default scan path in the master branch):
16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0.....
java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
    at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:59)
    at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
    at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
    at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
    at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
    at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
    at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
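For reference, a minimal sketch of how this can be reproduced, assuming a Hive-enabled sqlContext in spark-shell; the specific query is illustrative, since any projection over one of these ORC tables hits the same code path:

    // Hypothetical minimal reproduction; the database, table, and column
    // names are the TPC-DS ones from the trace above.
    sqlContext.sql("USE tpcds_bin_partitioned_orc_200")

    // Fails at execution time inside FileScanRDD:
    sqlContext.sql("SELECT s_store_sk FROM store LIMIT 10").show()
    // => java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.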
When running with "spark.sql.sources.fileScan=false", the following exception is thrown instead:
16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING
java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
    at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:59)
    at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
    at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
    at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
    at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
    at org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
    at org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:229)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:228)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:537)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:536)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:625)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:532)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:224)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:147)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
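A sketch of how the flag was flipped for that second run, reusing the illustrative repro above ("spark.sql.sources.fileScan" is the master-branch flag named in this report):

    // Fall back from the FileScanRDD path to the old buildInternalScan path.
    sqlContext.setConf("spark.sql.sources.fileScan", "false")

    // Still fails, but now during query planning (OrcTableScan.execute
    // in the trace above) rather than at task execution time:
    sqlContext.sql("SELECT s_store_sk FROM store LIMIT 10").show()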
The TPC-DS data generator writes ORC files whose physical column names (_col0, _col1, ...) differ from the logical table columns, and relies on the Hive metastore to maintain the mapping between the two. That mapping is somehow broken in master, causing these exceptions. For example:
Structure for /apps/hive/warehouse/tpcds_bin_partitioned_orc_200.db/catalog_returns/cr_returned_date_sk=2451916/000019_0
Type: struct<_col0:int,_col1:int,_col2:int,_col3:int,_col4:int,_col5:int,_col6:int,_col7:int,_col8:int,_col9:int,_col10:int,_col11:int,_col12:int,_col13:int,_col14:int,_col15:bigint,_col16:int,_col17:float,_col18:float,_col19:float,_col20:float,_col21:float,_col22:float,_col23:float,_col24:float,_col25:float>
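The failure reduces to the schema lookup itself; a minimal sketch (hypothetical, schemas abbreviated) of what OrcRelation.setRequiredColumns effectively does when handed such a file:

    import org.apache.spark.sql.types._

    // Physical schema as written into the ORC file: positional names only.
    val physicalSchema = StructType(Seq(
      StructField("_col0", IntegerType),
      StructField("_col1", IntegerType)))

    // The logical column names are known only to the Hive metastore.
    val requestedColumns = Seq("s_store_sk")

    // fieldIndex throws, since the file schema has no field by that name:
    // java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
    val columnIds = requestedColumns.map(physicalSchema.fieldIndex)

Hive itself reads these files fine because it resolves ORC columns by position against the metastore schema rather than by name.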
Creating this ticket as this used to work in earlier branches.
Issue Links
- blocks:
  - SPARK-20901 Feature parity for ORC with Parquet (Open)
- is duplicated by:
  - SPARK-15757 Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file (Resolved)
  - SPARK-16605 Spark2.0 cannot "select" data from a table stored as an orc file which has been created by hive while hive or spark1.6 supports (Closed)
- relates to:
  - SPARK-16628 OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files (Resolved)