  Spark / SPARK-5463 Improve Parquet support (reliability, performance, and error messages) / SPARK-6554

Cannot use partition columns in where clause when Parquet filter push-down is enabled


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: SQL
    • Labels: None

    Description

      I'm having trouble referencing partition columns in my queries with Parquet.

      In the following example, 'probeTypeId' is a partition column; the directory structure looks like this:

      /mydata
          /probeTypeId=1
              ...files...
          /probeTypeId=2
              ...files...
      

      I see the column when I load a DataFrame from the /mydata directory and call df.printSchema():

       |-- probeTypeId: integer (nullable = true)
      

      Parquet is also aware of the column:

           optional int32 probeTypeId;
      

      And this works fine:

      sqlContext.sql("select probeTypeId from df limit 1");
      

      ...as does df.show(), which shows the correct values for the partition column.
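
      For completeness, the DataFrame is presumably loaded and registered along these lines (a minimal Spark 1.3 sketch; the /mydata path and the temp-table name "df" come from this description and stand in for the actual code):

      // Sketch only: assumes an existing SparkContext `sc`. The log below suggests a
      // HiveContext is in use, but a plain SQLContext exposes the same calls.
      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc)

      // Partition discovery on /mydata/probeTypeId=N/... adds probeTypeId to the schema.
      val df = sqlContext.parquetFile("/mydata")
      df.printSchema()                       // |-- probeTypeId: integer (nullable = true)

      df.registerTempTable("df")
      sqlContext.sql("select probeTypeId from df limit 1").show()   // works
      df.show()                                                     // also works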

      However, when I try to use a partition column in a where clause, I get an exception stating that the column was not found in the schema:

      sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
      ...
      ...
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
      	at parquet.Preconditions.checkArgument(Preconditions.java:47)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
      	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
      	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
      	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
      	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
      ...
      ...
      

      Here's the full console output, including the stack trace:

      using local[*] for master
      06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
      06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
      06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
      06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
      06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
      06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
      06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
      06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current configuration as safe fallback point
      
      INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
      WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      INFO  org.apache.spark.SecurityManager Changing view acls to: jon
      INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
      INFO  org.apache.spark.SecurityManager SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jon); users with modify permissions: Set(jon)
      INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
      INFO  Remoting Starting remoting
      INFO  Remoting Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.134:62493]
      INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on port 62493.
      INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
      INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
      INFO  o.a.spark.storage.DiskBlockManager Created local directory at /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
      INFO  org.apache.spark.storage.MemoryStore MemoryStore started with capacity 1966.1 MB
      INFO  org.apache.spark.HttpFileServer HTTP File server directory is /var/folders/x7/9hdp8kw9569864088tsl4jmm0000gn/T/spark-cf4687bd-1563-4ddf-b697-21c96fd95561/httpd-6343b9c9-bb66-43ac-ac43-6da80c7a1f95
      INFO  org.apache.spark.HttpServer Starting HTTP Server
      INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
      INFO  o.s.jetty.server.AbstractConnector Started SocketConnector@0.0.0.0:62494
      INFO  org.apache.spark.util.Utils Successfully started service 'HTTP file server' on port 62494.
      INFO  org.apache.spark.SparkEnv Registering OutputCommitCoordinator
      INFO  o.spark-project.jetty.server.Server jetty-8.y.z-SNAPSHOT
      INFO  o.s.jetty.server.AbstractConnector Started SelectChannelConnector@0.0.0.0:4040
      INFO  org.apache.spark.util.Utils Successfully started service 'SparkUI' on port 4040.
      INFO  org.apache.spark.ui.SparkUI Started SparkUI at http://192.168.1.134:4040
      INFO  org.apache.spark.executor.Executor Starting executor ID <driver> on host localhost
      INFO  org.apache.spark.util.AkkaUtils Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.134:62493/user/HeartbeatReceiver
      INFO  o.a.s.n.n.NettyBlockTransferService Server created on 62495
      INFO  o.a.spark.storage.BlockManagerMaster Trying to register BlockManager
      INFO  o.a.s.s.BlockManagerMasterActor Registering block manager localhost:62495 with 1966.1 MB RAM, BlockManagerId(<driver>, localhost, 62495)
      INFO  o.a.spark.storage.BlockManagerMaster Registered BlockManager
      INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
      INFO  o.a.h.conf.Configuration.deprecation mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
      INFO  o.a.h.conf.Configuration.deprecation mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
      INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
      INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
      INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
      INFO  o.a.h.conf.Configuration.deprecation mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
      INFO  o.a.h.conf.Configuration.deprecation mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
      INFO  o.a.h.hive.metastore.HiveMetaStore 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
      INFO  o.a.h.hive.metastore.ObjectStore ObjectStore, initialize called
      INFO  DataNucleus.Persistence Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
      INFO  DataNucleus.Persistence Property datanucleus.cache.level2 unknown - will be ignored
      INFO  o.a.h.hive.metastore.ObjectStore Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
      INFO  o.a.h.h.metastore.MetaStoreDirectSql MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: "@" (64), after : "".
      INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
      INFO  DataNucleus.Datastore The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
      INFO  DataNucleus.Query Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
      INFO  o.a.h.hive.metastore.ObjectStore Initialized ObjectStore
      INFO  o.a.h.hive.metastore.HiveMetaStore Added admin role in metastore
      INFO  o.a.h.hive.metastore.HiveMetaStore Added public role in metastore
      INFO  o.a.h.hive.metastore.HiveMetaStore No user is added in admin role, since config is empty
      INFO  o.a.h.hive.ql.session.SessionState No Tez session required at this point. hive.execution.engine=mr.
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      root
       |-- clientMarketId: integer (nullable = true)
       |-- clientCountryId: integer (nullable = true)
       |-- clientRegionId: integer (nullable = true)
       |-- clientStateId: integer (nullable = true)
       |-- clientAsnId: integer (nullable = true)
       |-- reporterZoneId: integer (nullable = true)
       |-- reporterCustomerId: integer (nullable = true)
       |-- responseCode: integer (nullable = true)
       |-- measurementValue: integer (nullable = true)
       |-- year: integer (nullable = true)
       |-- month: integer (nullable = true)
       |-- day: integer (nullable = true)
       |-- providerOwnerZoneId: integer (nullable = true)
       |-- providerOwnerCustomerId: integer (nullable = true)
       |-- providerId: integer (nullable = true)
       |-- probeTypeId: integer (nullable = true)
      
      ======================================================
      INFO  hive.ql.parse.ParseDriver Parsing command: select probeTypeId from df where probeTypeId = 1 limit 1
      INFO  hive.ql.parse.ParseDriver Parse Completed
      ==== results for select probeTypeId from df where probeTypeId = 1 limit 1
      ======================================================
      INFO  o.a.s.sql.parquet.ParquetRelation2 Reading 33.33333333333333% of partitions
      INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(191336) called with curMem=0, maxMem=2061647216
      INFO  org.apache.spark.storage.MemoryStore Block broadcast_0 stored as values in memory (estimated size 186.9 KB, free 1966.0 MB)
      INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(27530) called with curMem=191336, maxMem=2061647216
      INFO  org.apache.spark.storage.MemoryStore Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.9 KB, free 1965.9 MB)
      INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_0_piece0 in memory on localhost:62495 (size: 26.9 KB, free: 1966.1 MB)
      INFO  o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_0_piece0
      INFO  org.apache.spark.SparkContext Created broadcast 0 from NewHadoopRDD at newParquet.scala:447
      INFO  o.a.h.conf.Configuration.deprecation mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
      INFO  o.a.h.conf.Configuration.deprecation mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
      INFO  o.a.s.s.p.ParquetRelation2$$anon$1$$anon$2 Using Task Side Metadata Split Strategy
      INFO  org.apache.spark.SparkContext Starting job: runJob at SparkPlan.scala:121
      INFO  o.a.spark.scheduler.DAGScheduler Got job 0 (runJob at SparkPlan.scala:121) with 1 output partitions (allowLocal=false)
      INFO  o.a.spark.scheduler.DAGScheduler Final stage: Stage 0(runJob at SparkPlan.scala:121)
      INFO  o.a.spark.scheduler.DAGScheduler Parents of final stage: List()
      INFO  o.a.spark.scheduler.DAGScheduler Missing parents: List()
      INFO  o.a.spark.scheduler.DAGScheduler Submitting Stage 0 (MapPartitionsRDD[3] at map at SparkPlan.scala:96), which has no missing parents
      INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(5512) called with curMem=218866, maxMem=2061647216
      INFO  org.apache.spark.storage.MemoryStore Block broadcast_1 stored as values in memory (estimated size 5.4 KB, free 1965.9 MB)
      INFO  org.apache.spark.storage.MemoryStore ensureFreeSpace(3754) called with curMem=224378, maxMem=2061647216
      INFO  org.apache.spark.storage.MemoryStore Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.7 KB, free 1965.9 MB)
      INFO  o.a.spark.storage.BlockManagerInfo Added broadcast_1_piece0 in memory on localhost:62495 (size: 3.7 KB, free: 1966.1 MB)
      INFO  o.a.spark.storage.BlockManagerMaster Updated info of block broadcast_1_piece0
      INFO  org.apache.spark.SparkContext Created broadcast 1 from broadcast at DAGScheduler.scala:839
      INFO  o.a.spark.scheduler.DAGScheduler Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at SparkPlan.scala:96)
      INFO  o.a.s.scheduler.TaskSchedulerImpl Adding task set 0.0 with 1 tasks
      INFO  o.a.spark.scheduler.TaskSetManager Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1687 bytes)
      INFO  org.apache.spark.executor.Executor Running task 0.0 in stage 0.0 (TID 0)
      INFO  o.a.s.s.p.ParquetRelation2$$anon$1 Input split: ParquetInputSplit{part: file:/Users/jon/Downloads/sparksql/1partitionsminusgeo/year=2015/month=1/day=14/providerOwnerZoneId=0/providerOwnerCustomerId=0/providerId=287/probeTypeId=1/part-r-00001.parquet start: 0 end: 8851183 length: 8851183 hosts: [] requestedSchema: message root {
        optional int32 probeTypeId;
      }
       readSupportMetadata: {org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]}, org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"clientMarketId","type":"integer","nullable":true,"metadata":{}},{"name":"clientCountryId","type":"integer","nullable":true,"metadata":{}},{"name":"clientRegionId","type":"integer","nullable":true,"metadata":{}},{"name":"clientStateId","type":"integer","nullable":true,"metadata":{}},{"name":"clientAsnId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"reporterCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"responseCode","type":"integer","nullable":true,"metadata":{}},{"name":"measurementValue","type":"integer","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}},{"name":"month","type":"integer","nullable":true,"metadata":{}},{"name":"day","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerZoneId","type":"integer","nullable":true,"metadata":{}},{"name":"providerOwnerCustomerId","type":"integer","nullable":true,"metadata":{}},{"name":"providerId","type":"integer","nullable":true,"metadata":{}},{"name":"probeTypeId","type":"integer","nullable":true,"metadata":{}}]}}}
      ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 0.0 (TID 0)
      java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
      	at parquet.Preconditions.checkArgument(Preconditions.java:47) ~[parquet-common-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41) ~[parquet-hadoop-1.6.0rc3.jar:na]
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) ~[parquet-hadoop-1.6.0rc3.jar:na]
      	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) ~[parquet-column-1.6.0rc3.jar:na]
      	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) ~[parquet-hadoop-1.6.0rc3.jar:na]
      	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) ~[parquet-hadoop-1.6.0rc3.jar:na]
      	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) ~[parquet-hadoop-1.6.0rc3.jar:na]
      	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[spark-core_2.10-1.3.0.jar:1.3.0]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_31]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
      	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
      WARN  o.a.spark.scheduler.TaskSetManager Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
      	at parquet.Preconditions.checkArgument(Preconditions.java:47)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
      	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
      	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
      	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
      	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
      	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
      	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      ERROR o.a.spark.scheduler.TaskSetManager Task 0 in stage 0.0 failed 1 times; aborting job
      INFO  o.a.s.scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks have all completed, from pool 
      INFO  o.a.s.scheduler.TaskSchedulerImpl Cancelling stage 0
      INFO  o.a.spark.scheduler.DAGScheduler Job 0 failed: runJob at SparkPlan.scala:121, took 0.132538 s
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/static,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/executors,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/environment,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/storage,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/stages,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
      INFO  o.s.j.server.handler.ContextHandler stopped o.s.j.s.ServletContextHandler{/jobs,null}
      INFO  org.apache.spark.ui.SparkUI Stopped Spark web UI at http://192.168.1.134:4040
      INFO  o.a.spark.scheduler.DAGScheduler Stopping DAGScheduler
      INFO  o.a.s.MapOutputTrackerMasterActor MapOutputTrackerActor stopped!
      INFO  org.apache.spark.storage.MemoryStore MemoryStore cleared
      INFO  o.apache.spark.storage.BlockManager BlockManager stopped
      INFO  o.a.spark.storage.BlockManagerMaster BlockManagerMaster stopped
      INFO  o.a.s.s.OutputCommitCoordinator$OutputCommitCoordinatorActor OutputCommitCoordinator stopped!
      INFO  org.apache.spark.SparkContext Successfully stopped SparkContext
      
      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not found in schema!
      	at parquet.Preconditions.checkArgument(Preconditions.java:47)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
      	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
      	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
      	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
      	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
      	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
      	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
      	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
      	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
      	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      
      Driver stacktrace:
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
      	at scala.Option.foreach(Option.scala:236)
      	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
      	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      
      Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)
      Mar 26, 2015 6:06:02 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
      Mar 26, 2015 6:06:02 AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(probeTypeId, 1)
      
      Process finished with exit code 255
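
      The pushed-down predicate eq(probeTypeId, 1) appears to be validated against the schema stored in the Parquet files themselves, which does not include probeTypeId (that column exists only in the directory names); hence the IllegalArgumentException. Until the fix ships, a possible session-level workaround (untested here; spark.sql.parquet.filterPushdown is the configuration key that controls this code path) is to turn Parquet filter push-down back off:

      // Sketch: with push-down disabled, the partition-column predicate is handled by
      // Spark (partition pruning / Filter) instead of Parquet's row-group filter.
      sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
      sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1").show()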
      

People

    • Assignee: Cheng Lian
    • Reporter: Jon Chase
    • Votes: 0
    • Watchers: 5
