  1. Spark
  2. SPARK-5151

Parquet Predicate Pushdown Does Not Work with Nested Structures.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
    • Environment:

      pyspark, spark-ec2 created cluster

      Description

      I have JSON files of objects with a nested structure, roughly of the form:
      { id: 123, event: "login", meta_data: { user: "user1" } }
      ....
      { id: 125, event: "login", meta_data: { user: "user2" } }

      I load the data via Spark with

      rdd = sql_context.jsonFile()

      # save it as a parquet file
      rdd.saveAsParquetFile()

      rdd = sql_context.parquetFile()
      rdd.registerTempTable('events')

      If I run this query with predicate pushdown disabled, it works without issue:

      select count(1) from events where meta_data.user = "user1"
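A minimal sketch of how the two runs can be compared. It assumes the Parquet pushdown flag is `spark.sql.parquet.filterPushdown` and that it can be set with a SQL `SET` statement on the same context. The helper name `count_user_events` is hypothetical; `sql_context` is the context used in the steps above.

```python
PUSHDOWN_KEY = "spark.sql.parquet.filterPushdown"  # assumed config key


def count_user_events(sql_context, user, pushdown):
    """Run the count query with Parquet predicate pushdown on or off.

    Hypothetical helper: `sql_context` only needs a `sql(query)` method
    whose result has `collect()`, as a Spark SQLContext provides.
    """
    flag = "true" if pushdown else "false"
    # Toggle pushdown for later queries issued through this context.
    sql_context.sql("SET {0}={1}".format(PUSHDOWN_KEY, flag))
    query = (
        "select count(1) from events "
        'where meta_data.user = "{0}"'.format(user)
    )
    return sql_context.sql(query).collect()
```

Calling `count_user_events(sql_context, "user1", False)` and then again with `True` runs the same query under both settings, so the only difference between the two runs is the pushdown flag.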

      If I enable predicate pushdown, I get an error saying meta_data.user is not in the schema:

      Py4JJavaError: An error occurred while calling o218.collect.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not found in schema!
      at parquet.Preconditions.checkArgument(Preconditions.java:47)
      at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
      at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
      at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
      at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
      at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
      at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
      .....
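The message names the column as `user` rather than `meta_data.user`, which suggests the filter may reach Parquet carrying only the leaf field name. The following is a plain-Python illustration of that difference, not Spark or Parquet code: the `SCHEMA` dict and both lookup functions are hypothetical.

```python
# Toy nested schema mirroring the example records (hypothetical).
SCHEMA = {
    "id": "int",
    "event": "string",
    "meta_data": {"user": "string"},
}


def find_by_leaf_name(schema, name):
    """Look the name up among top-level fields only."""
    if name not in schema:
        raise ValueError("Column [%s] was not found in schema!" % name)
    return schema[name]


def find_by_path(schema, dotted_path):
    """Walk the schema one path segment at a time."""
    node = schema
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise ValueError(
                "Column [%s] was not found in schema!" % dotted_path
            )
        node = node[part]
    return node
```

In this toy model, `find_by_path(SCHEMA, "meta_data.user")` resolves the nested field, while `find_by_leaf_name(SCHEMA, "user")` raises an error with the same wording as the one in the trace.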

      I suspect this is related to another bug I filed, where nested structure is not preserved in Spark SQL.

              People

              • Assignee: Unassigned
              • Reporter: Brad Willard (brdwrd)
              • Votes: 0
              • Watchers: 14
