Spark / SPARK-39393

Parquet data source only supports push-down predicate filters for non-repeated primitive types


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1
    • Fix Version/s: 3.1.3, 3.2.2, 3.3.0, 3.4.0
    • Component/s: SQL

    Description

      I will use an example to illustrate the problem. The cause of the problem and an approach to solving it are described below.

      Assume the following Protocol Buffers schema:

      message Model {
           string name = 1;
           repeated string keywords = 2;
      }
      

      Suppose a Parquet file is created from a set of records in the above format with the help of the parquet-protobuf library.
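
      For reference, writing such a file might look like the sketch below, assuming Model is the Java class that protoc generates for the schema above (the class name and output path are hypothetical):

      import org.apache.hadoop.fs.Path
      import org.apache.parquet.proto.ProtoParquetWriter

      // Hypothetical repro writer: Model is the protoc-generated class for the
      // schema above; the output path is arbitrary.
      val writer = new ProtoParquetWriter[Model](
        new Path("/path/to/parquet/part-0.parquet"), classOf[Model])
      writer.write(Model.newBuilder().setName("a").addKeywords("X").build())
      writer.close()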

      Using Spark 3.0.2 or older, we could run the following query in spark-shell:

      val data = spark.read.parquet("/path/to/parquet")
      data.createOrReplaceTempView("models")
      spark.sql("select * from models where array_contains(keywords, 'X')").show(false)
      

      But after upgrading to a newer Spark, we get the following error:

      Caused by: java.lang.IllegalArgumentException: FilterPredicates do not currently support repeated columns. Column keywords is repeated.
        at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:176)
        at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
        at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:89)
        at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
        at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192)
        at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
        at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95)
        at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45)
        at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
        at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:72)
        at org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:870)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:789)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:373)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
      ...
      

      At first glance, the problem seems to be in the Parquet library. In fact, though, the line that throws this error has been in Parquet since 2014 (based on Git history), so the regression must be on the Spark side:

      Parquet Schema Compatibility Validator

      After some checking, I noticed that the cause of the problem is a change in the filters that Spark pushes down:

      spark.sql("select * from log where array_contains(keywords, 'X')").explain(true);
      
      // Spark 3.0.2 and older
      == Physical Plan ==
      ... 
      +- FileScan parquet [link#0,keywords#1]
        DataFilters: [array_contains(keywords#1, Google)]
        PushedFilters: []
        ...
      
      // Spark 3.1.0 and newer
      == Physical Plan == ... 
      +- FileScan parquet [link#0,keywords#1]
        DataFilters: [isnotnull(keywords#1),  array_contains(keywords#1, Google)]
        PushedFilters: [IsNotNull(keywords)]
        ...

      It is good that filter pushdown has become smarter. Unfortunately, due to my unfamiliarity with the code base, I could not find the exact location of the change or the related pull request. In general, the extra filter is fine for non-repeated Parquet fields, but on a repeated field the Parquet library rejects it with the error above, as in the example given.
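
      For reference, the kind of guard needed is sketched below, assuming Parquet's schema API; the helper name is hypothetical. The idea is that only non-repeated primitive fields should qualify for pushdown, mirroring the rule in SchemaCompatibilityValidator:

      import org.apache.parquet.schema.{MessageType, Type}

      // Hypothetical helper: a column qualifies for predicate pushdown only if
      // it is a non-repeated primitive field.
      def isPushdownEligible(schema: MessageType, fieldName: String): Boolean = {
        val field = schema.getType(fieldName)
        field.isPrimitive && field.getRepetition != Type.Repetition.REPEATED
      }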

      The only temporary workaround, in my opinion, is to disable the following setting, which in general greatly reduces performance:

      SET spark.sql.parquet.filterPushdown=false 
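
      The same workaround can also be applied per session from spark-shell:

      // Disables all Parquet filter pushdown for the current session, not just
      // for repeated columns, so scans may read much more data.
      spark.conf.set("spark.sql.parquet.filterPushdown", "false")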

      I have created a patch for this bug and will send a pull request soon.


          People

            Assignee: Amin Borjian
            Reporter: Amin Borjian
