Parquet "starts with" filter is not null-safe


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places, but Spark somehow doesn't trigger those code paths.

      Basically, UserDefinedPredicate.keep should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489.
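
      For context, Spark pushes a "starts with" filter down to the Parquet reader as a user-defined predicate, and Parquet 1.11's column-index filtering evaluates that predicate against null values too. A minimal repro sketch under that assumption (the path /tmp/strings and the column name "s" are made up for illustration):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        // Assumes a Parquet file with a nullable string column "s".
        // The filter below is pushed down as a StringStartsWith predicate;
        // with column-index filtering, Parquet calls keep(null) for null values.
        spark.read.parquet("/tmp/strings")
          .filter($"s".startsWith("abc"))
          .count()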

      Failure I was getting:

      Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
        at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
        at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
        at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
        at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
        at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
        at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
        ...
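
      For illustration, a null-safe keep checks for null before dereferencing the value. The following is only a sketch of that pattern; the class name and the conservative canDrop/inverseCanDrop stubs are illustrative, not Spark's actual ParquetFilters code:

        import org.apache.parquet.filter2.predicate.{Statistics, UserDefinedPredicate}
        import org.apache.parquet.io.api.Binary

        // Hypothetical "starts with" predicate showing the null check that
        // PARQUET-1489 clarified keep() must perform.
        class StartsWith(prefix: String) extends UserDefinedPredicate[Binary] with Serializable {
          private val prefixBytes = prefix.getBytes("UTF-8")

          override def keep(value: Binary): Boolean = {
            // Parquet may pass null when the column value is null, so an
            // unguarded implementation throws the NullPointerException above.
            value != null && value.getBytes.startsWith(prefixBytes)
          }

          // Conservative stubs: never prune a row group based on statistics.
          override def canDrop(statistics: Statistics[Binary]): Boolean = false
          override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
        }

      Attached through FilterApi.userDefined, such a predicate simply returns false for nulls instead of crashing.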
      

          People

            vanzin Marcelo Masiero Vanzin
            vanzin Marcelo Masiero Vanzin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

