  1. Spark
  2. SPARK-40280

Failure to create parquet predicate push down for ints and longs on some valid files


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.2.0, 3.3.0, 3.4.0
    • Fix Version/s: 3.3.1, 3.2.3, 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      The parquet format specification states that...

      INT(8, true), INT(16, true), and INT(32, true) must annotate an int32 primitive type and INT(64, true) must annotate an int64 primitive type. INT(32, true) and INT(64, true) are implied by the int32 and int64 primitive types if no other annotation is present and should be considered optional.

      But the code inside of ParquetFilters.scala requires that int32 and int64 columns carry no annotation at all. If such a column does have an annotation and is part of a predicate push down, the hard-coded types will not match and the corresponding filter ends up being None.
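      The mismatch described above can be illustrated with a small, self-contained model. This is only a sketch, not Spark's code: the names `ColumnType`, `IntAnnotation`, `strict_filter_kind`, and `lenient_filter_kind` are hypothetical, and the lookup table is an assumed stand-in for the hard-coded type matching. It covers only the two implied annotations quoted from the specification and shows one possible way to treat them as equivalent to no annotation.

```python
from typing import NamedTuple, Optional


class IntAnnotation(NamedTuple):
    """Hypothetical stand-in for a Parquet INT(bitWidth, signed) annotation."""
    bit_width: int
    signed: bool


class ColumnType(NamedTuple):
    """Hypothetical (annotation, primitive type) pair for one column."""
    annotation: Optional[IntAnnotation]
    primitive: str


# Assumed model of a hard-coded table: only un-annotated int32/int64 match.
SUPPORTED = {
    ColumnType(None, "INT32"): "int",
    ColumnType(None, "INT64"): "long",
}

# Annotations that the specification says are implied by the primitive type.
IMPLIED = {
    "INT32": IntAnnotation(32, True),
    "INT64": IntAnnotation(64, True),
}


def strict_filter_kind(col: ColumnType) -> Optional[str]:
    """Exact match only: an explicit INT(32, true) annotation yields None."""
    return SUPPORTED.get(col)


def lenient_filter_kind(col: ColumnType) -> Optional[str]:
    """Drop an annotation that merely restates the implied default, then match.

    Other annotations (for example INT(8, true)) are outside this sketch
    and are left unchanged, so they still return None here.
    """
    if col.annotation is not None and IMPLIED.get(col.primitive) == col.annotation:
        col = ColumnType(None, col.primitive)
    return SUPPORTED.get(col)


plain = ColumnType(None, "INT32")
annotated = ColumnType(IntAnnotation(32, True), "INT32")

print(strict_filter_kind(plain))       # matches
print(strict_filter_kind(annotated))   # no match: the filter would be dropped
print(lenient_filter_kind(annotated))  # matches after normalization
```

      In this model the strict lookup returns nothing for the annotated column even though the file is valid, which corresponds to the filter becoming None in the report.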

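      Without the pushed-down filter, Spark has to read and decode data that the Parquet statistics could have let it skip, which can be a huge performance penalty for a valid parquet file.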

      I am happy to provide files that show the issue if needed for testing.


          People

            Assignee: revans2 Robert Joseph Evans
            Reporter: revans2 Robert Joseph Evans
            Votes: 0
            Watchers: 5
