Spark / SPARK-17213

Parquet String Pushdown for Non-Eq Comparisons Broken

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Spark orders strings by comparing their UTF-8 byte arrays, treating each byte as an unsigned integer. Parquet does not currently respect this ordering. A fix is in progress on the Parquet side (JIRA and PR links below), but until it lands all filters over strings are broken, and for > and < this is an actual correctness issue.

      Repro:
      Querying directly from in-memory DataFrame:

          > Seq("a", "é").toDF("name").where("name > 'a'").count
          1
      

      Querying from a parquet dataset:

          > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
          > spark.read.parquet("/tmp/bad").where("name > 'a'").count
          0
      

      This happens because Spark sorts the rows as [a, é], but Parquet compares strings as signed byte arrays. Both bytes of the UTF-8 encoding of é have the high bit set, so they are negative when read as signed bytes and é sorts before a. Parquet therefore writes 1 row group with statistics min=é, max=a, and the query's filter drops that row group.
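
      The mismatch can be reproduced without Spark or Parquet. The following is a minimal sketch, not the code of either project: `ByteOrderingSketch`, `signedCompare`, `unsignedCompare`, and `canSkipForGreaterThan` are hypothetical names, and the pruning rule is a simplified model that skips a row group for `name > literal` when the recorded max is not greater than the literal.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ByteOrderingSketch {
  // Lexicographic comparison of two byte arrays; `cmp` decides how one pair of bytes compares.
  private def compareWith(x: Array[Byte], y: Array[Byte])(cmp: (Byte, Byte) => Int): Int =
    x.zip(y).iterator
      .map { case (a, b) => cmp(a, b) }
      .find(_ != 0)
      .getOrElse(x.length - y.length)

  // Bytes read as signed values (-128..127), the behaviour the description attributes to Parquet.
  def signedCompare(x: Array[Byte], y: Array[Byte]): Int =
    compareWith(x, y)((a, b) => a.toInt - b.toInt)

  // Bytes read as unsigned values (0..255), the ordering the description attributes to Spark.
  def unsignedCompare(x: Array[Byte], y: Array[Byte]): Int =
    compareWith(x, y)((a, b) => (a & 0xff) - (b & 0xff))

  // Simplified model (an assumption): a row group is skipped for `name > literal`
  // when its max statistic, computed with `compare`, is not greater than the literal.
  def canSkipForGreaterThan(values: Seq[String], literal: String)(
      compare: (Array[Byte], Array[Byte]) => Int): Boolean = {
    val max = values.map(_.getBytes(UTF_8)).reduce((x, y) => if (compare(x, y) >= 0) x else y)
    compare(max, literal.getBytes(UTF_8)) <= 0
  }
}
```

      With the values from the repro, `ByteOrderingSketch.canSkipForGreaterThan(Seq("a", "\u00e9"), "a")(ByteOrderingSketch.signedCompare)` reports that the row group can be skipped, because the signed max is a, while the same call with `unsignedCompare` keeps it.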

      Because of the way Parquet pushes down Eq, equality filters do not affect correctness, but they force you to read row groups you should be able to skip.
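
      One possible way to sidestep the incorrect row-group pruning is to turn off Parquet filter pushdown for the session, so that Spark evaluates the predicate itself after reading the data. This is a sketch only: it assumes a spark-shell session with `spark` in scope and the /tmp/bad dataset from the repro, and it gives up row-group skipping for every Parquet filter, not just string filters.

```scala
// Assumption: run inside spark-shell, where `spark` is the active SparkSession.
// Disable Parquet filter pushdown so no row groups are pruned from statistics.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// Re-run the query from the repro; the filter is now applied by Spark after the scan.
spark.read.parquet("/tmp/bad").where("name > 'a'").count
```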

      Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
      Link to PR: https://github.com/apache/parquet-mr/pull/362

            People

              Assignee: Cheng Lian
              Reporter: Andrew Duffy (andreweduffy)