Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-14962

spark.sql.orc.filterPushdown=true breaks DataFrame where functionality

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When running spark-shell with the configuration "spark.sql.orc.filterPushdown=true", the DataFrame function where and filter have an error. In particular, "column is not null" fails.

      Example Code:

      import sqlContext.implicits._
      case class MyData(string_field: String, array_field: Seq[String])

      val myDataArray = Array(
      MyData("foo", Seq("bar")),
      MyData("foobar", null)
      )

      val myDataDF = sc.parallelize(myDataArray).toDF
      myDataDF.count // 2
      myDataDF.where("array_field is null").count // 1
      myDataDF.where("array_field is not null").count // 1

      myDataDF.write.format("orc").save("/tmp/mydata.orc")

      val myLoadedDataDF = sqlContext.read.format("orc").load("/tmp/mydata.orc")
      myLoadedDataDF.count // 2
      myLoadedDataDF.where("array_field is not null").count // 0 incorrect

        Issue Links

          Activity

          Hide
          hyukjin.kwon Hyukjin Kwon added a comment - - edited

          FWIW, I could not reproduce this in master branch. Let me try to find the related JIRA. I forgot to enable predicate pushdown. I can reproduce in the master. Let me look into this.

          Show
          hyukjin.kwon Hyukjin Kwon added a comment - - edited FWIW, I could not reproduce this in master branch. Let me try to find the related JIRA. I forgot to enable predicate pushdown. I can reproduce in the master. Let me look into this.
          Hide
          hyukjin.kwon Hyukjin Kwon added a comment -

          I see. This was because ORC tries to apply a filter on the column which has the type ORC does not supprot (ArrayType).

          ORC supports String, Long, Double, Byte, Short, Integer, Float, DateWritable, HiveDecimal, HiveChar and HiveVarchar. This is apparently OK because it does not take a type as an argument (Hive 1.2.x) during building filters (SearchArgument) but this became problematic when tries to evaluate.

          Currently IsNull and IsNotNull can be built for ORC on all types in Spark-side but it does not filter correctly because stored statistics always produces null for not supported types, eg ArrayType in ORC-side. So, it is always true for IsNull which ends up always false for IsNotNull.

          This was prevented in Hive 1.3.x by forcing to give a type when building a filter (SearchArgument but Hive 1.2.x is not doing this.

          Let me submit a PR with a more clear description.

          Show
          hyukjin.kwon Hyukjin Kwon added a comment - I see. This was because ORC tries to apply a filter on the column which has the type ORC does not supprot ( ArrayType ). ORC supports String , Long , Double , Byte , Short , Integer , Float , DateWritable , HiveDecimal , HiveChar and HiveVarchar . This is apparently OK because it does not take a type as an argument (Hive 1.2.x) during building filters ( SearchArgument ) but this became problematic when tries to evaluate. Currently IsNull and IsNotNull can be built for ORC on all types in Spark-side but it does not filter correctly because stored statistics always produces null for not supported types, eg ArrayType in ORC-side. So, it is always true for IsNull which ends up always false for IsNotNull . This was prevented in Hive 1.3.x by forcing to give a type when building a filter ( SearchArgument but Hive 1.2.x is not doing this. Let me submit a PR with a more clear description.
          Hide
          apachespark Apache Spark added a comment -

          User 'HyukjinKwon' has created a pull request for this issue:
          https://github.com/apache/spark/pull/12777

          Show
          apachespark Apache Spark added a comment - User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/12777
          Hide
          lian cheng Cheng Lian added a comment -

          Issue resolved by pull request 12777
          https://github.com/apache/spark/pull/12777

          Show
          lian cheng Cheng Lian added a comment - Issue resolved by pull request 12777 https://github.com/apache/spark/pull/12777

            People

            • Assignee:
              hyukjin.kwon Hyukjin Kwon
              Reporter:
              jdfinsf Justin Foster
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development