Spark / SPARK-25206

Wrong records are returned when the Hive metastore schema and the Parquet schema differ in letter case

    Details

      Description

      In Spark 2.3.1, the query below silently returns wrong data.

      spark.range(10).write.parquet("/tmp/data")
      sql("DROP TABLE IF EXISTS t")
      sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
      
      scala> sql("select * from t where id > 0").show
      +---+
      | ID|
      +---+
      +---+
      
      

       

      Root Cause

      A closer look shows two issues, both caused by letter-case differences between the Hive metastore schema and the Parquet schema.

      1. The wrong column is pushed down.

      Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down to Parquet, but no column named ID exists in /tmp/data: Parquet is case sensitive, and the column in the file is actually named id.
      As a result, no records are returned.

      Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema for the pushdown, which resolves this issue.
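The pushdown failure in item 1 can be illustrated with a deliberately simplified model. This is only a sketch and not Parquet's or Spark's actual code: `PushdownModel`, `fileRows`, and `filterGt` are hypothetical names, and the model assumes that a column missing from the file is treated as null, so a greater-than predicate on it never holds.

```scala
// Simplified model (assumption: not Parquet's real API).
// A row maps column names, exactly as stored in the file, to values.
object PushdownModel {
  type Row = Map[String, Long]

  // Rows as written by spark.range(10): one lowercase column named "id".
  val fileRows: Seq[Row] = (0L until 10L).map(v => Map("id" -> v))

  // Evaluate "column > bound" using a case-sensitive name lookup.
  // A column absent from the row is treated as null, so the predicate is false.
  def filterGt(rows: Seq[Row], column: String, bound: Long): Seq[Row] =
    rows.filter(row => row.get(column).exists(_ > bound))
}
```

In this model, `PushdownModel.filterGt(PushdownModel.fileRows, "ID", 0L)` keeps no rows, while the same call with `"id"` keeps the nine rows whose value is greater than 0.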

      2. Spark SQL returns NULL for a column whose name differs in letter case between the Hive metastore schema and the Parquet schema, even when spark.sql.caseSensitive is set to false.

      SPARK-25132 already addresses this issue.
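Both issues come down to how a metastore column name is matched against the field names stored in the Parquet file. The sketch below shows one way such a lookup could behave. It is an illustration under assumptions, not the code from SPARK-25132 or SPARK-24716: `FieldResolver` and `resolve` are hypothetical names, and returning no match for an ambiguous name is a design choice made for this example only.

```scala
// Hypothetical helper (assumption: not Spark's actual implementation).
object FieldResolver {
  // Returns the Parquet field name matching `requested`, or None when there is
  // no match or, in case-insensitive mode, more than one candidate.
  def resolve(
      requested: String,
      parquetFields: Seq[String],
      caseSensitive: Boolean): Option[String] =
    if (caseSensitive) {
      parquetFields.find(_ == requested)
    } else {
      parquetFields.filter(_.equalsIgnoreCase(requested)) match {
        case Seq(single) => Some(single)
        case _           => None // missing or ambiguous, e.g. both "id" and "Id"
      }
    }
}
```

With `caseSensitive = false`, the metastore name `ID` resolves to the file's `id`; with `caseSensitive = true` it resolves to nothing, which corresponds to the NULL column and the empty filter result described above.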

       

      The biggest difference is that in Spark 2.1 the user gets an exception for the same query:

      Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!

      So users notice the problem and fix the query.

      In Spark 2.3, however, users silently get wrong results.

       

      To make the above query work, we need both SPARK-25132 and SPARK-24716.

       

      Yuming Wang, Wenchen Fan, Xiao Li, any thoughts? Should we backport it?

        Attachments

        1. image-2018-08-25-10-04-21-901.png (56 kB, yucai)
        2. image-2018-08-25-09-54-53-219.png (91 kB, yucai)
        3. image-2018-08-24-22-46-05-346.png (152 kB, yucai)
        4. image-2018-08-24-22-34-11-539.png (151 kB, yucai)
        5. image-2018-08-24-22-33-03-231.png (151 kB, yucai)
        6. pr22183.png (165 kB, yucai)
        7. image-2018-08-24-18-05-23-485.png (143 kB, yucai)

              People

              • Assignee: Unassigned
              • Reporter: yucai
              • Votes: 0
              • Watchers: 12
