Spark / SPARK-21997

Spark shows different results on char/varchar columns on Parquet

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.2, 2.1.1, 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

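      SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark still returns different results for the same table depending on the SQL configuration `spark.sql.hive.convertMetastoreParquet`. In the example below, the CHAR(10) column is padded to its declared length when the option is `false` and is not padded when it is `true`. Because the default of `spark.sql.hive.convertMetastoreParquet` is `true`, the result is wrong by default, so we should fix this.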

      scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
      scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
      scala> sql("SELECT * FROM t_char").show
      +---+---+
      |  a|  b|
      +---+---+
      |  a|  b|
      +---+---+
      
      scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
      
      scala> sql("SELECT * FROM t_char").show
      +----------+---+
      |         a|  b|
      +----------+---+
      |a         |  b|
      +----------+---+
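
The difference between the two outputs above comes down to whether the CHAR(10) value is right-padded with spaces to its declared length. The following is a minimal sketch in plain Scala, with no Spark dependency, of that padding rule. `CharPadding` and `padChar` are hypothetical names used only for illustration; they are not Spark or Hive APIs, and the sketch does not model how either reader is implemented.

```scala
// Hypothetical helper, not part of Spark or Hive: models fixed-width CHAR(n) padding only.
object CharPadding {
  /** Right-pads `value` with spaces up to `length` characters.
    * Values already at least `length` characters long are returned unchanged
    * (truncation or rejection of over-long values is not modeled here).
    */
  def padChar(value: String, length: Int): String =
    value.padTo(length, ' ')

  def main(args: Array[String]): Unit = {
    val raw    = "a"               // what the first query above shows for column a
    val padded = padChar(raw, 10)  // what the second query above shows for column a

    println(raw.length)            // length of the unpadded value
    println(padded.length)         // length after padding to CHAR(10)
    println(raw == padded)         // the two readings do not compare equal
    println(raw == padded.trim)    // they agree once trailing spaces are removed
  }
}
```

One consequence of this, under the assumptions above, is that an equality filter such as `a = 'a'` could match in one configuration and not in the other, which is why a consistent result matters more than which of the two representations is chosen.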
      

          Activity

          dongjoon Dongjoon Hyun added a comment -

          Hi, Xiao Li and Wenchen Fan.
          I'm wondering whether this is by design, because the behavior seems to have been present since 2.0.
          Since the result depends on a configuration, should we turn off `spark.sql.hive.convertMetastoreParquet` or fix this?

          dongjoon Dongjoon Hyun added a comment -

          I updated the title to focus on Parquet first.

          apachespark Apache Spark added a comment -

          User 'dongjoon-hyun' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19235


            People

            • Assignee: Unassigned
            • Reporter: dongjoon Dongjoon Hyun
            • Votes: 0
            • Watchers: 2
