[SPARK-22329] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None

      Description

      In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical issue.

      • SPARK-19611 switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 2.1.0 broke some Hive tables backed by case-sensitive data files.

        This situation will occur for any Hive table that wasn't created by Spark or that was created prior to Spark 2.1.0. If a user attempts to run a query over such a table containing a case-sensitive field name in the query projection or in the query filter, the query will return 0 results in every case.

      • However, SPARK-22306 reports that this mode also corrupts the Hive Metastore schema by removing bucketing information (BUCKETING_COLS, SORT_COLS) and changing the table owner.
      • Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS at least look okay there. However, we still need to figure out the owner-change issue, and we cannot backport the bucketing patch into `branch-2.2`. We need more testing before releasing 2.3.0.

      The Hive Metastore is a shared resource, and Spark should not corrupt it by default. This issue proposes reverting the default back to `NEVER_INFER`. Users who accept the risk can enable `INFER_AND_SAVE` themselves.
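      With `NEVER_INFER` as the default, users who still rely on case-sensitive schema inference would opt in explicitly. A minimal sketch of how that opt-in might look in a standard deployment (the property name and its accepted values `NEVER_INFER`, `INFER_AND_SAVE`, and `INFER_ONLY` come from Spark's SQL configuration; the file placement is an assumption about the user's setup):

      ```
      # conf/spark-defaults.conf (sketch): opt back in to schema inference,
      # accepting the metastore-modification risk described above
      spark.sql.hive.caseSensitiveInferenceMode  INFER_AND_SAVE
      ```

      The same setting can be passed per-session, e.g. `spark-shell --conf spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE`. `INFER_ONLY` infers the case-sensitive schema without writing it back to the metastore, which avoids the corruption at the cost of re-running inference.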

              People

              • Assignee: Unassigned
              • Reporter: Dongjoon Hyun
              • Votes: 0
              • Watchers: 2
