[SPARK-21158] SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Windows 10, IntelliJ, Scala

    Description

      When working with table names in Spark SQL, we have noticed some issues with case sensitivity.

      If you set spark.sql.caseSensitive to true, Spark SQL stores table names exactly as they were provided. This is correct.

      If you set spark.sql.caseSensitive to false, Spark SQL stores table names in lower case.
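
      For reference, a minimal sketch of how we toggle the flag (the local session setup below is only illustrative):

      import org.apache.spark.sql.SparkSession

      // Illustrative local session; any existing SparkSession behaves the same way.
      val spark = SparkSession.builder()
        .appName("case-sensitivity-demo")
        .master("local[*]")
        .getOrCreate()

      // spark.sql.caseSensitive is a runtime SQL conf; false is the default.
      spark.conf.set("spark.sql.caseSensitive", "false")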

      Then we use sqlContext.tableNames() to get all the tables in our database, and we check whether this list contains(<table name string>) to determine if we have already created a table. If case sensitivity is turned off (false), this check should find the table name in the list regardless of case.

      However, it only matches the lower-case version of the stored table name. Therefore, if you pass in a camel-case or upper-case table name, the check returns false even though the table does exist.

      The root cause of this issue is in the function SparkSession.Catalog.ListTables().

      For example:
      In your SQL context you have the following tables, and you have chosen spark.sql.caseSensitive=false, so your tables are stored in lower case:
      carnames
      carmodels
      carnamesandmodels
      users
      dealerlocations

      When running your pipeline, you want to check whether you have already created the temporary join table 'carnamesandmodels'. However, for readability you have stored its name in a constant that reads CarNamesAndModels.

      So you call:
      sqlContext.tableNames().contains("CarNamesAndModels")
      This should return true, because we know the table has already been created, but it currently returns false since "CarNamesAndModels" is not lower case.
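
      A short Scala reproduction sketch of this scenario (the SparkSession setup above is assumed, and the table contents are arbitrary):

      // Case-insensitive mode: the catalog lower-cases identifiers on registration.
      spark.conf.set("spark.sql.caseSensitive", "false")

      // Register the temp table under a camel-case name.
      spark.range(10).toDF("id").createOrReplaceTempView("CarNamesAndModels")

      // The catalog lists it in lower case ...
      spark.sqlContext.tableNames().foreach(println)   // includes: carnamesandmodels

      // ... so a literal membership check against the original constant fails.
      spark.sqlContext.tableNames().contains("CarNamesAndModels")   // false, yet the table exists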

      The responsibility for lower-casing the name passed into .contains should not be put on the Spark user. This should be done by Spark SQL when case sensitivity is turned off.

      Proposed solutions:

      • Setting spark.sql.caseSensitive=false should make the SQL context agnostic to case without changing how the table name is stored.
      • There should be a custom contains method for ListTables() which converts the table name to lower case before checking (a user-side version of this check is sketched below).
      • SparkSession.Catalog.ListTables() should return the list of tables in the format they were provided instead of all lower case.
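
      As a stop-gap on the user side (not the proposed fix; the helper name below is made up for illustration), the check can be made case-insensitive like this:

      import org.apache.spark.sql.SparkSession

      // Hypothetical helper: case-insensitive lookup against the catalog listing.
      // The proposals above argue that Spark SQL itself should honour
      // spark.sql.caseSensitive=false here instead of pushing this onto users.
      def tableExistsIgnoreCase(spark: SparkSession, name: String): Boolean =
        spark.sqlContext.tableNames().exists(_.equalsIgnoreCase(name))

      // Usage:
      // tableExistsIgnoreCase(spark, "CarNamesAndModels")   // true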


          People

            Assignee: Unassigned
            Reporter: Kathryn McClintic (kamcclin)
            Votes: 0
            Watchers: 4


              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified