SPARK-22457

Tables are supposed to be MANAGED only taking into account whether a path is provided


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      As far as I know, since Spark 2.2 a table is classified as MANAGED based only on whether a path is provided:

      val tableType = if (storage.locationUri.isDefined) {
        CatalogTableType.EXTERNAL
      } else {
        CatalogTableType.MANAGED
      }
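
      For illustration, a minimal sketch (run in spark-shell with the elasticsearch-hadoop connector on the classpath; the table name, index and options are made up) of how a table backed by elasticsearch ends up MANAGED simply because no path was given:

      // Hypothetical reproduction: no path/LOCATION option is supplied, so by the rule above
      // the table is classified MANAGED even though its data lives in elasticsearch.
      spark.sql(
        """CREATE TABLE clients
          |USING org.elasticsearch.spark.sql
          |OPTIONS (resource 'shop/clients', nodes 'localhost')""".stripMargin)

      // The "Type" row of the output shows MANAGED because storage.locationUri was empty.
      spark.sql("DESCRIBE FORMATTED clients").show(100, truncate = false)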
      

      This solution seems right for filesystem-based data sources. However, when working with other data sources such as elasticsearch, it leads to the odd behaviour described below:

      1) InMemoryCatalog's doCreateTable() adds a locationUri if the table is CatalogTableType.MANAGED and tableDefinition.storage.locationUri.isEmpty.

      2) Before loading the data source table, FindDataSourceTable's readDataSourceTable() adds a path option if the locationUri is defined:

      val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_))
      

      3) This causes an error when reading from elasticsearch, because 'path' is an option the elasticsearch connector already supports (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients):

      org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL
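
      Putting the three steps together, a minimal sketch of the failing read (same hypothetical table as in the example above):

      // Step 1 assigned the MANAGED table a default locationUri under the warehouse directory,
      // step 2 turned that locationUri into a "path" option, and in step 3 the elasticsearch
      // connector treats that "path" as the resource to read, so the query fails with the
      // EsHadoopIllegalArgumentException shown above instead of reading shop/clients.
      spark.sql("SELECT * FROM clients").show()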

      Would it be possible to mark tables as MANAGED only for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or to consider some other solution?
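
      As a rough illustration of that idea only (the provider check below is an assumption, not existing Spark code, and it assumes the provider name is available as provider: Option[String] at that point):

      // Hypothetical variant: a table without an explicit location is MANAGED only when its
      // provider belongs to the subset named above; anything else (e.g. elasticsearch) stays
      // EXTERNAL, so Spark never assigns it a filesystem location.
      val managedProviders = Set("text", "csv", "json", "jdbc", "parquet", "orc", "hive")

      val tableType =
        if (storage.locationUri.isDefined) {
          CatalogTableType.EXTERNAL
        } else if (provider.exists(p => managedProviders.contains(p.toLowerCase))) {
          CatalogTableType.MANAGED
        } else {
          CatalogTableType.EXTERNAL
        }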

      P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which from my point of view should only be required for filesystem-based data sources:

      if (tableMeta.tableType == CatalogTableType.MANAGED) {
        ...
        // Delete the data/directory of the table
        val dir = new Path(tableMeta.location)
        try {
          val fs = dir.getFileSystem(hadoopConfig)
          fs.delete(dir, true)
        } catch {
          case e: IOException =>
            throw new SparkException(s"Unable to drop table $table as failed " +
              s"to delete its directory $dir", e)
        }
      }
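
      In the same spirit, a sketch of how the delete in doDropTable() could be guarded (again, the provider check and the set of file-based formats are assumptions, not existing Spark code):

      // Hypothetical guard: only delete the directory when the provider actually stores its data
      // on a filesystem; for connectors such as elasticsearch there is no directory to delete.
      val fileBasedProviders = Set("text", "csv", "json", "parquet", "orc", "hive")
      val isFileBased = tableMeta.provider.exists(p => fileBasedProviders.contains(p.toLowerCase))

      if (tableMeta.tableType == CatalogTableType.MANAGED && isFileBased) {
        // Delete the data/directory of the table, exactly as the existing code above does
        val dir = new Path(tableMeta.location)
        try {
          val fs = dir.getFileSystem(hadoopConfig)
          fs.delete(dir, true)
        } catch {
          case e: IOException =>
            throw new SparkException(s"Unable to drop table $table as failed " +
              s"to delete its directory $dir", e)
        }
      }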
      

          People

            Assignee: Unassigned
            Reporter: David Arroyo (darroyocazorla)
            Votes: 1
            Watchers: 4
