Spark / SPARK-48423

Unable to write MLPipeline to blob storage using .option attribute


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.4.3
    • Fix Version/s: None
    • Component/s: ML, MLlib, Spark Core
    • Labels: None

    Description

      I am trying to write an MLlib pipeline (with a series of stages set on it) to Azure Blob Storage, passing the relevant write parameters, but it still complains that `fs.azure.account.key` is not found in the configuration.

      Sharing the code:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.StringIndexer
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()

      import spark.implicits._

      val df = spark.createDataFrame(Seq(
        (0L, "a b c d e spark"),
        (1L, "b d")
      )).toDF("id", "text")

      val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
      val pipeline = new Pipeline().setStages(Array(si))
      val pipelineModel = pipeline.fit(df)
      val path = BLOB_STORAGE_PATH

      pipelineModel.write
        .option("spark.hadoop.fs.azure.account.key.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.key.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "__")
        .save(path)

       

      The error that I get is:

      Failure to initialize configuration
      Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
          at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
          at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
          at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449)

      This shows that even though the key/value pair for

      spark.hadoop.fs.azure.account.key.<account_name>.dfs.core.windows.net

      is being sent via the option parameter, it is not being set internally.

       

      Writing works only if I explicitly set the values via

      spark.conf.set(key, value)

      which might be problematic for a multi-tenant solution where tenants share the same Spark context.
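
      For reference, this is a minimal sketch of that session-level workaround; the account name, `accountKey`, and `path` are placeholders, not values from the original report:

      // Session-level workaround: set the ABFS account key on the shared
      // SparkSession's configuration before saving. This is the variant that
      // currently works, but every tenant sharing this SparkContext then
      // sees the same credentials.
      val accountKey: String = sys.env("AZURE_ACCOUNT_KEY") // placeholder source
      spark.conf.set(
        "fs.azure.account.key.<account_name>.dfs.core.windows.net",
        accountKey)

      // With the key set globally, the pipeline write succeeds without
      // any .option(...) calls on the writer.
      pipelineModel.write.save(path)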

      One other observation: for DataFrames,

      df.write.option(key1, value1).option(key2, value2).save(path)

      fails with the same key error, while

      val map = Map(key1 -> value1, key2 -> value2)
      df.write.options(map).save(path)

      works.
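
      Spelled out, the two DataFrame call styles being contrasted are the following (the keys and values are placeholders for the `fs.azure.*` settings above):

      // Per-option chaining on DataFrameWriter: fails with the
      // fs.azure.account.key error described above.
      df.write
        .option(key1, value1)
        .option(key2, value2)
        .save(path)

      // Bulk options via a Map on the same writer: succeeds with the
      // identical keys and values.
      val map = Map(key1 -> value1, key2 -> value2)
      df.write
        .options(map)
        .save(path)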

       

      Help required on: similar to how the DataFrame `options` call

      df.write.options(Map(key -> value))

      helps to set the configuration, `.option(key1, value1)` should also work for writing to Azure Blob Storage.

       

          People

            Assignee: Unassigned
            Reporter: Chhavi Bansal