Spark / SPARK-48423

Unable to write MLPipeline to blob storage using .option attribute


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 3.4.3
    • Fix Version/s: None
    • Component/s: ML, MLlib, Spark Core
    • Labels: None

    Description

      I am trying to write an MLlib pipeline (with a series of stages set on it) to Azure Blob Storage, passing the relevant write parameters, but it still complains that `fs.azure.account.key` is not found in the configuration.

      Sharing the code:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.StringIndexer
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("main").master("local[4]").getOrCreate()

      import spark.implicits._

      val df = spark.createDataFrame(Seq(
        (0L, "a b c d e spark"),
        (1L, "b d")
      )).toDF("id", "text")

      val si = new StringIndexer().setInputCol("text").setOutputCol("IND_text")
      val pipeline = new Pipeline().setStages(Array(si))
      val pipelineModel = pipeline.fit(df)
      val path = BLOB_STORAGE_PATH

      pipelineModel.write
        .option("spark.hadoop.fs.azure.account.key.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.key.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.endpoint.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.auth.type.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth2.client.secret.<account_name>.dfs.core.windows.net", "__")
        .option("fs.azure.account.oauth.provider.type.<account_name>.dfs.core.windows.net", "__")
        .save(path)

       

      The error that I get is:

      Failure to initialize configuration
      Caused by: InvalidConfigurationValueException: Invalid configuration value detected for fs.azure.account.key
          at org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
          at org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:548)
          at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1449)

      This shows that even though the key/value pair for

      spark.hadoop.fs.azure.account.key.<account_name>.dfs.core.windows.net

      is being sent via the option parameter, it is not being set internally.

       

      Writing works only if I explicitly set the values via

      spark.conf.set(key, value)

      which might be problematic for a multi-tenant solution where tenants share the same Spark context.
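
      For reference, this is a minimal sketch of that session-level workaround; the account name, `accountKey`, and `path` are placeholders, not values from the original report:

      // Session-level workaround: set the ABFS account key on the shared
      // SparkSession's configuration before saving. This is the variant that
      // currently works, but every tenant sharing this SparkContext then
      // sees the same credentials.
      val accountKey: String = sys.env("AZURE_ACCOUNT_KEY") // placeholder source
      spark.conf.set(
        "fs.azure.account.key.<account_name>.dfs.core.windows.net",
        accountKey)

      // With the key set globally, the pipeline write succeeds without
      // any .option(...) calls on the writer.
      pipelineModel.write.save(path)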

      One other observation: for DataFrames,

      df.write.option(key1, value1).option(key2, value2).save(path)

      fails with the same key error, while

      val map = Map(key1 -> value1, key2 -> value2)
      df.write.options(map).save(path)

      works.
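
      Spelled out, the two DataFrame call styles being contrasted are the following (the keys and values are placeholders for the `fs.azure.*` settings above):

      // Per-option chaining on DataFrameWriter: fails with the
      // fs.azure.account.key error described above.
      df.write
        .option(key1, value1)
        .option(key2, value2)
        .save(path)

      // Bulk options via a Map on the same writer: succeeds with the
      // identical keys and values.
      val map = Map(key1 -> value1, key2 -> value2)
      df.write
        .options(map)
        .save(path)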

       

      Help required on: similar to how the DataFrame `options` call

      df.write.options(Map(key -> value))

      helps to set the configuration, `.option(key1, value1)` should also work for writing to Azure Blob Storage.

       

          People

            Assignee: Unassigned
            Reporter: Chhavi Bansal