The current V2 Datasource API supports querying a portion of the Spark configuration namespace (spark.datasource.*) via the SessionConfigSupport API. This was designed on the assumption that each v2 data source's configuration is independent of every other data source's.
Unfortunately, some cross-cutting concerns, such as authentication, touch multiple data sources, so common configuration items need to be shared among them.
In particular, Kerberos setup can use the following configuration items:
- userPrincipal, spark configuration: spark.yarn.principal
- userKeytabPath, spark configuration: spark.yarn.keytab
- krb5ConfPath: java.security.krb5.conf
- kerberos debugging flag: sun.security.krb5.debug
- JAAS config: java.security.auth.login.config ??
- ZKServerPrincipal ??
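As a rough illustration of what "shared" Kerberos configuration looks like in practice, the sketch below gathers the items listed above into one map. It is only a sketch: `KerberosSettings` and `collect` are hypothetical names, plain maps stand in for the Spark configuration and the JVM system properties, the JAAS key is included even though the list marks it as uncertain, and ZKServerPrincipal is left out because no key is given for it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Spark.
final class KerberosSettings {
    // Keys expected in the Spark configuration (from the list above).
    static final List<String> SPARK_KEYS =
        Arrays.asList("spark.yarn.principal", "spark.yarn.keytab");

    // Keys expected as JVM system properties (from the list above).
    // The JAAS entry is marked "??" in the list, so treat it as tentative.
    static final List<String> JVM_KEYS = Arrays.asList(
        "java.security.krb5.conf",
        "sun.security.krb5.debug",
        "java.security.auth.login.config");

    // Copies only the Kerberos-related entries that are actually set.
    static Map<String, String> collect(Map<String, String> sparkConf,
                                       Map<String, String> jvmProps) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : SPARK_KEYS) {
            if (sparkConf.containsKey(key)) {
                out.put(key, sparkConf.get(key));
            }
        }
        for (String key : JVM_KEYS) {
            if (jvmProps.containsKey(key)) {
                out.put(key, jvmProps.get(key));
            }
        }
        return out;
    }
}
```

Unrelated entries such as spark.executor.memory are dropped, which is the main difference from handing over the whole configuration map.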
Potential ways to pass this information to the various data sources are:
- Pass the entire SparkContext object to data sources (not likely)
- Pass the entire SparkConfig Map object to data sources
- Pass all required configuration via environment variables
- Extend SessionConfigSupport to support passing specific white-listed configuration values
- Add a specific data source v2 API "SupportsKerberos" so that a data source can indicate that it supports Kerberos and also provide the means to pass needed configuration info.
- Expand out all Kerberos configuration items to be in each data source config namespace that needs it.
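To make the "SupportsKerberos" option more concrete, here is a minimal sketch of how such a mix-in might be wired up. No such interface exists in Spark; the method name `setKerberosConfig`, the `ExampleSource` class, and the `KerberosWiring` helper are all assumptions made for illustration, and a plain map stands in for whatever configuration type the real API would use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mix-in: a data source implements it to opt in to Kerberos settings.
interface SupportsKerberos {
    // Called by the engine with the Kerberos-related configuration values.
    void setKerberosConfig(Map<String, String> kerberosConf);
}

// Hypothetical data source that simply records what it was given.
final class ExampleSource implements SupportsKerberos {
    private Map<String, String> kerberosConf = new HashMap<>();

    @Override
    public void setKerberosConfig(Map<String, String> kerberosConf) {
        // Defensive copy so later changes by the caller are not observed.
        this.kerberosConf = new HashMap<>(kerberosConf);
    }

    String principal() {
        return kerberosConf.get("spark.yarn.principal");
    }
}

// Hypothetical engine-side helper.
final class KerberosWiring {
    // Passes the settings only to sources that declare Kerberos support.
    // Returns true if the source accepted the configuration.
    static boolean configure(Object source, Map<String, String> kerberosConf) {
        if (source instanceof SupportsKerberos) {
            ((SupportsKerberos) source).setKerberosConfig(kerberosConf);
            return true;
        }
        return false;
    }
}
```

Compared with passing the whole configuration map, this keeps the opt-in explicit: sources that do not implement the mix-in never see the principal or keytab path.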
If the data source requires TLS support, then we also need to support passing all the configuration values under "spark.ssl.*".
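Passing everything under "spark.ssl.*" amounts to a prefix filter over the configuration. The sketch below shows one way that could look; `SslConfig` and `extract` are hypothetical names, and a plain map stands in for the Spark configuration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Spark.
final class SslConfig {
    static final String SSL_PREFIX = "spark.ssl.";

    // Returns every entry whose key starts with "spark.ssl.", keys unchanged.
    static Map<String, String> extract(Map<String, String> sparkConf) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> entry : sparkConf.entrySet()) {
            if (entry.getKey().startsWith(SSL_PREFIX)) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }
}
```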