A few of us had a phone call this morning. We briefly discussed a design for this, summarized below:
- The metastore should make use of the delegation token facilities in Hadoop Common. The classes in Common are already generic since they're used by both MR and HDFS for their delegation token types.
- The metastore needs to keep track of active delegation tokens across restarts - it probably makes sense to use the existing DB backing store for this.
- The metastore thrift API will need a new call, something like binary getDelegationToken(1: string renewer), which returns the opaque token.
- We'll need to make some changes to HadoopThriftAuthBridge from HIVE-842 in order to support using a delegation token over SASL.
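As a rough illustration of the proposed API, the new call might look like the following in the metastore's Thrift IDL. This is only a sketch: the service name matches the existing ThriftHiveMetastore service, but the exact method name, signature, and any extra parameters (e.g. token kind) are still open.

```thrift
service ThriftHiveMetastore {
  // Sketch only. Returns an opaque, serialized delegation token that the
  // caller can later present back to the metastore over SASL to
  // authenticate without Kerberos credentials.
  binary getDelegationToken(1: string renewer)
}
```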
In terms of the use cases above, here are some thoughts on how the delegation tokens will be used:
MR tasks reporting statistics
When a Hive job is submitted, it will first obtain a DT from the Hive metastore. This DT will be passed with the job, either as a private DistributedCache file, or perhaps base64-encoded in the jobconf itself. The MR tasks themselves will then load the token into the UGI before making calls. This is basically the pattern that normal Hadoop MR jobs use to access HDFS from within a task.
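The jobconf variant of this hand-off can be sketched as a simple round trip. The real code would be Java using Hadoop's Configuration and UserGroupInformation; this is a conceptual Python simulation, and the conf key name and helper functions are invented for illustration.

```python
import base64

# Hypothetical conf key; the real implementation would choose its own name.
TOKEN_CONF_KEY = "hive.metastore.delegation.token"

def attach_token_to_jobconf(jobconf, opaque_token):
    """Submitter side: base64-encode the opaque token bytes into the jobconf,
    since the conf can only carry strings."""
    jobconf[TOKEN_CONF_KEY] = base64.b64encode(opaque_token).decode("ascii")

def load_token_in_task(jobconf):
    """Task side: recover the raw token bytes; the real task would then
    deserialize these and add the token to the current UGI."""
    return base64.b64decode(jobconf[TOKEN_CONF_KEY])

# Round trip: what the submitter writes is exactly what the task reads back.
conf = {}
token = b"\x00opaque-token-bytes\xff"  # stands in for the serialized DT
attach_token_to_jobconf(conf, token)
assert load_token_in_task(conf) == token
```

The DistributedCache variant would be the same idea with the token written to a file readable only by the job owner, which avoids exposing the token in the (potentially world-readable) jobconf.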
Oozie or Hive Server jobs
Before Oozie or Hive Server forks the child process which actually runs the job, it will need to obtain a delegation token from the metastore on behalf of the user running the job. It will then provide this to the child process using an environment variable or configuration property. In this case, Oozie or the Hive Server needs to be configured as a "proxy superuser" on the metastore - i.e. the oozie/_HOST or hiveserver/_HOST principal is allowed to impersonate other users in order to obtain delegation tokens for them.
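For reference, Hadoop's existing proxy-user convention expresses this kind of impersonation grant in core-site.xml as shown below. Whether the metastore reuses these exact keys (and the host/group values here) is an assumption for illustration, not a settled decision.

```xml
<!-- core-site.xml on the metastore side: allow the oozie principal to
     impersonate users from the given hosts/groups. Values are examples. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value>
</property>
```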