When node labels are in use, yarn.node-labels.fs-store.root-dir is set to an hdfs:// path, and the cluster is using Kerberos, the RM fails to start while trying to unmarshal the label store. The following error/stack trace is observed:
I think this is a startup ordering issue: the scheduler is initialized before the RM primes the cred cache. My reasoning is based on what happens when I don't set the yarn.node-labels.fs-store.root-dir property, in which case no HDFS interaction happens when the scheduler initializes. Here is the relevant snippet from the log:
You can see that the scheduler initializes, and only then does the cred cache get primed. This results in a successful RM start, but of course my HDFS-backed labels are not loaded.
I think that if the cred cache were initialized before the scheduler, this error would not happen.
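To make the suspected ordering dependency concrete, here is a minimal toy model in plain Java. It does not use any Hadoop classes: CredCache, initScheduler, and start are hypothetical stand-ins for the Kerberos login, the scheduler initialization that reads the label store, and the RM startup sequence. It only illustrates my hypothesis, not the actual RM code path.

```java
import java.util.ArrayList;
import java.util.List;

public class StartupOrderSketch {

    /** Hypothetical stand-in for the Kerberos credential cache (not a Hadoop class). */
    static final class CredCache {
        private boolean primed = false;
        void prime() { primed = true; }
        boolean isPrimed() { return primed; }
    }

    /**
     * Hypothetical stand-in for scheduler initialization. It only needs
     * credentials when the label store lives on HDFS.
     */
    static void initScheduler(CredCache cache, boolean labelsOnHdfs, List<String> log) {
        if (labelsOnHdfs && !cache.isPrimed()) {
            throw new IllegalStateException("no Kerberos credentials for label store");
        }
        log.add("scheduler initialized");
    }

    /** Runs the two startup steps in the requested order and records what happened. */
    static List<String> start(boolean loginFirst, boolean labelsOnHdfs) {
        List<String> log = new ArrayList<>();
        CredCache cache = new CredCache();
        if (loginFirst) {
            cache.prime();
            log.add("cred cache primed");
        }
        initScheduler(cache, labelsOnHdfs, log);
        if (!loginFirst) {
            cache.prime();
            log.add("cred cache primed");
        }
        return log;
    }

    public static void main(String[] args) {
        // Observed order, no HDFS label store: starts, but no labels are loaded.
        System.out.println(start(false, false));
        // Proposed order with an HDFS label store: starts.
        System.out.println(start(true, true));
        // Observed order with an HDFS label store: fails during scheduler init.
        try {
            start(false, true);
        } catch (IllegalStateException e) {
            System.out.println("RM start failed: " + e.getMessage());
        }
    }
}
```

In this model, only the combination of "scheduler first" and "labels on HDFS" fails, which matches what I see on the cluster.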