[SPARK-44976] Preserve full principal user name on executor side


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.3, 3.3.3, 3.4.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      SPARK-6558 changed the behavior of Utils.getCurrentUserName() to use the short name instead of the full principal name.
      Because of this, the hadoop.security.auth_to_local rules on the non-kerberized HDFS NameNode side are not respected.
      For example, I use two HDFS clusters: one is kerberized, the other is not.
      On the non-kerberized cluster I have a rule that adds a prefix to the username whenever someone accesses it from the kerberized cluster.

        <property>
          <name>hadoop.security.auth_to_local</name>
          <value xml:space="preserve">
      RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
      RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
      DEFAULT</value>
        </property>
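As an illustration, here is a minimal Python emulation of how the RULE above maps a kerberized principal. This only mimics the sed-style substitution for this one rule; Hadoop's real evaluator is org.apache.hadoop.security.authentication.util.KerberosName, and the function below is purely illustrative.

```python
import re

def apply_rule(principal):
    """Simplified emulation of RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/
    followed by DEFAULT (illustration only, not Hadoop's parser)."""
    name, realm = principal.split("@", 1)
    formatted = f"{name}@{realm}"            # what the [1:$1@$0] format produces
    if re.fullmatch(r".*@EXAMPLE\.COM", formatted):
        # apply the sed-style substitution s/(.+)@.*/_ex_$1/
        return re.sub(r"(.+)@.*", r"_ex_\1", formatted)
    return name                              # DEFAULT: just strip the realm

print(apply_rule("eub@EXAMPLE.COM"))   # -> _ex_eub
```

So a full principal such as eub@EXAMPLE.COM is mapped to _ex_eub, but only if the full principal (with realm) reaches the rule engine in the first place.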
      

      However, if I submit a Spark job with the keytab & principal options, the ownership of the resulting HDFS directories and files is not coherent.

      (Some names are changed for privacy.)

      $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23
      Found 52 items
      -rw-rw-rw-   3 _ex_eub hdfs          0 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/_SUCCESS
      -rw-r--r--   3 eub      hdfs  134418857 2023-05-11 00:15 hdfs:///user/eub/some/path/20230510/23/part-00000-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
      -rw-r--r--   3 eub      hdfs  153410049 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00001-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
      -rw-r--r--   3 eub      hdfs  157260989 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00002-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
      -rw-r--r--   3 eub      hdfs  156222760 2023-05-11 00:16 hdfs:///user/eub/some/path/20230510/23/part-00003-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz
      

      Another interesting point: if I submit a Spark job without the keytab and principal options, but authenticate with Kerberos via kinit, the hadoop.security.auth_to_local rule is not applied at all.

      $ hdfs dfs -ls  hdfs:///user/eub/output/
      Found 3 items
      -rw-rw-r--+  3 eub hdfs          0 2023-08-25 12:31 hdfs:///user/eub/output/_SUCCESS
      -rw-rw-r--+  3 eub hdfs        512 2023-08-25 12:31 hdfs:///user/eub/output/part-00000.gz
      -rw-rw-r--+  3 eub hdfs        574 2023-08-25 12:31 hdfs:///user/eub/output/part-00001.gz
      

      I finally found that the UGI differs depending on whether the job is submitted with the --principal and --keytab options.
      (Refer to https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905.)

      Only the file (_SUCCESS) and the output directory created by the driver (application master side) respect hadoop.security.auth_to_local on the non-kerberized NameNode, and only when the --principal and --keytab options are provided.

      No matter whether HDFS files or directories are created by an executor or by the driver, they should respect the hadoop.security.auth_to_local rules, and their ownership should be the same.
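To make the mechanism concrete, here is a hedged Python sketch of the behaviour SPARK-6558 introduced in Utils.getCurrentUserName(): return SPARK_USER if set, otherwise the UGI short name. The real code is Scala and uses UserGroupInformation.getShortUserName (which applies auth_to_local on the kerberized side); the realm-stripping below is a simplification for illustration.

```python
def get_current_user_name(ugi_full_name, env=None):
    """Sketch of Utils.getCurrentUserName() after SPARK-6558 (illustrative).
    The full principal (and with it the non-kerberized NameNode's
    auth_to_local mapping) is lost once only the short name survives."""
    env = env or {}
    short_name = ugi_full_name.split("@", 1)[0]  # short name drops the realm
    return env.get("SPARK_USER", short_name)

# With --principal eub@EXAMPLE.COM, the executor still ends up as "eub":
print(get_current_user_name("eub@EXAMPLE.COM"))                            # -> eub
# The workaround below forces the mapped name instead:
print(get_current_user_name("eub@EXAMPLE.COM", {"SPARK_USER": "_ex_eub"}))  # -> _ex_eub
```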

      A workaround is to pass an additional configuration to change SPARK_USER on the executor side,
      e.g. --conf spark.executorEnv.SPARK_USER=_ex_eub

      Note that --conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub causes an error: there is logic that appends application-master environment values using : (colon) as a separator.
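The appMasterEnv failure mode can be sketched with a toy emulation of classpath-style environment merging. This is hypothetical code, not Spark's or YARN's actual implementation; it only shows why appending with a colon separator produces an invalid user name when the variable is already set.

```python
# Toy emulation (hypothetical): some env keys are *appended* with a colon
# separator, classpath-style, instead of being replaced.
def merge_env(env, key, value, sep=":"):
    env[key] = env[key] + sep + value if key in env else value
    return env

env = {"SPARK_USER": "eub"}           # value already set before the merge
merge_env(env, "SPARK_USER", "_ex_eub")
print(env["SPARK_USER"])              # -> "eub:_ex_eub", not a valid user name
```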


          People

            Assignee: Unassigned
            Reporter: YUBI LEE (eub)
