Hadoop Map/Reduce
MAPREDUCE-5663

Add an interface to Input/Output Formats to obtain delegation tokens

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      Currently, delegation tokens are obtained as part of the getSplits / checkOutputSpecs calls to the InputFormat / OutputFormat respectively.

      This works as long as the splits are generated on a node with kerberos credentials. For split generation elsewhere (AM for example), an explicit interface is required.

      1. MAPREDUCE-5663.6.txt
        39 kB
        Siddharth Seth
      2. MAPREDUCE-5663.5.txt
        38 kB
        Siddharth Seth
      3. MAPREDUCE-5663.4.txt
        38 kB
        Siddharth Seth
      4. MAPREDUCE-5663.patch.txt3
        12 kB
        Michael Weng
      5. MAPREDUCE-5663.patch.txt2
        13 kB
        Michael Weng
      6. MAPREDUCE-5663.patch.txt
        7 kB
        Michael Weng

        Activity

        Siddharth Seth added a comment -

        I think a URI[] (all of the same scheme) being passed to the corresponding CredentialsProvider impl should be enough, no?

        As long as it's possible to represent services as URIs, which seems possible - it'll likely just be for the purpose of fetching credentials for hbase, hcat etc.

        Alejandro Abdelnur added a comment -

        From what I understand, for HBase, or HDFS HA & Yarn HA, it is the corresponding client library that resolves the real host, so this would be taken care of by using that client library (hbase, hdfs, yarn) from within the CredentialsProvider implementation for that service. I think a URI[] (all of the same scheme) being passed to the corresponding CredentialsProvider impl should be enough, no?

        Devaraj Das added a comment -

        Haven't read the issue in detail yet, but a quick comment on what Siddharth Seth said. In HBase, we have the client talking to ZK (preconfigured quorum), and it locates the master from a specific znode (also preconfigured). The URI could be faked there (either as hbase:// and reading the ZK data from the configuration that is available in the context, or as hbase://quorum/znode as Enis Soztutar just noted).

        Siddharth Seth added a comment -

        I'm not against this, just not sure how existing services function if they're asked for tokens with security disabled. HDFS, afaik, works just fine.

        ... JobSubmitter#populateTokenCache() method which is called by JobSubmitter#submitJobInternal() ...

        All of these methods are invoked, but end up calling TokenCache.obtainTokensForNameNodeInternal - which short circuits the fetch based on the security settings.

        That's an interesting proposal.
        Each application would have to figure out how it gets this list of URIs. MR, if it chooses, can have Input/OutputFormats implement an interface to retrieve URIs instead of getting the Credentials directly.

        Spoke to Devaraj Das about how this could work for HBase - I'm not sure if all systems which provide credentials can have their information represented as a URI. In the case of HBase, I believe this is a quorum, which is available in the Configuration. For HBase, this could potentially be faked by setting the URI as hbase://

        Alternately, this could accept a list of Strings instead of URIs, or even a <String, URI> pair - where the first part represents the provider, and the second one is the URI - if applicable.

        Alejandro Abdelnur added a comment -

        ... I’m not too sure about - mainly from the perspective of services not handling getToken requests correctly if security is disabled

        We are moving away from this, in Yarn we always use tokens, regardless of the security configuration. Oozie needs tokens to be there in order to work correctly.

        ... The JobClient currently doesn't do this, at least for HDFS.

        Actually, yes it does do this if you set the MRJobConfig.JOB_NAMENODES property; this is done in the JobSubmitter#populateTokenCache() method, which is called by JobSubmitter#submitJobInternal(), which is called by JobSubmitter#submit(). All this is done in the main execution path, thus always done when doing a submit. It is independent of split computations.

        ... For HBase / HCatalog sources which are outside of the IF/OF for a MR job - I don't think we have the capability for fetching tokens, and rely on the user providing them up front.

        Actually, we are fetching them upfront only because this was needed for MR jobs, but MR shouldn’t be a special case. Oozie has the concept of CredentialsProvider for this very same reason. And I think with this JIRA we can fix this in a general case.

        ... Would this utility class know how to handle all kinds of URIs ?

        Yes, based on registered handlers for different schemes, more on this follows.

        My thinking on how to address this is to use the same pattern we are doing today for loading/registering FileSystem, CompressionCodec, TokenRenewers, SecurityInfo implementations. Using JDK’s ServiceLoader mechanism to load all available implementations of the following interface:

        /**
         * Implementations must be thread-safe.
         */
        public interface CredentialsProvider {

          /**
           * Reports the scheme supported by this provider.
           */
          public String getScheme();

          /**
           * Obtains delegation tokens for the provided URIs.
           *
           * @param conf configuration used to initialize the components that connect to the specified URIs.
           * @param uris URIs of services to obtain delegation tokens from.
           * @param targetCredentials credentials to which the fetched delegation tokens are added.
           */
          public void obtainCredentials(Configuration conf, URI[] uris, Credentials targetCredentials) throws IOException;
        }

        Then we would have a CredentialsProvider class that would use a ServiceLoader to load all CredentialsProvider implementations available on the classpath (the nice thing about the ServiceLoader mechanism is that you drop in a JAR file with a service implementation and you don't have to configure anything; it just works, provided you have the META-INF/services/... file for it). This would be done in a class static initialization block.

        The CredentialsProvider class would have a static method fetchCredentials(Configuration, URI[], Credentials) which sorts the URIs by scheme and then invokes the corresponding CredentialsProvider impl for each.

        Then the different Yarn applications define a property in the conf to indicate the URIs of the services to get tokens from, and their client submission code does it (like the JobSubmitter does with MRJobConfig.JOB_NAMENODES, but in a general way). Frameworks may choose to be smarter (in the case of MR, get the URIs from the splits and the output dir and get the tokens automatically).
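
        As a rough sketch of the loader/dispatcher described above (assuming the CredentialsProvider interface from the previous block; the class name, the unknown-scheme handling, and the META-INF/services wiring are illustrative, not code from the attached patches):

        import java.io.IOException;
        import java.net.URI;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.ServiceLoader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.security.Credentials;

        public final class CredentialsProviderFactory {

          // Implementations discovered via META-INF/services at class-load time,
          // keyed by the URI scheme each provider reports.
          private static final Map<String, CredentialsProvider> PROVIDERS =
              new HashMap<String, CredentialsProvider>();

          static {
            for (CredentialsProvider provider : ServiceLoader.load(CredentialsProvider.class)) {
              PROVIDERS.put(provider.getScheme(), provider);
            }
          }

          private CredentialsProviderFactory() {
          }

          // Sorts the URIs by scheme and hands each group to the provider registered
          // for that scheme, accumulating the fetched tokens in targetCredentials.
          public static void fetchCredentials(Configuration conf, URI[] uris,
              Credentials targetCredentials) throws IOException {
            Map<String, List<URI>> urisByScheme = new HashMap<String, List<URI>>();
            for (URI uri : uris) {
              List<URI> group = urisByScheme.get(uri.getScheme());
              if (group == null) {
                group = new ArrayList<URI>();
                urisByScheme.put(uri.getScheme(), group);
              }
              group.add(uri);
            }
            for (Map.Entry<String, List<URI>> entry : urisByScheme.entrySet()) {
              CredentialsProvider provider = PROVIDERS.get(entry.getKey());
              if (provider == null) {
                throw new IOException("No CredentialsProvider registered for scheme: " + entry.getKey());
              }
              provider.obtainCredentials(conf, entry.getValue().toArray(new URI[0]), targetCredentials);
            }
          }
        }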

        Alejandro Abdelnur added a comment -

        Planning to comment later this morning. Sorry, yesterday I got caught up on different things.

        Siddharth Seth added a comment -

        Alejandro Abdelnur, do you have reservations about this approach, where components which need to provide Credentials implement this interface? Otherwise, I'd like to move forward with this jira.

        Siddharth Seth added a comment -

        DelegationTokens should be always requested by the client, security enabled or not, computing the splits on the client or not.

        I think the client requesting the required tokens is necessary (directly or indirectly). Whether this is done independent of security is something I'm not too sure about - mainly from the perspective of services not handling getToken requests correctly if security is disabled. The JobClient currently doesn't do this, at least for HDFS.

        DelegationTokens fetching should be done regardless of the IF/OF implementation (take the case of talking with Hbase or HCatalog, job working dir service).

        The intent of adding this interface is to be able to fetch tokens irrespective of the IF/OF - assuming the IF/OF implement the interface. For HBase / HCatalog sources which are outside of the IF/OF for a MR job - I don't think we have the capability for fetching tokens, and rely on the user providing them up front. That seems like a reasonable approach for now. Alternately, we could add a config specifying a list of classes which implement this interface - and can be invoked by the client code.

        DelegationTokens fetching should not be tied to split computation.

        Completely agree with this. I don't think we can do this though - without making an incompatible change. We could explicitly fetch Credentials (if the interface is implemented), but at least some existing IF/OFs will continue to rely on getSplits / checkOutputSpecs for tokens.

        We could have a utility class that we pass a UGI, list of service URIs and returns a populated Credentials with tokens for all the specified services. The IF/OF/Job would have to be able to extract the required URIs for the job.

        Would this utility class know how to handle all kinds of URIs? I think it's better to leave the implementation of the credentials-fetching code to the specific system (MR / HBase / HCatalog). Configure a list of CredentialProviders - which know how to fetch Credentials for the specific system.

        Alejandro Abdelnur added a comment -

        The Oozie server is responsible for obtaining all the tokens the main job may need:

        • tokens to run the job (working dir, jobtokens)
        • tokens for the Input and Output data (typically HDFS tokens, but they can be for different file systems, for Hbase, for HCatalog, etc).

        For the typical case of running an MR job (directly or via Pig/Hive), the tokens of launcher job are sufficient for the main job. They just need to be propagated. The Oozie server makes sure the "mapreduce.job.complete.cancel.delegation.tokens" property is set to FALSE for the launcher job (Oozie gets rid of the launcher job for MR jobs once the main job is running).

        For scenarios where the main job needs to interact with different services, Oozie must acquire them in advance. For HDFS this is done by simply setting the "MRJobConfig.JOB_NAMENODES" property, then the launcher job submission will get those tokens. For Hbase or HCatalog, Oozie has a CredentialsProvider that obtains those tokens (the requirement here is that Oozie is configured as proxy user in those services in order to get tokens for the user submitting the job).

        From what it seems, you are after generalizing this. I think we should do it with a slight twist from what you are proposing:

        • DelegationTokens should be always requested by the client, security enabled or not, computing the splits on the client or not.
        • DelegationTokens fetching should be done regardless of the IF/OF implementation (take the case of talking with Hbase or HCatalog, job working dir service).
        • DelegationTokens fetching should not be tied to split computation.

        We could have a utility class that we pass a UGI, list of service URIs and returns a populated Credentials with tokens for all the specified services.

        The IF/OF/Job would have to be able to extract the required URIs for the job.

        Also, this mechanism could be used to obtain ALL tokens the AM needs.
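
        A minimal sketch of the launcher-side configuration described above; only the two property names quoted in this comment come from the source, while the helper class and the extra NameNode URI are illustrative:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.MRJobConfig;

        public class LauncherJobConfSketch {

          // Prepares an Oozie-launcher-style job configuration: keep delegation tokens
          // alive after the launcher completes, and list extra NameNodes so that
          // JobSubmitter#populateTokenCache fetches tokens for them at submission time.
          public static Job newLauncherJob(Configuration conf) throws Exception {
            conf.setBoolean("mapreduce.job.complete.cancel.delegation.tokens", false);
            conf.setStrings(MRJobConfig.JOB_NAMENODES, "hdfs://other-nn:8020"); // illustrative URI
            return Job.getInstance(conf);
          }
        }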

        Siddharth Seth added a comment -

        That's two sets of tokens that are obtained - for the working directory, and for any additional HDFS servers which the user may have configured.
        In addition to this, tokens may be obtained by Input/OutputFormats.

        From FileInputFormat:

        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
          throw new IOException("No input paths specified in job");
        }

        // get tokens for all the required FileSystems..
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs,
                                            job.getConfiguration());
        

        getInputPaths reads the property "mapreduce.input.fileinputformat.inputdir" - which is specific to FIF. If the input paths reside on a different Namenode than the one on which the staging directory is, I don't think users must set MRJobConfig.JOB_NAMENODES. The tokens would just be picked up as part of client side split generation.

        In terms of Oozie, from what I understand, the JobSubmitter does not get invoked on a box with kerberos credentials - not for the main job anyway (maybe for the launcher) - so this code to obtain tokens doesn't kick in. If that's the case, my guess is Oozie has additional configuration, and explicitly goes out and fetches tokens before submitting the launcher.

        Alejandro Abdelnur added a comment -

        This works out of the box for MR jobs because typically the same FileSystem where the IN/OUT data resides is the one used for the submission dir.

        If you need to use different FileSystems (e.g. distcp), this is achieved by setting the MRJobConfig.JOB_NAMENODES property in the job configuration; this is handled in JobSubmitter.java in the following code:

          //get secret keys and tokens and store them into TokenCache
          private void populateTokenCache(Configuration conf, Credentials credentials) 
          throws IOException{
            readTokensFromFiles(conf, credentials);
            // add the delegation tokens from configuration
            String [] nameNodes = conf.getStrings(MRJobConfig.JOB_NAMENODES);
            LOG.debug("adding the following namenodes' delegation tokens:" + 
                Arrays.toString(nameNodes));
            if(nameNodes != null) {
              Path [] ps = new Path[nameNodes.length];
              for(int i=0; i< nameNodes.length; i++) {
                ps[i] = new Path(nameNodes[i]);
              }
              TokenCache.obtainTokensForNamenodes(credentials, ps, conf);
            }
          }
        
        Siddharth Seth added a comment -

        That's only for the submission directory. The tokens for the actual data may be different, and are tied to the I/OFormats. Does this code actually get invoked when submitting a job to Oozie (on the client machine)?
        What I don't understand is the following - when a job is submitted to Oozie, how does Oozie know which tokens it needs to obtain on the client machine? (Which namenodes will be used to execute the MR job / workflow, tokens from other sources?, etc.)

        Alejandro Abdelnur added a comment -

        This is done in the MR JobSubmitter.java, in the submitJobInternal(...) method:

              // get delegation token for the dir
              TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                  new Path[] { submitJobDir }, conf);
        
              populateTokenCache(conf, job.getCredentials());
        

        Is this what you are after?

        Siddharth Seth added a comment -

        On the AM client side you collect all the tokens you need and write them to HDFS using the Credentials.writeTokenStorageFile() method to HDFS.

        The question is on what basis these tokens are collected. Once tokens are available - it's fairly straightforward to make them available elsewhere on the cluster.
        Currently, tokens are automatically acquired when the MR JobClient calls getSplits() and checkOutputSpecs. How tokens are obtained in these method calls really depends upon the Input / OutputFormat being used.
        As far as I know, Oozie doesn't actually call these methods on the client - I'm not sure how it knows which tokens are required - guessing this is some additional Oozie configuration.

        Vinod Kumar Vavilapalli, yes MAPREDUCE-207 is a scenario for this, where we want to skip at least the getSplits() operation on the client.

        Alejandro Abdelnur added a comment -

        Siddharth Seth, for MR this is fully cooked.

        It works something like this:

        • On the AM client side you collect all the tokens you need and write them to HDFS using the Credentials.writeTokenStorageFile() method to HDFS.
        • The HADOOP_TOKEN_FILE_LOCATION env variable, pointing to that file, is set in the AM environment.
        • Then, when calling UGI.getLoginUser() in the AM, the UGI credentials should be populated with the contents of the token file written by the AM client.
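
        A rough sketch of the handoff in these bullets, assuming the client has already collected its tokens; the staging path and the wrapper class are illustrative, while writeTokenStorageFile, HADOOP_TOKEN_FILE_LOCATION, and UGI.getLoginUser() are the pieces named above:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.security.Credentials;
        import org.apache.hadoop.security.UserGroupInformation;

        public class TokenHandoffSketch {

          // Client side: persist the collected tokens to a file on HDFS.
          public static Path writeTokens(Credentials credentials, Configuration conf)
              throws Exception {
            Path tokenFile = new Path("/user/submitter/.staging/app_0001/appTokens"); // illustrative path
            credentials.writeTokenStorageFile(tokenFile, conf);
            // The AM launch context must then export HADOOP_TOKEN_FILE_LOCATION,
            // pointing at the (localized) token file, in the AM's environment.
            return tokenFile;
          }

          // AM side: UserGroupInformation reads HADOOP_TOKEN_FILE_LOCATION on login,
          // so the written tokens show up in the login user's credentials.
          public static void checkTokensInAm() throws Exception {
            UserGroupInformation ugi = UserGroupInformation.getLoginUser();
            System.out.println("Tokens visible to the AM: " + ugi.getTokens().size());
          }
        }
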
        Vinod Kumar Vavilapalli added a comment -

        I suppose the scenario is MAPREDUCE-207?

        Siddharth Seth added a comment -

        if the AM has the corresponding delegation tokens, things work just fine, Oozie has been doing this for years; the splits are computed in the launcher job which does not have kerberos credentials.

        Alejandro Abdelnur, my guess is the launcher job is able to compute splits because it already has access to the tokens. Do you know how Oozie ensures that job-specific tokens are available to the launcher, without calling getSplits etc. on the client node?
        The intent of this JIRA is to be able to get the necessary tokens on the client (based on the Job Configuration), without actually invoking getSplits() and checkOutputSpecs. The token functionality is currently baked into these methods.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12622693/MAPREDUCE-5663.6.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4317//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4317//console

        This message is automatically generated.

        Alejandro Abdelnur added a comment (edited) -

        Siddharth Seth, Arun C Murthy, why is this needed? If the AM has the corresponding delegation tokens, things work just fine; Oozie has been doing this for years; the splits are computed in the launcher job, which does not have kerberos credentials.

        Siddharth Seth added a comment -

        Updated patch to rename the method to 'obtainCredentials', and added some more Javadoc. Arun C Murthy, please add additional javadocs if you think more is required.

        Arun C Murthy added a comment -

        +1, mostly lgtm.

        One nit: I'd like to change the api to 'obtainCredentials' rather than 'addCredentials'. That threw me off a little. Last thing - a more descriptive javadoc would help too. Thanks.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12621162/MAPREDUCE-5663.5.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4296//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4296//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        Updated to fix the RAT warning - missing Apache license in one file.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12620710/MAPREDUCE-5663.4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        -1 release audit. The applied patch generated 1 release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4294//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4294//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4294//console

        This message is automatically generated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12620710/MAPREDUCE-5663.4.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        -1 release audit. The applied patch generated 1 release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4285//testReport/
        Release audit warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4285//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4285//console

        This message is automatically generated.

        Siddharth Seth added a comment -

        Updated patch after talking to Michael offline. Fixes stuff I'd mentioned in the previous comment, and adds some unit tests.

        Siddharth Seth added a comment -

        Thanks for taking this up, Michael Weng. Some comments on the patch:

        • FileInputFormat - a new public getInputPaths(...) method should not be added. This could instead be a private getInputPathsInternal(...)
        • FileOutputFormat - path normalization, similar to what is done in checkOutputSpecs, is likely required.
        • CredentialsProvider - the JavaDoc should be a little more descriptive as to the purpose of this interface. Was thinking along the lines of
          /**
           * <code>CredentialsProvider</code> is an interface that can be implemented by
           * components that may need to obtain credentials, which may
           * be required to function on a secured cluster.
           *
           * @param conf
           *          component specific {@link Configuration}.
           * @param credentials
           *          an instance of {@link Credentials} to which credentials
           *          will be added.
           */

          Could also do with some changes to the FIF/FOF unit tests.

        Michael Weng added a comment -

        Updated.

        Michael Weng added a comment -

        Updated the patch with some fixes to xxxInputFormat and added changes for xxxOutputFormat.

        Michael Weng added a comment -

        Attached the initial patch based on GA 2.2.0.

        Siddharth Seth added a comment -

        Proposed API / Changes

        interface CredentialsProvider (in hadoop-common)
        void addCredentials(Configuration conf, Credentials credentials);

        Change at least FileInput/OutputFormats to implement this interface to return the tokens that will eventually be required. We can add additional Input/OutputFormats as required.

        Any system (MR/Tez) which wants to do AM split generation - would need to check if the interface is implemented by the relevant Input/OutputFormat - and decide whether to generate splits on the client side or just get tokens on the client side when running on a secure cluster.
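
        As an illustration of this proposal, a file-based InputFormat might implement the interface along the following lines. The interface declaration mirrors the signature proposed above (a throws IOException clause is assumed); the class name is hypothetical, and the property name and TokenCache call are the ones discussed earlier in this thread, not code from the attached patches:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.security.TokenCache;
        import org.apache.hadoop.security.Credentials;

        // Mirrors the proposed hadoop-common interface; the throws clause is assumed.
        interface CredentialsProvider {
          void addCredentials(Configuration conf, Credentials credentials) throws IOException;
        }

        // Hypothetical file-based InputFormat that exposes its token requirements
        // explicitly, doing the same NameNode token fetch that getSplits() performs today.
        public class TokenAwareFileInputFormat implements CredentialsProvider {

          @Override
          public void addCredentials(Configuration conf, Credentials credentials)
              throws IOException {
            String[] dirs = conf.getStrings("mapreduce.input.fileinputformat.inputdir");
            if (dirs == null || dirs.length == 0) {
              return; // no input paths configured, nothing to fetch
            }
            Path[] paths = new Path[dirs.length];
            for (int i = 0; i < dirs.length; i++) {
              paths[i] = new Path(dirs[i]);
            }
            TokenCache.obtainTokensForNamenodes(credentials, paths, conf);
          }
        }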


          People

          • Assignee: Michael Weng
          • Reporter: Siddharth Seth
          • Votes: 0
          • Watchers: 9
