Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
I noticed that in many places the Hadoop `Configuration` class is initialized via `HadoopUtils.newConfiguration()`. This method explicitly sets a few S3-related properties:
```
String awsAccessKeyId = System.getenv(AWS_ACCESS_KEY_ID);
String awsSecretAccessKey = System.getenv(AWS_SECRET_ACCESS_KEY);
if (awsAccessKeyId != null && awsSecretAccessKey != null) {
  configuration.set("fs.s3.awsAccessKeyId", awsAccessKeyId);
  configuration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey);
}
```
While this works for s3 and s3n, it does not account for s3a. It also doesn't work for any other filesystems that may need configuration properties.
It seems like it would be better to initialize the Hadoop `Configuration` based on the combination of the config and job properties. If this were done, the credentials would be defined in the .properties file like:
```
# Hadoop configuration
fs.s3a.access.key=${env:AWS_ACCESS_KEY_ID}
fs.s3a.secret.key=${env:AWS_SECRET_ACCESS_KEY}
fs.s3a.buffer.dir=${env:GOBBLIN_WORK_DIR}/buffer
```
Additionally, they could be overridden in the job files. The existing password encryption could be used to secure this information if it is not being pulled from an environment variable.
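For illustration, the `${env:...}` substitution assumed in the example above could be implemented with a small helper like this. This is a hypothetical sketch, not existing Gobblin code; `EnvResolver` and `resolve` are invented names, and a `Function` is taken instead of calling `System.getenv` directly so the behavior is easy to test:

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EnvResolver {
  private static final Pattern ENV_REF = Pattern.compile("\\$\\{env:([^}]+)\\}");

  // Replaces each ${env:NAME} with env.apply(NAME); leaves the
  // reference untouched when the variable is not set.
  public static String resolve(String value, Function<String, String> env) {
    Matcher m = ENV_REF.matcher(value);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String replacement = env.apply(m.group(1));
      m.appendReplacement(sb,
          Matcher.quoteReplacement(replacement != null ? replacement : m.group(0)));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```

In production the lookup function would simply be `System::getenv`.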
Lastly, I would like to propose a further enhancement. If the configuration keys were specified like `data.publisher.fs.s3a.access.key`, then different uses of the filesystem could have different settings.
Feedback?
Github Url : https://github.com/linkedin/gobblin/issues/532
Github Reporter : jbaranick
Github Created At : 2015-12-16T17:12:10Z
Github Updated At : 2017-01-12T04:32:42Z
Comments
jbaranick wrote on 2015-12-16T18:01:34Z : Can you explain why `HadoopUtils` has two different mechanisms for creating a configuration? One creates a configuration without pulling any settings from properties: `public static Configuration newConfiguration()`. The other loads the properties from `State` and adds them to the configuration: `public static Configuration getConfFromState(State state)`.
Additionally, there is `JobConfigurationUtils.putStateIntoConfiguration(State state, Configuration configuration)`.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-165195748
ibuenros wrote on 2015-12-22T19:12:20Z : You are right that `Configuration` creation is inconsistent, and this prevents certain filesystems from working correctly. Besides the examples you mentioned, sometimes we just create them using `new Configuration()`. The reason for this is simply that we never centralized the creation of `Configuration`.
This is a high priority issue at this point. To avoid duplicating work, are you already working on a fix for this?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-166706399
jbaranick wrote on 2015-12-22T19:13:35Z : @ibuenros I've reworked a couple of these internally to do what I suggested above. I can do more, but would like to lock down the approach first.
It would be really helpful for me to understand the logic and reasoning behind `newConfiguration()`, `getConfFromState(State state)`, and `putStateIntoConfiguration(State state, Configuration configuration)`.
Additionally, do we need to figure out how to avoid storing sensitive information (such as AWS keys) in the state store, or does the encryption shield us from leaking this info?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-166706642
ibuenros wrote on 2015-12-22T19:18:13Z : From discussions at LinkedIn, we think there should be a centralized `newConfiguration(State)` method, probably in `HadoopUtils`. Additionally, there should be a method `HadoopUtils.newFileSystem(URI, State)` which uses `HadoopUtils.newConfiguration(State)`.
Also, a better approach than hard-coding the keys copied from `State` to `Configuration` is to take any key with the prefix `hadoop.inject` and add it to the `Configuration` (for example, `hadoop.inject.fs.s3.awsAccessKeyId` would be injected as `fs.s3.awsAccessKeyId`). This is the approach Azkaban uses, and it allows other `FileSystem`s to be used in the future. How does this sound?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-166707524
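The `hadoop.inject` idea above can be sketched as follows. To keep the example self-contained, a plain `Map` stands in for Gobblin's `State` and Hadoop's `Configuration`; the class and method names are illustrative, not the eventual API:

```java
import java.util.HashMap;
import java.util.Map;

public class HadoopInject {
  static final String INJECT_PREFIX = "hadoop.inject.";

  // Copies every state key starting with "hadoop.inject." into the
  // configuration map with the prefix stripped; other keys are ignored.
  public static Map<String, String> inject(Map<String, String> state) {
    Map<String, String> conf = new HashMap<>();
    for (Map.Entry<String, String> e : state.entrySet()) {
      if (e.getKey().startsWith(INJECT_PREFIX)) {
        conf.put(e.getKey().substring(INJECT_PREFIX.length()), e.getValue());
      }
    }
    return conf;
  }
}
```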
jbaranick wrote on 2015-12-22T19:33:56Z : @ibuenros My initial feeling is that `HadoopUtils.newConfiguration(State)` and `HadoopUtils.newFileSystem(URI, State)` are a good idea.
I also agree that hardcoding the keys is not the best solution. However, I had a different take on how to deal with that: I would prefer to have different components use different prefixes. For instance, when getting a configuration for a `DataPublisher`, we would inject all configuration values starting with `data.publisher.`; for example, `data.publisher.fs.s3a.access.key` would be injected as `fs.s3a.access.key`. This enables different components to use different configuration. One scenario I am thinking about is that we want to source data from one S3 location and publish it to a different S3 location.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-166710749
jbaranick wrote on 2016-01-04T18:25:59Z : @ibuenros Can you comment on this so we can figure out how to proceed?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168759208
ibuenros wrote on 2016-01-04T18:37:41Z : That sounds like a great idea, but I suggest having both global injects (using `hadoop.inject.*` or similar) and the prefixed injects you mentioned, with the prefixed ones overriding the global ones. This allows setting the same access keys for all parts of the pipeline, for example, instead of having to repeat similar configuration.
What API were you thinking about? Something like `HadoopUtils.newConfiguration(State, Optional<String> prefix)`?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168762668
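The semantics discussed above, global `hadoop.inject.*` keys applied first and component-prefixed keys overriding them, might look roughly like this. As before, a plain `Map` stands in for `State` and `Configuration`, and the class name is invented; only the method shape follows the proposed `newConfiguration(State, Optional<String> prefix)`:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class ConfFactory {
  static final String GLOBAL_PREFIX = "hadoop.inject.";

  // Builds a configuration map: global hadoop.inject.* keys first,
  // then keys under the component prefix, which override the globals.
  public static Map<String, String> newConfiguration(
      Map<String, String> state, Optional<String> prefix) {
    Map<String, String> conf = new HashMap<>();
    copyStripped(state, GLOBAL_PREFIX, conf);
    prefix.ifPresent(p -> copyStripped(state, p, conf));
    return conf;
  }

  private static void copyStripped(Map<String, String> state, String prefix,
      Map<String, String> conf) {
    for (Map.Entry<String, String> e : state.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        conf.put(e.getKey().substring(prefix.length()), e.getValue());
      }
    }
  }
}
```

Called with `Optional.empty()`, only the global injects apply; called with a component prefix such as `"data.publisher."`, the component's keys win on conflict.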
jbaranick wrote on 2016-01-04T18:39:58Z : @ibuenros That works for me, and that API signature seems fine. Can you also speak to my earlier question? It would be really helpful for me to understand the logic and reasoning behind `newConfiguration()`, `getConfFromState(State state)`, and `putStateIntoConfiguration(State state, Configuration configuration)`.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168763182
ibuenros wrote on 2016-01-04T18:42:49Z : I don't think there is any logic or reasoning other than that the people who implemented each of them were unaware of the existence of the other methods. The idea would be to replace all similar calls with the method in `HadoopUtils`.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168763847
jbaranick wrote on 2016-01-04T18:43:27Z : Works for me.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168764006
ibuenros wrote on 2016-01-04T18:45:30Z : So are you currently / planning to work on this? Or should we try to schedule it in the LinkedIn team?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-168764510
jbaranick wrote on 2016-01-11T23:51:33Z : @ibuenros It depends on how quickly you want this done. I'm waiting on a couple of open PRs for Gobblin to be merged, and then I may be able to get to this.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-170735227
jbaranick wrote on 2016-02-12T06:30:26Z : @ibuenros I think we can punt on this for a bit until your new config system is in place (or until it bubbles up in priority). What do you think?
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-183199151
ibuenros wrote on 2016-02-12T17:58:51Z : I don't think the two are closely related (I may be wrong), but we have no particular rush on this issue, so it is up to you.
Github Url : https://github.com/linkedin/gobblin/issues/532#issuecomment-183430833