Issue Details (XML | Word | Printable)

Key: MAPREDUCE-475
Type: Improvement Improvement
Status: Open Open
Priority: Minor Minor
Assignee: Unassigned
Reporter: Aaron Kimball
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Hadoop Map/Reduce

JobConf should validate key names in well-defined namespaces and warn on misspelling

Created: 21/Feb/08 04:12 AM   Updated: 20/Jun/09 07:57 AM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Original Estimate: 72h
Original Estimate - 72h
Remaining Estimate: 72h
Remaining Estimate - 72h
Time Spent: Not Specified
Remaining Estimate - 72h

Issue Links:
Reference
 


 Description  « Hide
A discussion on the mailing list reveals that some configuration strings in the JobConf are deprecated over time and new configuration names replace them:

e.g., "mapred.output.compression.type" is now replaced with "mapred.map.output.compression.type"

Programmers who have been manually specifying the former string, however, receive no diagnostic output during testing to suggest that their compression type is being silently ignored.

It would be desirable to notify developers of this change by printing a warning message when deprecated configuration names are used in a newer version of Hadoop. More generally, when any configuration string in the mapred.*, fs.*, dfs.*, etc namespaces are provided by a user and are not recognized by Hadoop, it is desirable to print a warning, to indicate malformed configurations. No warnings should be printed when configuration keys are in user-defined namespaces (e.g., "myprogram.mytask.myvalue").



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Aaron Kimball added a comment - 21/Feb/08 07:00 AM
I surveyed the hadoop TRUNK source by using Eclipse's Java Search; I looked for all instances of org.apache.hadoop.mapred.JobConf.set* and org.apache.hadoop.mapred.JobConf.get* methods.

I believe the following to be a complete list of key names explicitly set within the hadoop Java source :
Any entries with a "@@@" postscript are keys that are set only in @deprecated methods.

aggregator.descriptor. {i\*} <--- numeric end stem...
aggregator.descriptor.num
dfs.datanode.rack
dfs.datanode.startup
dfs.http.address
dfs.namenode.startup
dfs.secondary.http.bindAddress
dfs.umask
fs.default.name
fs.hdfs.impl
hadoop.native.lib
hadoop.pipes.command-file.keep
hadoop.pipes.executable
hadoop.pipes.java.mapper
hadoop.pipes.java.recordreader
hadoop.pipes.java.recordwriter
hadoop.pipes.java.reducer
hadoop.pipes.partitioner
io.compression.codecs
io.seqfile.compression.type
ipc.client.connection.maxidletime
ipc.client.timeout
job.end.notification.url
jobclient.output.filter
keep.failed.task.files
keep.task.files.pattern
map.input.file
map.input.length
map.input.start
mapred.cache.archives
mapred.cache.archives.timestamps
mapred.cache.files
mapred.cache.files.timestamps
mapred.cache.localArchives
mapred.cache.localFiles
mapred.combiner.class
mapred.compress.map.output
mapred.create.symlink
mapred.input.dir
mapred.input.format.class
mapred.input.key.class @@@
mapred.input.value.class @@@
mapred.jar
mapred.job.classpath.archives
mapred.job.classpath.files
mapred.job.history.http.bindAddress
mapred.job.id
mapred.job.name
mapred.job.priority
mapred.job.split.file
mapred.job.tracker
mapred.job.tracker.http.bindAddress
mapred.local.dir
mapred.map.max.attempts
mapred.map.output.compression.codec
mapred.map.output.compression.type
mapred.map.runner.class
mapred.map.task.debug.script
mapred.map.tasks
mapred.map.tasks.speculative.execution
mapred.mapoutput.key.class
mapred.mapoutput.value.class
mapred.mapper.class
mapred.mapper.regex
mapred.max.map.failures.percent
mapred.max.reduce.failures.percent
mapred.max.tracker.failures
mapred.min.split.size
mapred.output.compress
mapred.output.compression.codec
mapred.output.compression.type
mapred.output.dir
mapred.output.format.class
mapred.output.key.class
mapred.output.key.comparator.class
mapred.output.value.class
mapred.output.value.groupfn.class
mapred.partitioner.class
mapred.reduce.max.attempts
mapred.reduce.task.debug.script
mapred.reduce.task.debug.script
mapred.reduce.tasks
mapred.reduce.tasks.speculative.execution
mapred.reducer.class
mapred.reducer.separator
mapred.reducer.sort
mapred.speculative.execution @@@
mapred.task.id
mapred.task.is.map
mapred.task.partition
mapred.task.profile
mapred.task.profile.maps
mapred.task.profile.reduces
mapred.task.tracker.report.bindAddress
mapred.tip.id
mapred.working.dir
sequencefile.filter.class
sequencefile.filter.frequency
sequencefile.filter.regex
session.id
user.name


And the following keys are explicitly retrieved by a get*() method somewhere in the Java source:

aggregate.max.num.unique.values
aggregator.descriptor.{i*}} <---- ints
aggregator.descriptor.num
create.empty.dir.if.nonexist
dfs.balance.bandwidthPerSec
dfs.block.size
dfs.blockreport.initialDelay
dfs.blockreport.intervalMsec
dfs.client.block.write.retries
dfs.data.dir
dfs.datanode.address
dfs.datanode.bindAddress
dfs.datanode.block.write.timeout.sec
dfs.datanode.dns.interface
dfs.datanode.dns.nameserver
dfs.datanode.du.pct
dfs.datanode.du.reserved
dfs.datanode.http.address
dfs.datanode.info.bindAddress
dfs.datanode.info.port
dfs.datanode.numblocks
dfs.datanode.port
dfs.datanode.rack
dfs.datanode.scan.period.hours
dfs.datanode.simulateddatastorage
dfs.datanode.startup
dfs.default.chunk.view.size
dfs.df.interval
dfs.heartbeat.interval
dfs.hosts
dfs.hosts.exclude
dfs.http.address
dfs.info.bindAddress
dfs.info.port
dfs.max-repl-streams
dfs.max.objects
dfs.name.dir
dfs.namenode.decommission.interval
dfs.namenode.handler.count
dfs.namenode.startup
dfs.network.script
dfs.permissions.
dfs.permissions.supergroup
dfs.read.prefetch.size
dfs.replication
dfs.replication.considerLoad
dfs.replication.interval
dfs.replication.max
dfs.replication.min
dfs.replication.pending.timeout.sec
dfs.safemode.extension
dfs.safemode.threshold.pct
dfs.secondary.http.address
dfs.secondary.info.bindAddress
dfs.secondary.info.port
dfs.socket.timeout
dfs.umask
dfs.upgrade.permission
dfs.web.ugi
fs.*.impl <-------wildcard allowed for URI-based scheme slot
fs.checkpoint.dir
fs.checkpoint.period
fs.checkpoint.size
fs.default.name
fs.inmemory.size.mb
fs.kfs.metaServerHost
fs.kfs.metaServerPort
fs.local.block.size
fs.s3.awsAccessKeyId
fs.s3.awsSecretAccessKey
fs.s3.buffer.dir
fs.s3.maxRetries
fs.s3.sleepTimeSeconds
fs.trash.interval
hadoop.job.history.location
hadoop.job.history.user.location
hadoop.native.lib
hadoop.pipes.command-file.keep
hadoop.pipes.executable
hadoop.pipes.java.mapper
hadoop.pipes.java.recordreader
hadoop.pipes.java.recordwriter
hadoop.pipes.java.reducer
hadoop.pipes.partitioner
hadoop.rpc.socket.factory.class.* <-- wildcard
hadoop.rpc.socket.factory.class.default
hadoop.socks.server
heartbeat.recheck.interval
io.bytes.per.checksum
io.compression.codec.lzo.buffersize
io.compression.codec.lzo.compressor
io.compression.codec.lzo.decompressor
io.compression.codecs
io.file.buffer.size
io.map.index.skip
io.seqfile.compress.blocksize
io.seqfile.compression.type
io.skip.checksum.errors
io.sort.factor
io.sort.mb
ipc.client.connect.max.retries
ipc.client.connection.maxidletime
ipc.client.idlethreshold
ipc.client.kill.max
ipc.client.maxidletime
ipc.client.tcpnodelay
ipc.client.timeout
ipc.server.listen.queue.size
java.library.path
job.end.notification.url
job.end.retry.attempts
job.end.retry.interval
jobclient.output.filter
keep.failed.task.files
keep.task.files.pattern
key.value.separator.in.input.line
local.cache.size
map.input.file
map.output.key.field.separator
map.output.key.value.fields.spec
map.sort.class
mapred.cache.archives
mapred.cache.archives.timestamps
mapred.cache.files
mapred.cache.files.timestamps
mapred.cache.localArchives
mapred.cache.localFiles
mapred.child.java.opts
mapred.child.tmp
mapred.combiner.class
mapred.compress.map.output
mapred.create.symlink
mapred.data.field.separator
mapred.debug.out.lines
mapred.hosts
mapred.hosts.exclude
mapred.inmem.merge.threshold
mapred.input.dir
mapred.input.format.class
mapred.input.key.class @@@
mapred.input.value.class @@@
mapred.jar
mapred.job.classpath.archives
mapred.job.classpath.files
mapred.job.history.http.bindAddress
mapred.job.id
mapred.job.name
mapred.job.priority
mapred.job.split.file
mapred.job.tracker
mapred.job.tracker.handler.count
mapred.job.tracker.http.address
mapred.job.tracker.info.bindAddress
mapred.job.tracker.info.port
mapred.job.tracker.persist.jobstatus.active
mapred.job.tracker.persist.jobstatus.dir
mapred.job.tracker.persist.jobstatus.hours
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.retirejob.check"
mapred.jobtracker.retirejob.interval
mapred.jobtracker.taskalloc.capacitypad
mapred.jobtracker.taskalloc.loadbalance.epsilon
mapred.join.expr
mapred.join.keycomparator
mapred.local.dir
mapred.local.dir.minspacekill
mapred.local.dir.minspacestart
mapred.map.max.attempts
mapred.map.multithreadedrunner.threads
mapred.map.output.compression.codec
mapred.map.output.compression.type
mapred.map.runner.class
mapred.map.task.debug.script
mapred.map.tasks
mapred.map.tasks.speculative.execution
mapred.mapoutput.key.class
mapred.mapoutput.value.class
mapred.mapper.class
mapred.mapper.regex
mapred.mapper.regex.group
mapred.max.map.failures.percent
mapred.max.reduce.failures.percent
mapred.max.tracker.failures
mapred.min.split.size
mapred.output.compress
mapred.output.compression.codec
mapred.output.compression.type
mapred.output.dir
mapred.output.format.class
mapred.output.key.class
mapred.output.key.comparator.class
mapred.output.value.class
mapred.output.value.groupfn.class
mapred.partitioner.class
mapred.reduce.copy.backoff
mapred.reduce.max.attempts
mapred.reduce.parallel.copies
mapred.reduce.task.debug.script
mapred.reduce.tasks
mapred.reduce.tasks.speculative.execution
mapred.reducer.class
mapred.reducer.separator
mapred.reducer.sort
mapred.speculative.execution
mapred.speculative.execution @@@
mapred.submit.replication
mapred.system.dir
mapred.task.id
mapred.task.is.map
mapred.task.partition
mapred.task.profile
mapred.task.profile.maps
mapred.task.profile.reduces
mapred.task.timeout
mapred.task.tracker.http.address
mapred.task.tracker.report.address
mapred.task.tracker.report.bindAddress
mapred.task.tracker.report.port
mapred.tasktracker.dns.interface
mapred.tasktracker.dns.nameserver
mapred.tasktracker.expiry.interval
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.tasktracker.tasks.maximum
mapred.tip.id
mapred.userlog.limit.kb
mapred.userlog.retain.hours
mapred.working.dir
num.key.fields.for.partition
reduce.output.key.value.fields.spec
sequencefile.filter.class
sequencefile.filter.frequency
sequencefile.filter.regex
session.id
tasktracker.contention.tracking
tasktracker.http.bindAddress
tasktracker.http.port
tasktracker.http.threads
user.jar.file
user.name

This does not cover HBase.

I propose an interface of the following methods:

class JobConfValidator { public JobConfValidator(); /** @return true if all keys are typed correctly using the spec in the issue description */ bool validateConfig(JobConf conf, bool printWarnings); /** @return true if this key is typed correctly using the spec in issue description */ bool validateKeyName(String key); /** @return true if key name begins with mapred.*, fs.*, etc. */ bool keyIsInReservedNamespace(String key); }

This can be mostly implemented by checking key names in the JobConf against a simple hashmap of key names and namespace names (mapred., fs., io., etc.).
Some bonus code would be necessary to handle the two keys that have arbitrary integer suffixes, and the key that allows wildcards in the middle (fs.*.impl).

An open question: Should all top-level key prefixes in the list above be considered namespaces fully-reserved by Hadoop? e.g., "session.id" and "user.name" are single-entry namespaces. Is all of session.* and user.* reserved?

Awaiting comments on these issues. If people like, I can code this up.


Joydeep Sen Sarma added a comment - 21/Feb/08 07:27 AM
+1. But ..:

the mechanism proposed here - by itself - will not prevent regressions. we can create a 'registry' of legal options - but that will not prevent a software change that adds a new option, moves semantics of old option to the new option - and then merrily leaves the old option behind in the registry.

(exactly what happened here. the tree has good looking code with references to deprecated options).

we would need isolated regression tests against each registered option (tall order indeed) for this to suffice.

besides - even the Java interfaces were obsoleted or their semantics changed without removing old interfaces. and that's not a 'lack of mechanism' issue - it's a pure software engineering issue.


Aaron Kimball added a comment - 21/Feb/08 08:28 AM
Joydeep,

You're definitely correct.

But in general, there are several problems with the JobConf system from a software engineering point of view:

1) Naming conventions don't exist. foo.bar.camelBaz, foo.bar.noncamelbaz, and foo.bar.dots.between.each.word are all used
2) The hierarchy imposed by the keys in the JobConfs have nothing to do with which modules actually use them. Two isolated modules can both depend on the same key for arbitrarily different functionality, tying one another together – and no system exists to prevent this.
3) The hierarchy is arbitrarily ignored: why does "map.input.file" exist, when there is already an established "mapred.map" hierarchy? What is the difference between "hadoop.job".* and "job.*" ? Shouldn't everything in the entire system technically be hadoop.* ?
4) Most config options are hardcoded throughout the source as raw strings; they are not placed in public static final Strings at the head of the dependent class, nor are they "registered" in any way with JobConf.

I think that a major refactoring of JobConf & friends is probably necessary to address all these issues. Furthermore, coding standards need to address formatting and hierarchy of config strings and approach this from the human side.

So for starters we can:
1) Add this mechanism, which at the very least will catch typos in user configurations
2) Encourage people who commit user patches to develop and enforce guidelines for naming conventions
3) Encourage people who commit user patches to require that patches update the JobConfValidator if they deprecate key names.

And longer-term, I may file another JIRA to address the rest of this.


Owen O'Malley added a comment - 21/Feb/08 05:28 PM
For every variable there should be an API that sets and gets the variable. That is much more stable than having a registry and they can be tracked by the API scanning tools.

And yes, the naming of the variables is adhoc and should (in theory!) be cleaned up. The problem with cleaning them up is exactly the people that don't always use the APIs for setting and getting the values...


Owen O'Malley added a comment - 21/Feb/08 06:07 PM
I don't think this meets the requirement of being a critical bug fix.

Doug Cutting added a comment - 21/Feb/08 07:10 PM
As others have stated, the preferred style should not be for programs to directly set these, but rather to use accessor methods. Such accessor methods should themselves be implemented with a constant to name the configuration parameter. It would be good to add warnings to Hudson when code accesses a parameter with a literal string, and I encourage patches which address this issue.

However this does not address parameters specified on the command line or in hadoop-site.xml. A registry would be required to properly check these. Global registries are hard to maintain and I am thus reluctant to introduce a new one. But we already have a global registry in hadoop-default.xml. Perhaps we should emit warnings if folks set properties in hadoop-site.xml that do not have a default value in hadoop-defaults.xml. Would that help?


Owen O'Malley added a comment - 21/Feb/08 07:19 PM
That could work if you only close the hadoop namespaces, I guess. (The ability of applications to define new attributes is pretty critical for map/reduce.) I guess doing that would be a good push for normalizing the names to something like having "hadoop." as a prefix for every framework property.

Nigel Daley added a comment - 21/Feb/08 07:56 PM
As Joydeep points out, we've also seen incompatibilities introduced by changing the semantics of an existing property. I think we need to be better about deprecating properties (and their accessor methods) and introducing new ones when a semantic change is required. Effectively, semantic changes require a new property. Enforcing this thru testing will require specific tests for each properties current semantics (not a small effort) so I think for now this should be added to our code review guidelines:
http://wiki.apache.org/hadoop/CodeReviewChecklist

Aaron Kimball added a comment - 22/Feb/08 07:41 AM
@Doug: Adding a warning on hadoop-site defining a value absent from hadoop-defaults is a good idea.

@Owen: If we performed a major refactoring of command names, we could perhaps add a very small checklist registry in JobConf, which only examines the top-level component of the key being set. So if we move all of mapred.* into hadoop.mapred.*, for example, the top-level namespaces "mapred", "fs", "dfs", "io", etc. could be made to trigger a warning that the key has been deprecated (and the subsequent release could fail with an error that it is now no-longer used); setting values inside "hadoop.*" is definitely ok. Setting values outside this family of namespaces entirely (user-defined keys) also would provoke no warning.

So what particular actions do we want to take?

Any/all of the following?

1) Add accessor/setter methods for all entries to JobConf and refactor all classes to use them (yikes)

2) Move all key names into hadoop.*
2a) Determine more sane and consistent names for the key names as long as we're moving them? (this requires a definition of what constitutes 2nd- and 3rd-level namespaces)

3) Add a top-level key namespace deprecation warning based on hard-coded list of deprecated top-level namespaces

4) divorce set(k, v) from setHadoop(k, v); only setHadoop() can set values in hadoop.*; it can be private to JobConf, which will force the use of JobConf-blessed method names (note: in the first release, set() will allow setting of hadoop.*, but will print a deprecation warning; the next release will ban this behavior.)

5) Something else..?


Doug Cutting added a comment - 22/Feb/08 08:55 PM
We should not draw a hard-line between "system" parameters which would be well-checked and "user" parameters, where all bets are off. Hadoop is included as a component in larger systems, so any boundary must be extensible. So if there's a namespace boundary, one should be able to move it to include new namespaces, not hard-code it. And if hadoop-default,xml defines a boundary, then we'd need to permit folks who layer systems to somehow add things to the defaults, e.g., by resurrecting Configuration#addDefaultResource().

Renaming everything would break lots of folks, and I don't see that namespace boundaries are required. We can cover cases where things are set by code by adding accessor methods, and we can cover cases where things are set in files by checking against the (extensible) defaults. That covers all the cases w/o breaking every existing config file.

So my recipe would be:

  • fix all code to use accessor methods and constants, never literal property names
  • enforce this in Hudson
  • resurrect Configuration#addDefaultResource()
  • when loading non-default config files, warn unless overriding a default.

Adding accessors is a big job, but it's long overdue & entirely back-compatible.


Aaron Kimball added a comment - 22/Feb/08 09:02 PM
Fair enough. But what, if anything, should be done to address users specifying deprecated parameters in their programs?

Doug Cutting added a comment - 22/Feb/08 09:15 PM
> what, if anything, should be done to address users specifying deprecated parameters in their programs?

When we deprecate or remove a parameter, we should replace accesses to it with code that generates a warning or error if the parameter is still specified.


Aaron Kimball added a comment - 26/Feb/08 02:32 PM
Doug: I meant by using the manual get()/set() methods – not through a specific accessor method for the parameter. Unless you're suggesting that we fill both of those with large lists of if..then tests :\

So if we're supposed to migrate toward accessor methods and constants... do you have particular names for these methods in mind? Or should someone (maybe me?) just get in there and start adding them?


Doug Cutting added a comment - 26/Feb/08 09:07 PM
> Doug: I meant by using the manual get()/set() methods - not through a specific accessor method [...]

I don't have a proposal for how to catch those automatically. But we can clean up the core code considerably, providing a much better example for user code to model.

> do you have particular names for these methods in mind?

The style I prefer is to add static methods to the most relevant public class. Examples of these are:

  • SequenceFile#get/setCompressionType
  • DistributedCache#setCacheArchives. et al
  • SequenceFileInputFilter#setFilterClass et al
  • HADOOP-1967

Perhaps we should add annotations to such accessor methods, naming the parameter that the get or set, then we could use 'apt', the annotation processor, to build a list of all of these for the documentation?


Ryan Smith added a comment - 28/Aug/08 05:51 PM
+1

Instead of using :

jobConf.set(String key,String value)

I vote for using getters/setters (spring configurable) to pass just the value:

jobConf.setSomeProperty(String value)

And using constructors to instantiate objects (or at least offer it as an option).
Off hand, i know DistributedFileSystem() could benefit from a 2 arg constructor that calls initialize(arg1,arg2) so
DI frameworks like spring are supported.
I would be interested in submitting a patch soon for some of these config access issues on a one or two other classes.
Ill add the patch here or start a new issue for it, please let me know.
Thanks,

-Ryan