Hadoop Common / HADOOP-2046

Documentation: improve mapred javadocs



    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.14.2
    • Fix Version/s: 0.15.0
    • Component/s: documentation
    • Labels: None


      I'd like to put forward some thoughts on how to structure reasonably detailed documentation for hadoop.

      Essentially, I can think of at least 3 different profiles to target:

      • hadoop-dev, folks who are actively involved in improving/fixing hadoop.
      • hadoop-user
        • mapred application writers and/or folks who directly use hdfs
        • hadoop cluster administrators

      For this issue, I'd like to first target the latter category (admins and hdfs/mapred users), which is arguably where we get the biggest bang for the buck right now.
      There is a crying need to get user-level stuff documented, judging by the sheer number of emails we get on the hadoop lists...

      1. Installation/Configuration Guides

      This set of documents caters to folks ranging from someone just playing with hadoop on a single node to operations teams who administer hadoop on thousands of nodes. To ensure we cover all bases, I'm thinking along the lines of:

      • Download, install and configure hadoop on a single-node cluster: including a few comments on how to run examples (word-count) etc.
      • Admin Guide: Install and configure a real, distributed cluster.
      • Tune Hadoop: Separate sections on how to tune hdfs and map-reduce, targeting power admins/users.

      I reckon most of this would be done via forrest, with appropriate links to javadoc.

      2. User Manual

      This set is geared towards people who use hdfs and/or map-reduce per se. Stuff to document:

      • Write a really simple mapred application, just fitting the blocks together, i.e. maybe a walk-through of a couple of examples like word-count, sort etc.
      • Detailed information on important map-reduce user-interfaces:
        • JobConf
        • JobClient
        • Tool & ToolRunner
        • InputFormat
          • InputSplit
          • RecordReader
        • Mapper
        • Reducer
        • Reporter
        • OutputCollector
        • Writable
        • WritableComparable
        • OutputFormat
        • DistributedCache
      • SequenceFile
        • Compression types: NONE, RECORD, BLOCK
      • Hadoop Streaming
      • Hadoop Pipes

      I reckon most of this would end up in the javadocs, specifically package.html, with some of it done via forrest.
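
      To give a feel for the kind of word-count walk-through mentioned above, here is a minimal sketch of the map, group and reduce stages in plain Java. It deliberately does not use the hadoop API: the class `WordCountSketch` and its `map`, `reduce` and `run` methods are hypothetical names made up for illustration, and the in-memory grouping only stands in for what the framework does between Mapper and Reducer. A real walk-through would use JobConf, Mapper, Reducer and OutputCollector instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a single-process stand-in for a word-count job.
// None of these names come from the hadoop API.
public class WordCountSketch {

    // Map stage: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Reduce stage: sum all the counts that were grouped under one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (sorted), reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello world");
        System.out.println(run(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

      The point of the sketch is only the shape of the data flow: the user writes the map and reduce steps, and everything in `run` is what the framework would take care of.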

      Also, as discussed in HADOOP-1881, it would be quite useful to maintain documentation per release, even on the hadoop website, i.e. we could have a main documentation page that links to the documentation for each release and for trunk.



        1. HADOOP-2046_1_20071018.patch
          65 kB
          Arun Murthy
        2. HADOOP-2046_2_20071022.patch
          149 kB
          Arun Murthy
        3. HADOOP-2046_3_20071023.patch
          161 kB
          Arun Murthy
        4. HADOOP-2046_4_20071025.patch
          162 kB
          Arun Murthy




              acmurthy Arun Murthy