Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.14.2
    • Fix Version/s: 0.15.0
    • Component/s: documentation
    • Labels:
      None

      Description

      I'd like to put forward some thoughts on how to structure reasonably detailed documentation for hadoop.

      Essentially, I think of at least 3 different profiles to target:

      • hadoop-dev: folks who are actively involved in improving/fixing hadoop.
      • hadoop-user
        • mapred application writers and/or folks who directly use hdfs
        • hadoop cluster administrators

      For this issue, I'd like to first target the latter category (admins and hdfs/mapred users), which is arguably where the biggest bang for the buck is right now.
      There is a crying need to get user-level stuff documented, judging by the sheer number of emails we get on the hadoop lists...


      1. Installing/Configuration Guides

      This set of documents caters to folks ranging from someone just playing with hadoop on a single node to operations teams who administer hadoop clusters of several thousand nodes. To ensure we cover all the bases, I'm thinking along the lines of:

      • Download, install and configure hadoop on a single-node cluster, including a few notes on how to run the examples (word-count etc.); a minimal config sketch follows below.
      • Admin Guide: Install and configure a real, distributed cluster.
      • Tune Hadoop: Separate sections on how to tune hdfs and map-reduce, targeting power admins/users.

      I reckon most of this would be done via Forrest, with appropriate links to the javadoc.
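
      To make the single-node guide concrete, a minimal hadoop-site.xml could anchor that walk-through. This is only a sketch, assuming the current 0.15-era property names (fs.default.name, mapred.job.tracker, dfs.replication); the real guide should spell out each property:

        <?xml version="1.0"?>
        <configuration>
          <!-- Namenode host:port for the default file system. -->
          <property>
            <name>fs.default.name</name>
            <value>localhost:9000</value>
          </property>
          <!-- JobTracker host:port. -->
          <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
          </property>
          <!-- Single node, so no replication. -->
          <property>
            <name>dfs.replication</name>
            <value>1</value>
          </property>
        </configuration>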

      2. User Manual

      This set is geared towards people who use hdfs and/or map-reduce per se. Stuff to document:

      • Write a really simple mapred application, just fitting the blocks together, i.e. maybe a walk-through of a couple of examples like word-count, sort etc.; a minimal word-count sketch follows this list.
      • Detailed information on important map-reduce user-interfaces:
        • JobConf
        • JobClient
        • Tool & ToolRunner
        • InputFormat
          • InputSplit
          • RecordReader
        • Mapper
        • Reducer
        • Reporter
        • OutputCollector
        • Writable
        • WritableComparable
        • OutputFormat
        • DistributedCache
      • SequenceFile
        • Compression types: NONE, RECORD, BLOCK (a writer sketch follows this section)
      • Hadoop Streaming
      • Hadoop Pipes
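
      To make the word-count walk-through concrete, the skeleton might look roughly like this (a sketch against the current non-generic org.apache.hadoop.mapred interfaces, essentially the shape of the bundled example, with details elided):

        import java.io.IOException;
        import java.util.Iterator;
        import java.util.StringTokenizer;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;

        public class WordCount {

          // Mapper: emit (word, 1) for every token in the input line.
          public static class MapClass extends MapReduceBase implements Mapper {
            private final static IntWritable ONE = new IntWritable(1);
            private Text word = new Text();

            public void map(WritableComparable key, Writable value,
                            OutputCollector output, Reporter reporter)
                throws IOException {
              StringTokenizer itr = new StringTokenizer(((Text) value).toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
              }
            }
          }

          // Reducer (also usable as a combiner): sum the counts per word.
          public static class Reduce extends MapReduceBase implements Reducer {
            public void reduce(WritableComparable key, Iterator values,
                               OutputCollector output, Reporter reporter)
                throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += ((IntWritable) values.next()).get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(MapClass.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            JobClient.runJob(conf);  // submit the job and wait for completion
          }
        }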

      I reckon most of this would end up in the javadocs, specifically package.html, and some via Forrest.
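
      For the SequenceFile section, a small writer sketch could illustrate the three compression types (again only a sketch, assuming the createWriter overload that takes a CompressionType):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SeqFileDemo {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // NONE stores records as-is, RECORD compresses each value
            // individually, BLOCK compresses runs of records together.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("demo.seq"), Text.class, IntWritable.class,
                SequenceFile.CompressionType.BLOCK);
            try {
              writer.append(new Text("key"), new IntWritable(1));
            } finally {
              writer.close();
            }
          }
        }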


      Also, as discussed in HADOOP-1881, it would be quite useful to maintain documentation per release, even on the hadoop website, i.e. we could have a main documentation page linking to the documentation for each release and for trunk.


      Thoughts?

        Attachments

      1. HADOOP-2046_1_20071018.patch
        65 kB
        Arun C Murthy
      2. HADOOP-2046_2_20071022.patch
        149 kB
        Arun C Murthy
      3. HADOOP-2046_3_20071023.patch
        161 kB
        Arun C Murthy
      4. HADOOP-2046_4_20071025.patch
        162 kB
        Arun C Murthy


          Activity

          Owen O'Malley added a comment -

          I'd also like to see:

          • side-effect files
          • 0-reduce jobs
          • secondary sort keys
          • allowable failure percents

          Arun C Murthy added a comment -

          Thanks Owen, I'll try and find an apt home for those kinds of details.


          Anything else, anyone? Please throw them here:

          • io.sort.mb
          • mapred.task.id

          Amar Kamat added a comment -

          I think there should also be a separate FAQ section, based on the most common questions asked by folks on the lists. For users it could include:

          • "I am getting heap errors, what should I do?"
          • etc.

          The list of questions can be discussed and then compiled. For the dev folks, the simplest way to answer a query would be to point to the relevant JIRA issue; these questions might be based not on what is commonly asked, but on input from senior developers.

          Enis Soztutar added a comment -

          I give a big +1 to a step-by-step mapred tutorial as part of the user manual. The tutorial could start with a mapper/reducer implementation and advance through other features such as InputFormat, RecordReader, etc.

          As for the developers' guide, I would like to see some UML diagrams for parts of the kernel, including a sequence diagram for task execution and class diagrams.

          Nigel Daley added a comment -

          +1.

          It would also be helpful to replace the outdated wiki page
          http://wiki.apache.org/lucene-hadoop/DevelopmentCommandLineOptions
          with a clearer, more detailed man page for the bin/hadoop user commands.

          Enis Soztutar added a comment -

          We see lots of newcomers' questions about what HDFS offers and how it performs compared to other (distributed) file system implementations. Maybe a [feature] comparison [table] would help users understand what hadoop is and what it isn't.

          Arun C Murthy added a comment -

          Here is a quick first cut: updated javadocs for Configuration (from HADOOP-1881), JobConf, JobClient, RunningJob and ClusterStatus...

          Doug Cutting added a comment -

          Overall this looks great. A few comments:

          • In Configuration.java, the first use of 'final' should be in italics, not bold, and the anchors in the headers should be done with <h4 id=foo>Foo</h4>. I also find that the links to String and Path mostly just introduce noise. We might make the first reference to Path a link, but leave the rest as plain text: no one is going to click on that link to find out what a Java String is, nor do we need more than a single link to Path.
          • In JobClient.java, the anchors should be implemented with 'id='. We should not mention HDFS here: the system directory could be in, e.g., KFS. I would also leave the internally used file names "job.jar" and "job.xml" out of this description. The list of things done should include 'submission of the job to the jobtracker'; the steps you list are all preparations for that, but we don't want to forget that crucial step. In the list of ways to handle job sequencing, it should be made clearer that these are alternatives: one should choose just one method. Also, should we mention the jobcontrol stuff here?
          • In JobConf.java: the JobConf isn't XML. It can be serialized as XML, but it's fundamentally a Map<String,String>, a Configuration. We also have anchors that should use 'id=' here, and mentions of HDFS that should instead just be of FileSystem (all FileSystems have a block size, which is used to generate splits). And, instead of 'default InputFormat', we should say 'standard file-based InputFormats'. We should probably also include something at the top level in this class about the determination of the job jar file.
          Doug Cutting added a comment -

          Upgrading to blocker. We shouldn't release 0.15.0 without these much-needed documentation improvements.

          Owen O'Malley added a comment -

          I agree that this is good overall. More items:

          • In Configuration, the proper way to get un-substituted values is with getRaw, not getObject, which is deprecated.
          • I'd add a better discussion of the set/getOutputValueGroupingComparator. Something like my message to hadoop-user on the topic:

            There is no guarantee of the reduce sort being stable in any sense. (With the non-deterministic order in which the map outputs become available to the reduce, it wouldn't make that much sense.)

            There certainly isn't enough documentation about what is allowed for sorting. I've filed HADOOP-1981 to expand the Reducer javadoc to mention the JobConf methods that can control the sort order. In particular, the methods are:

            setOutputKeyComparatorClass
            setOutputValueGroupingComparator

            The first comparator controls the sort order of the keys. The second controls which keys are grouped together into a single call to the reduce method. The combination of these two allows you to set up jobs that act like you've defined an order on the values.

            For example, say that you want to find duplicate web pages and tag them all with the url of the "best" known example. You would set up the job like:

            Map Input Key: url
            Map Input Value: document
            Map Output Key: document checksum, url pagerank
            Map Output Value: url
            Partitioner: by checksum
            OutputKeyComparator: by checksum and then decreasing pagerank
            OutputValueGroupingComparator: by checksum

            With this setup, the reduce function will be called exactly once for each checksum, but the first value from the iterator will be the one with the highest pagerank, which can then be used to tag the other entries of the checksum family.
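
            In JobConf terms, that setup might look roughly like the sketch below; the job, key, partitioner and comparator classes here are hypothetical stand-ins for code written against the composite (checksum, pagerank) key:

              JobConf conf = new JobConf(DedupJob.class);            // hypothetical job class

              // Composite map output key: (checksum, pagerank); value: url.
              conf.setMapOutputKeyClass(ChecksumAndRank.class);      // hypothetical WritableComparable
              conf.setMapOutputValueClass(Text.class);

              // Partition by checksum only, so all duplicates of a page
              // meet at the same reduce.
              conf.setPartitionerClass(ChecksumPartitioner.class);   // hypothetical Partitioner

              // Sort by checksum, then by decreasing pagerank.
              conf.setOutputKeyComparatorClass(ChecksumThenRankComparator.class);

              // Group by checksum alone: one reduce() call per checksum,
              // with the highest-pagerank url first in the iterator.
              conf.setOutputValueGroupingComparator(ChecksumComparator.class);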

          Arun C Murthy added a comment -

          Here is another stab at fixing javadocs for org.apache.hadoop.mapred... all user-interfaces.

          Arun C Murthy added a comment -

          Embellished the patch a tad.

          Arun C Murthy added a comment -

          Updated patch.

          I'll file another JIRA for some more documentation via Forrest, which lets these changes go into 0.15.0. The Forrest work is almost done too, but it doesn't have to block 0.15.0, since the hadoop website tracks trunk and can be updated as soon as that patch goes in.

          Arun C Murthy added a comment -

          A javadoc warning to fix.

          Doug Cutting added a comment -

          +1 This looks good to me. I will commit this unless someone objects.

          Doug Cutting added a comment -

          I just committed this. Thanks, Arun!

          Hudson added a comment -

          Integrated in Hadoop-Nightly #282 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/282/ )
          Hudson added a comment -

          Integrated in Hadoop-Nightly #283 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/283/ )

            People

            • Assignee:
              Arun C Murthy
            • Reporter:
              Arun C Murthy
            • Votes:
              0
            • Watchers:
              1
