Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: examples
    • Labels:
      None
    • Environment:

      ~200 nodes cluster
      Each node has the following configuration:
      Processors: 2 x Xeon L5420 2.50 GHz (Harpertown C0, 64-bit, quad-core; 8 cores per node)
      4 Disks
      16 GB RAM
      Linux 2.6
      Hadoop version: trunk

    • Hadoop Flags:
      Reviewed

      Description

      Running the RandomTextWriter example job (from the examples jar) pegs the machines' CPUs.

      1. MR712-1.patch
        5 kB
        Chris Douglas
      2. MR712-0.patch
        6 kB
        Chris Douglas

        Activity

        Khaled Elmeleegy added a comment -

        I used the following command line for the RandomTextWriter job:

        ./hadoop jar ../hadoop-0.21.0-dev-*examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=53687091200000 -D test.randomtextwrite.bytes_per_map=536870912 -D test.randomtextwrite.min_words_key=5 -D test.randomtextwrite.max_words_key=10 -D test.randomtextwrite.min_words_value=100 -D test.randomtextwrite.max_words_value=10000 -D mapred.output.compress=false -D mapred.map.output.compression.type=BLOCK -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat /gridmix/data/WebSimulationBlock

        The job has 100,000 maps and no reduces. I configured HDFS with a replication factor of 1 to eliminate network traffic. Nodes were configured with 16 map slots and 2 reduce slots. Each task was configured to have at most 512 MB of Java heap space. The job's output is ~50 TB, far larger than the overall cluster memory, forcing disk I/O.
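
        For reference, here is a minimal, hypothetical sketch of those settings using the 0.20/0.21-era property names; the slot counts are per-node tasktracker settings, so in practice they live in the daemons' configuration files rather than in job code:

        import org.apache.hadoop.conf.Configuration;

        public class ClusterSettingsSketch {
          // Illustrative only: mirrors the setup described above.
          public static Configuration describedSetup() {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 1);                          // no replication traffic
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 16);    // 16 map slots per node
            conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);  // 2 reduce slots per node
            conf.set("mapred.child.java.opts", "-Xmx512m");             // 512 MB task heap
            return conf;
          }
        }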

        Since the job does no computation other than writing to disk, one would expect it to be entirely I/O (disk) bound. Surprisingly, it turned out to be CPU bound.

        Measurements (using Chukwa):

        Across the cluster, the workers' CPUs were <5% idle on average. Disk bandwidth used was ~40 MB/s across all disks on all nodes in the cluster, which is close to the practical disk bandwidth limit. The network was virtually 100% idle, as one would expect.

        ~70% of the CPU time was spent in user space, suggesting the overhead is mainly in the map tasks.

        This suggests that there is a lot of CPU fat in the map tasks.

        Arun C Murthy added a comment -

        Khaled, do you have details on where the CPU is being consumed? Is it the map task? Did you profile the task to see where the CPU is being consumed?

        What about the datanode and the tasktracker?

        Khaled Elmeleegy added a comment -

        Well, top reports that ~170% (~1.7 CPUs) of the time is spent in the datanode, which makes sense, as it's receiving all these writes. The rest of the time is distributed evenly among the map tasks; that part doesn't sound right... too much fat.

        One thing to add: with a replication factor of 3, the bottleneck shifts to the network, no surprise there.

        Owen O'Malley added a comment -

        16 maps on 8 CPUs running gzip is expected to completely saturate the CPU. This is not a bug!!!

        Also check whether you were using the native codec. If you are using the Java codec, it will be very slow and CPU bound.

        Owen O'Malley added a comment -

        I notice now that you didn't have compression enabled. I wonder how much time you were spending in GC with such small heaps. That might explain the CPU load.

        Arun C Murthy added a comment -

        I wonder how much time you were spending in GC with such small heaps. That might explain the CPU load.

        Agreed. You have 5G of data per map (50TB/100k maps), which results in a significant number of output Text objects being created in RandomTextWriter (a potential bug).
        Thus, we'd get a lot of useful data out of profiling the tasks...
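
        One hedged way to gather that data: turn on GC logging in the task JVMs and enable Hadoop's built-in task profiling for a few maps. The property and JobConf method names below are the 0.20/0.21-era ones; the heap size and map range are assumptions for illustration only:

        import org.apache.hadoop.mapred.JobConf;

        public class ProfilingSetup {
          public static JobConf withGcLoggingAndProfiling(JobConf conf) {
            // GC logging for every task JVM; the output ends up in the task logs.
            conf.set("mapred.child.java.opts",
                     "-Xmx512m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
            // Profile a handful of map tasks with the JVM's built-in profiler.
            conf.setProfileEnabled(true);
            conf.setProfileTaskRange(true, "0-2");
            return conf;
          }
        }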

        Khaled Elmeleegy added a comment -

        Ah, I forgot to mention I turned off compression to make sure it's not a bottleneck. You can see that in the command line I used.

        Hong Tang added a comment -

        Besides profiling, I guess two other things you might want to try, to ensure the problem is not in either DFS or the specific format you are using:

        • Run the TestDFSIO benchmark to make sure the HDFS layer does not consume all the CPU.
        • Use a dummy writer that simply writes zeros to the files instead of wasting time generating random text keys and values (which uses StringBuffer and does UTF-8 conversions); a minimal sketch follows below.
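
        A hypothetical sketch of such a dummy writer, assuming the same split/driver machinery as RandomTextWriter (not shown here); the class name, record size, and property name are placeholders:

        import java.io.IOException;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class ZeroWriter extends Mapper<Text, Text, NullWritable, BytesWritable> {

          private static final int RECORD_SIZE = 64 * 1024;      // 64 KB per record
          // One zero-filled value, reused for every record: no per-record allocation.
          private final BytesWritable zeros = new BytesWritable(new byte[RECORD_SIZE]);

          @Override
          public void map(Text key, Text value, Context context)
              throws IOException, InterruptedException {
            long bytesToWrite = context.getConfiguration()
                .getLong("test.zerowriter.bytes_per_map", 512L * 1024 * 1024);
            for (long written = 0; written < bytesToWrite; written += RECORD_SIZE) {
              // No random text, no String/StringBuffer churn: just the write itself.
              context.write(NullWritable.get(), zeros);
            }
          }
        }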
        Khaled Elmeleegy added a comment -

        Well, 50TB/100k maps is 512 MB/map. I also tried 256 MB/map and the result is the same.

        Another interesting data point: I tried the read case, using the following command:

        ./hadoop jar ../hadoop-*test.jar loadgen -keepmap 0.2 -keepred 5 -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text -indir /gridmix/data/WebSimulationBlock -outdir /out

        My input files are those produced by the previous TextWriter job.

        Interestingly, in this case the disk is clearly the bottleneck. The workers' CPUs are ~50% idle plus ~20% waiting for I/O, i.e., the CPUs are only ~30% busy. All disks are virtually saturated.

        My guess is that the discrepancy between the read and the write cases is due to the datanode using sendfile for reads; maps don't/can't use sendfile for writes. I'll try to look deeper into this.

        Arun C Murthy added a comment -

        I'm starting to think this is caused purely by HDFS-297.

        Khaled Elmeleegy added a comment -

        Doesn't reading involve computing CRCs as well? If so, why would there be a significant discrepancy between read and write performance?

        Chris Douglas added a comment -

        RandomTextWriter is probably spending most of its CPU doing its work inefficiently, mostly in generateSentence and Text::encode. For each word, generateSentence generates a random number and appends a String to a StringBuffer; the buffer is then written out as a full String, encoded as Text, and finally written out after looking up the counters in the Context for that particular record. This process generates a lot of garbage, so Owen and Arun's hypothesis that we're spending an inordinate amount of time in GC seems well founded.

        The attached should be more sparing of the CPU. Would you mind confirming?
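
        For illustration only (this is not the attached patch), a minimal sketch contrasting the allocation-heavy pattern described above with a more frugal variant that reuses a single Text; the class, method, and word-list names are placeholders:

        import java.util.Random;
        import org.apache.hadoop.io.Text;

        public class SentenceSketch {
          private static final String[] WORDS = { "diurnalness", "Homoiousian", "spiranthic" };
          private static final Random RANDOM = new Random();

          // Allocation-heavy pattern: StringBuffer -> String -> new Text (re-encoded
          // to UTF-8), producing garbage on every call.
          static Text sentenceWithGarbage(int numWords) {
            StringBuffer sentence = new StringBuffer();
            for (int i = 0; i < numWords; i++) {
              sentence.append(WORDS[RANDOM.nextInt(WORDS.length)]);
              sentence.append(" ");
            }
            return new Text(sentence.toString());
          }

          // Leaner pattern: reuse one Text and append word bytes directly, avoiding
          // the String/StringBuffer round trip and the per-call Text allocation.
          // (ASCII words assumed; real code would pre-encode each word to UTF-8 once.)
          private static final Text REUSED = new Text();
          private static final byte[] SPACE = " ".getBytes();

          static Text sentenceReusingText(int numWords) {
            REUSED.clear();
            for (int i = 0; i < numWords; i++) {
              byte[] w = WORDS[RANDOM.nextInt(WORDS.length)].getBytes();
              REUSED.append(w, 0, w.length);
              REUSED.append(SPACE, 0, SPACE.length);
            }
            return REUSED;
          }
        }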

        Chris Douglas added a comment -

        Removed whitespace changes

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12413936/MR712-1.patch
        against trunk revision 795454.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/410/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/410/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/410/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/410/console

        This message is automatically generated.

        Khaled Elmeleegy added a comment -

        I tested the patch on the same configuration I used before. It improved things. According to Chukwa, the disk is now the bottleneck. However, the CPU is still ~60-70% busy.

        Hong Tang added a comment -

        Patch looks good. +1.

        Chris Douglas added a comment -

        I committed this.


          People

          • Assignee:
            Chris Douglas
          • Reporter:
            Khaled Elmeleegy
          • Votes:
            0
          • Watchers:
            3
