Hadoop Common
HADOOP-3062

Need to capture the metrics for the network I/O generated by DFS reads/writes and map/reduce shuffling, and break them down by racks

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: metrics
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Introduced additional log records for data transfers.

      Description

      In order to better understand the relationship between Hadoop performance and network bandwidth, we need to know
      the aggregate traffic in a cluster and its breakdown by racks. With this data, we can determine whether network
      bandwidth is the bottleneck when certain jobs are running on a cluster.

      1. 3062-5.patch
        22 kB
        Chris Douglas
      2. 3062-4.patch
        22 kB
        Chris Douglas
      3. 3062-3.patch
        21 kB
        Chris Douglas
      4. 3062-2.patch
        17 kB
        Chris Douglas
      5. 3062-1.patch
        17 kB
        Chris Douglas
      6. 3062-0.patch
        12 kB
        Chris Douglas

        Issue Links

          Activity

          Chris Douglas added a comment -

          The analysis should leverage HADOOP-3719, so this issue should cover the log4j appender emitting the HDFS and shuffling data. There are a few open questions and arguable assumptions:

          • Should this count bytes successfully transferred separately from failed transfers? Should failed transfers be logged at all?
          • The header/metadata/etc. traffic is assumed to be a negligible fraction of the total network traffic and irrelevant to the analysis for a particular job. The overall network utilization is also best measured using standard monitoring utilities that don't require any knowledge of Hadoop. This will focus only on tracking block traffic over HDFS (reads, writes, replications) and map output fetched during the shuffle.
          • For local reads, the source and destination IP will match. This should be sufficient to detect and discard during analysis of network traffic, but will not be sufficient to account for all reads from the local disk (counters and job history are likely better tools for this).
          • Accounting for topology (to break down by racks, etc.) is best deferred to the analysis. Logging changes in topology would also be helpful, though I don't know whether Hadoop has sufficient information to do this in the general case.
          • If job information is available (in the shuffle), should it be included in the entry? Doing this for HDFS is non-trivial, but would be invaluable to the analysis. I'm not certain how to do this, yet. Of course, replications and rebalancing won't include this, and HDFS reads prior to job submission (and all other traffic from JobClient) will likely be orphaned, as well.
          • Should this include start/end entries so one can infer how long the transfer took?
          • What about DistributedCache? Can it be ignored as part of the job setup, which is already omitted?

          In general, the format will follow:

          <log4j schema including timestamp, etc.> source: <src IP>, destination: <dst IP>, bytes: <bytes>, operation: <op enum>[, taskid: <TaskID>]
          

          Where <(src|dst) IP> is the IP address of the source and destination nodes, <bytes> is a long, and <op enum> is one of HDFS_READ, HDFS_WRITE, HDFS_COPY, and MAPRED_SHUFFLE. HDFS_REPLACE should be redundant if HDFS_COPY is recorded (I think). The rebalancing traffic isn't relevant to job analysis, but if one is including sufficient information to determine the duration of each transfer it may be interesting. The TaskID should be sufficient, but one could argue that including the JobID would be useful as a point to join on.
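          To make the proposed format concrete, a hypothetical parser for such entries might look like the sketch below. The field names and the op enum come from the schema above; the regular expression itself, and the assumption that the log4j prefix can simply be skipped over, are illustrative.

```python
import re

# Illustrative pattern for the proposed entry format. search() skips the
# log4j prefix (timestamp, level, logger); the taskid field is optional,
# matching the bracketed part of the schema above.
ENTRY = re.compile(
    r"source: (?P<src>[\d.]+), destination: (?P<dst>[\d.]+), "
    r"bytes: (?P<bytes>\d+), operation: (?P<op>HDFS_READ|HDFS_WRITE|"
    r"HDFS_COPY|MAPRED_SHUFFLE)(?:, taskid: (?P<taskid>\S+))?"
)

def parse_entry(line):
    """Return a dict of transfer fields, or None for non-trace lines."""
    m = ENTRY.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["bytes"] = int(fields["bytes"])
    return fields
```

          A downstream analysis would run every log line through such a function and keep only the non-None results.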

          Thoughts?

          Chris Douglas added a comment -

          First draft.

          Format:

          <log4j schema including timestamp, etc.> src: <src IP>, dest: <dst IP>, bytes: <bytes>, op: <op enum>, id: <DFSClient id|taskid>[, blockid: <block id>] 
          

          The patch adds the DFSClient clientName to OP_READ_BLOCK and changes the String in OP_WRITE_BLOCK from the path (which is unused) to the clientName. If this is set to DFSClient_<taskid> in map and reduce tasks, tracing the output of a job should be straightforward after some processing of each entry. Writes for replications (where the clientName is "") are logged as they have been; the logging in PacketResponder has been reformatted to fit the preceding schema. A few known issues:

          • The logging assumes the IP address is sufficient to distinguish a source, particularly for writes and in the shuffle
          • This logs to the DataNode and ReduceTask appenders; these entries should be directed elsewhere and disabled by default
          • In testing this, some entries in the read exhibited a strange property: the source and destination match, but neither matches the DataNode on which it is logged. I'm clearly missing something.

          I tried tracing a few blocks and map outputs through the logs, and those all made sense. That said, as mentioned in the last bullet, not all of the entries did.
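          As a hedged illustration of the per-task tracing described above, entries could be classified by their id field. The DFSClient_<taskid> convention is taken from the comment; the concrete task/attempt id patterns and the helper below are assumptions, not part of the patch.

```python
import re

# Map/reduce tasks set their DFSClient name to DFSClient_<taskid> (per the
# comment above); the task_/attempt_ id shapes here are illustrative.
TASK_CLIENT = re.compile(r"^DFSClient_(?P<taskid>(?:task|attempt)_\S+)$")

def owning_task(client_id):
    """Return the task id that owns a transfer, or None when the entry
    cannot be attributed to a task (replication, ad-hoc DFS clients)."""
    m = TASK_CLIENT.match(client_id)
    return m.group("taskid") if m else None
```

          Entries with an empty or anonymous DFSClient_<n> id would then be treated as replication or out-of-job traffic during analysis.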

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387698/3062-0.patch
          against trunk revision 683448.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/console

          This message is automatically generated.

          Lohit Vijayarenu added a comment -

          For this part of the description:

              and break them down by racks

          Is there any information logged about this?

          Chris Douglas added a comment -

          Is there any information logged about [breakdown by racks]?

          No, that's handled in the analysis. I don't think the datanodes or the reduce tasks know about topology, anyway.
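          Since the entries carry only IP addresses, the rack breakdown happens entirely in the analysis, as the reply above says. A minimal, hypothetical aggregation given an externally supplied IP-to-rack map (the mapping source and all names below are assumptions):

```python
from collections import defaultdict

def traffic_by_rack_pair(entries, rack_of):
    """Sum transferred bytes by (source rack, destination rack).

    entries: iterable of dicts with 'src', 'dst', 'bytes' keys, as parsed
    from client-trace log lines; rack_of: dict mapping IP -> rack name.
    Local transfers (src == dst) are dropped, per the earlier discussion.
    """
    totals = defaultdict(int)
    for e in entries:
        if e["src"] == e["dst"]:
            continue  # loopback transfer, not network traffic
        pair = (rack_of.get(e["src"], "unknown"),
                rack_of.get(e["dst"], "unknown"))
        totals[pair] += e["bytes"]
    return dict(totals)
```

          The rack map itself would come from the cluster's topology script or configuration, outside the scope of this logging change.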

          Chris Douglas added a comment -

          Patch was mauled by HADOOP-3935 and the second and third (HADOOP-3658) bullets should be addressed.

          Chris Douglas added a comment -

          Verified results with a randomwriter/sort run

          Tsz Wo Nicholas Sze added a comment -
          • Should we check ClientTraceLog.isInfoEnabled() before logging?
          • Should we define an AUDIT_FORMAT for the log messages, like FSNamesystem.AUDIT_FORMAT?
          • I think it might be worth creating a utility class, say org.apache.hadoop.log.AuditLog, so that we could put AUDIT_FORMAT, isInfoEnabled(), etc. inside it. Then, both DataNode and FSNamesystem could use it.
          Chris Douglas added a comment -

          Should we check ClientTraceLog.isInfoEnabled() before logging?

          Excluding the string concatenation to produce the actual message, the cost of each log message is low, or the message is infrequent (like the shuffle message). Excluding the new read log message, it's comparable to the logging that's already happening. I'm not certain whether the logging this replaces (for client writes) should occur when ClientTraceLog.isInfoEnabled() is false, since nothing would be logged in that case...

          Should we define an AUDIT_FORMAT for the log messages, like FSNamesystem.AUDIT_FORMAT?

          Unlike the FSNamesystem audit format, these are going to require some additional processing to be useful (e.g. the id param, optional block id), so the key/value pairing doesn't offer the same syntactical guarantees. That said, you're probably right, but unless we adopt a packaging like what you suggest in your following point, we'd introduce a link between hdfs and mapred. For now, with only these few messages, I don't think it gains much by being pulled out.

          I think it might be worth creating a utility class, say org.apache.hadoop.log.AuditLog, so that we could put AUDIT_FORMAT, isInfoEnabled(), etc. inside it. Then, both DataNode and FSNamesystem could use it.

          Agreed: it would be better if there were a more central location for Hadoop APIs exported through the logging interfaces, like audit logs and these metrics. If nothing else, it would let us know which messages have consumers (hence the uncertainty for logging client writes). That's likely part of a different patch, though.

          Tsz Wo Nicholas Sze added a comment -

          Got it. Let's work on a utility class in the future if there is a need.

          +1 the patch is good.

          Chris Douglas added a comment (edited) -

          Updated based on Nicholas's feedback, i.e. added isInfoEnabled guards around the appropriate log statements. Also removed the irrelevant replication log message. I'll commit this if Hudson doesn't object.

          Chris Douglas added a comment -
          • Moved mapred logging from ReduceTask to the TaskTracker
          • Changed HDFS_READ logging to record bytes actually read from datanode rather than bytes requested
          • Put *.clienttrace format into TaskTracker, DataNode
          Chris Douglas added a comment -
          • Added storageID to datanode string
          • Replaced redundant log message

          This probably needs only one more pass in review.

          Tsz Wo Nicholas Sze added a comment -

          +1 new patch looks good.

          Chris Douglas added a comment -
               [exec] -1 overall.
          
               [exec]     +1 @author.  The patch does not contain any @author tags.
          
               [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
          
               [exec]     -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.
          
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
          
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          

          Fixed a findbugs warning; the javadoc warning remains unrelated, and the patch passes unit tests.

          Chris Douglas added a comment -

          I just committed this.

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)

            People

            • Assignee:
              Chris Douglas
              Reporter:
              Runping Qi
            • Votes:
              0
              Watchers:
              1
