Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-4070

DFSClient ignores bufferSize argument & always performs small writes

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.3, 2.0.3-alpha
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Environment:

      RHEL 5.5 x86_64 (ec2)

    • Release Note:
      Allow write packet sizes to be configurable for DFS input streams

      Description

      The following code illustrates the issue at hand

       protected void map(LongWritable offset, Text value, Context context) 
          	        throws IOException, InterruptedException {
      			OutputStream out = fs.create(new Path("/tmp/benchmark/",value.toString()), true, 1024*1024); 
      			int i;
      			for(i = 0; i < 1024*1024; i++) {
      				out.write(buffer, 0, 1024);
      			}
      			out.close();
      			context.write(value, new IntWritable(i));
          	}
      

      This code is run as a single map-only task with an input file on disk and map-output to disk.

      # su - hdfs -c 'hadoop jar /tmp/dfs-test-1.0-SNAPSHOT-job.jar file:///tmp/list file:///grid/0/hadoop/hdfs/tmp/benchmark'

      In the data node disk access patterns, the following consistent pattern was observed irrespective of bufferSize provided.

      21119 read(58,  <unfinished ...>
      21119 <... read resumed> "\0\1\0\0\0\0\0\0\0034\212\0\0\0\0\0\0\0+\220\0\0\0\376\0\262\252ux\262\252u"..., 65557) = 65557
      21119 lseek(107, 0, SEEK_CUR <unfinished ...>
      21119 <... lseek resumed> )             = 53774848
      21119 write(107, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65024 <unfinished ...>
      21119 <... write resumed> )             = 65024
      21119 write(108, "\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux"..., 508 <unfinished ...>
      21119 <... write resumed> )             = 508
      

      Here fd 58 is the incoming socket, 107 is the blk file and 108 is the .meta file.

      The DFS packet size ignores the bufferSize argument and suffers from suboptimal syscall & disk performance because of the default 64kb value, as is obvious from the interrupted read/write operations.

      Changing the packet size to a more optimal 1056405 bytes results in a decent spike in performance, by cutting down on disk & network iops.

      Average time (milliseconds) for a 10 GB write as 10 files in a single map task

      timestamp 65536 1056252
      1350469614 88530 78662
      1350469827 88610 81680
      1350470042 92632 78277
      1350470261 89726 79225
      1350470476 92272 78265
      1350470696 89646 81352
      1350470913 92311 77281
      1350471132 89632 77601
      1350471345 89302 81530
      1350471564 91844 80413

      That is by average an increase from ~115 MB/s to ~130 MB/s, by modifying the global packet size setting.

      This suggests that there is value in adapting the user provided buffer sizes to hadoop packet sizing, per stream.

        Issue Links

          Activity

          Hide
          Gopal V added a comment -

          Redoing patch due to potential for buffer bloat & OOM on the data node end of the system.

          Will update patch with an upper-bound as well.

          Show
          Gopal V added a comment - Redoing patch due to potential for buffer bloat & OOM on the data node end of the system. Will update patch with an upper-bound as well.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12549678/HDFS-4090-dfs%2Bpacketsize.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3395//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3395//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12549678/HDFS-4090-dfs%2Bpacketsize.patch against trunk revision . +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3395//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3395//console This message is automatically generated.
          Hide
          Gopal V added a comment -

          dfs write packet size is currently a global tunable - turn this into a per-stream tunable

          Show
          Gopal V added a comment - dfs write packet size is currently a global tunable - turn this into a per-stream tunable
          Hide
          Gopal V added a comment -

          Since it was suggested to me that the 64kb number was related to the max TCP packet size, I tried to run a 3 node cluster (+1 name node) on EC2 (with hadoop 1.0.3) to pull in all the replication & network factors.

          The results are pretty similar - I planned to do 10 runs writing 64x1Gb of data, each map controlled by a single file with a single name (like TestDFSIO). Every map was data-local as there were 3 data nodes and replication level of 3.

          Six runs had at least one failed job, so they were skipped for results.

          timestamp 65536 1056252
          1350932059 945599 863490
          1350936449 998821 930685
          1350956097 1290333 1004663
          1350958583 1118328 1096637

          So this is not merely about the disk IOPS, but about the overall system doing fewer syscalls.

          Show
          Gopal V added a comment - Since it was suggested to me that the 64kb number was related to the max TCP packet size, I tried to run a 3 node cluster (+1 name node) on EC2 (with hadoop 1.0.3) to pull in all the replication & network factors. The results are pretty similar - I planned to do 10 runs writing 64x1Gb of data, each map controlled by a single file with a single name (like TestDFSIO). Every map was data-local as there were 3 data nodes and replication level of 3. Six runs had at least one failed job, so they were skipped for results. timestamp 65536 1056252 1350932059 945599 863490 1350936449 998821 930685 1350956097 1290333 1004663 1350958583 1118328 1096637 So this is not merely about the disk IOPS, but about the overall system doing fewer syscalls.
          Hide
          Gopal V added a comment -

          DFS client patch to support buffersize.

          PS: how do I write a test for this?

          Show
          Gopal V added a comment - DFS client patch to support buffersize. PS: how do I write a test for this?
          Hide
          Gopal V added a comment -

          I agree that the kernels should be doing IO merging and handle concurrent disk-writes well, because they have global knowledge of I/O patterns. Ideally, the userland code should just let that be and write large, page aligned chunks of data & let the kernel do the actual magic.

          In this case, the userland layer is actually being unnecessarily chatty on syscalls. I'd hazard a guess that the performance boost is simply due to the reduced number of syscalls & from letting the kernel do fewer wake-ups of threads running CPU bound code (check-sums and such).

          Before I benchmarked it, I was more annoyed about the fact that the system out-right overrides my buffer settings (a decent tunable, if you will).

          I don't think the 15% number directly translates into any other benchmark, this obviously highlights the DFS writes & without any other operations in the middle.

          This was tested on an RHEL 5.5 box on EC2 (with 4 EBS volumes backing hdfs).

          2.6.18-194.32.1.el5xen #1 SMP

          A real bare metal box would behave differently (I've ordered a new SSD backed box, it'll arrive in parts & get all working by next week I guess).

          In the mean time, if you can bench this on some real hardware, I'll have some foundation to my theories (beyond averaging stuff on EC2).

          Show
          Gopal V added a comment - I agree that the kernels should be doing IO merging and handle concurrent disk-writes well, because they have global knowledge of I/O patterns. Ideally, the userland code should just let that be and write large, page aligned chunks of data & let the kernel do the actual magic. In this case, the userland layer is actually being unnecessarily chatty on syscalls. I'd hazard a guess that the performance boost is simply due to the reduced number of syscalls & from letting the kernel do fewer wake-ups of threads running CPU bound code (check-sums and such). Before I benchmarked it, I was more annoyed about the fact that the system out-right overrides my buffer settings (a decent tunable, if you will). I don't think the 15% number directly translates into any other benchmark, this obviously highlights the DFS writes & without any other operations in the middle. This was tested on an RHEL 5.5 box on EC2 (with 4 EBS volumes backing hdfs). 2.6.18-194.32.1.el5xen #1 SMP A real bare metal box would behave differently (I've ordered a new SSD backed box, it'll arrive in parts & get all working by next week I guess). In the mean time, if you can bench this on some real hardware, I'll have some foundation to my theories (beyond averaging stuff on EC2).
          Hide
          Gopal V added a comment -

          My benchmark code.

          Or alternatively checkout git@gist.github.com:fe319436b880026cbad4.git

          Show
          Gopal V added a comment - My benchmark code. Or alternatively checkout git@gist.github.com:fe319436b880026cbad4.git
          Hide
          Andy Isaacson added a comment -

          I'm a little surprised that increasing buffer size from 64k to 1M would make a 15% difference in throughput. Could you post a complete example benchmark that demonstrates the effect? Your map() makes sense, but I don't want to flesh it out with code that "makes sense" to me but might be different from how you're testing.

          What kernel precisely are you observing this on? RHEL5 has merged a fair number of IO performance-bug-fixes over the years.

          Does the issue reproduce on a more modern kernel, such as RHEL6? (I would be happy to test this case for you given a working example.)

          Show
          Andy Isaacson added a comment - I'm a little surprised that increasing buffer size from 64k to 1M would make a 15% difference in throughput. Could you post a complete example benchmark that demonstrates the effect? Your map() makes sense, but I don't want to flesh it out with code that "makes sense" to me but might be different from how you're testing. What kernel precisely are you observing this on? RHEL5 has merged a fair number of IO performance-bug-fixes over the years. Does the issue reproduce on a more modern kernel, such as RHEL6? (I would be happy to test this case for you given a working example.)
          Hide
          Uma Maheswara Rao G added a comment -

          Ok, make sense. Thanks for working on this Gopal.

          Show
          Uma Maheswara Rao G added a comment - Ok, make sense. Thanks for working on this Gopal.
          Hide
          Gopal V added a comment -

          The property is global, which makes it impossible for a client to open streams of varying buffers.

          I am suggesting using the buffersize argument in fs.create() to produce a suitable packetsize for that stream.

          Right now for hadoop-1.x, my math says

          
          bufferSize &= ~bytesPerChecksum
          
          writePacketSize = (bufferSize - bytesPerChecksum) 
                            + checksumSize * ((bufferSize - bytesPerChecksum)/bytesPerChecksum) 
                            + bytesPerChecksum/checksumSize + PKT_HEADER_LEN + SIZE_OF_INTEGER
          

          which is where the 1056405 derives from 1048576 (i.e contains exactly aligned 1048576 byte writes to block & 8192 byte checksum writes).

          Working on a patch, will put it up for review.

          Show
          Gopal V added a comment - The property is global, which makes it impossible for a client to open streams of varying buffers. I am suggesting using the buffersize argument in fs.create() to produce a suitable packetsize for that stream. Right now for hadoop-1.x, my math says bufferSize &= ~bytesPerChecksum writePacketSize = (bufferSize - bytesPerChecksum) + checksumSize * ((bufferSize - bytesPerChecksum)/bytesPerChecksum) + bytesPerChecksum/checksumSize + PKT_HEADER_LEN + SIZE_OF_INTEGER which is where the 1056405 derives from 1048576 (i.e contains exactly aligned 1048576 byte writes to block & 8192 byte checksum writes). Working on a patch, will put it up for review.
          Hide
          Uma Maheswara Rao G added a comment -

          Gopal, I agree, we are ignoring the buffer size. also I have seen similar/same issue HDFS-3953

          But we have a configured property to increase packet size right?

           public static final String  DFS_CLIENT_WRITE_PACKET_SIZE_KEY = "dfs.client-write-packet-size";
            public static final int     DFS_CLIENT_WRITE_PACKET_SIZE_DEFAULT = 64*1024;
          

          are you suggesting that, use user provided buffer instead of packetsize configured? could you please clarify?
          And yes, having more smaller writes can hit the disk iops overhead, as we will allow concurrent disk writes.
          Worth revisiting the default sizes as Suresh suggested.

          Show
          Uma Maheswara Rao G added a comment - Gopal, I agree, we are ignoring the buffer size. also I have seen similar/same issue HDFS-3953 But we have a configured property to increase packet size right? public static final String DFS_CLIENT_WRITE_PACKET_SIZE_KEY = "dfs.client-write-packet-size" ; public static final int DFS_CLIENT_WRITE_PACKET_SIZE_DEFAULT = 64*1024; are you suggesting that, use user provided buffer instead of packetsize configured? could you please clarify? And yes, having more smaller writes can hit the disk iops overhead, as we will allow concurrent disk writes. Worth revisiting the default sizes as Suresh suggested.
          Hide
          Suresh Srinivas added a comment -

          Gopal, perhaps we can also revisit the default and change it as well.

          Show
          Suresh Srinivas added a comment - Gopal, perhaps we can also revisit the default and change it as well.

            People

            • Assignee:
              Unassigned
              Reporter:
              Gopal V
            • Votes:
              0 Vote for this issue
              Watchers:
              25 Start watching this issue

              Dates

              • Created:
                Updated:

                Development