Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode, performance
    • Labels:
      None

      Description

      In benchmarking HDFS-941 I noticed that for the random read workload, the FSDataset lock is highly contended. After converting it to a ReentrantReadWriteLock, I saw a ~25% improvement on both latency and ops/second.

      1. hdfs-1148-trunk.txt
        46 kB
        Todd Lipcon
      2. patch-HDFS-1148-rel0.20.2.txt
        12 kB
        Dave Thompson
      3. hdfs-1148-old.txt
        12 kB
        Todd Lipcon

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          389d 20h 15m 1 Todd Lipcon 06/Jun/11 06:15
          Patch Available Patch Available Open Open
          13h 2m 1 Todd Lipcon 06/Jun/11 19:18
          Yong Zhang made changes -
          Assignee Dave Thompson [ davet ] Yong Zhang [ zhangyongxyz ]
          Hide
          Yong Zhang added a comment -

          Yes, but I think we still have some apps with remote reading. Can you please assert this task to me or I will take it? and same for HDFS-3767.

          Show
          Yong Zhang added a comment - Yes, but I think we still have some apps with remote reading. Can you please assert this task to me or I will take it? and same for HDFS-3767 .
          Hide
          Dave Thompson added a comment -

          5 years out, I am not working on this, nor am I up on it's current relevance. Please close it out as you see fit, or take the assignee.

          Show
          Dave Thompson added a comment - 5 years out, I am not working on this, nor am I up on it's current relevance. Please close it out as you see fit, or take the assignee.
          Hide
          Todd Lipcon added a comment -

          I don't think this is nearly as useful anymore now that HBase and other apps that do a large number of random reads are usually using short-circuit read and thus avoiding any datanode code in their hot path. If we want to optimize the remote read random read rate, this would be useful for that purpose – just doesn't seem urgent.

          Show
          Todd Lipcon added a comment - I don't think this is nearly as useful anymore now that HBase and other apps that do a large number of random reads are usually using short-circuit read and thus avoiding any datanode code in their hot path. If we want to optimize the remote read random read rate, this would be useful for that purpose – just doesn't seem urgent.
          Hide
          Yong Zhang added a comment -

          Hi everyone, what is the status for this improvement?

          Show
          Yong Zhang added a comment - Hi everyone, what is the status for this improvement?
          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-3767 [ HDFS-3767 ]
          Todd Lipcon made changes -
          Component/s performance [ 12316501 ]
          Dave Thompson made changes -
          Assignee Todd Lipcon [ tlipcon ] Dave Thompson [ davet ]
          Todd Lipcon made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Hide
          Todd Lipcon added a comment -

          looks like this caused test timeouts: https://builds.apache.org/job/PreCommit-HDFS-Build/712/console

          cancelling path for further work. If anyone wants to work on it, feel free to reassign to yourself, not sure how much time I'll have to work on it.

          Show
          Todd Lipcon added a comment - looks like this caused test timeouts: https://builds.apache.org/job/PreCommit-HDFS-Build/712/console cancelling path for further work. If anyone wants to work on it, feel free to reassign to yourself, not sure how much time I'll have to work on it.
          Todd Lipcon made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          Todd Lipcon added a comment -

          Uploaded another copy of the diff that ignores whitespace at http://cloudera-todd.s3.amazonaws.com/1148-whitespace.txt – easier to read that way.

          Show
          Todd Lipcon added a comment - Uploaded another copy of the diff that ignores whitespace at http://cloudera-todd.s3.amazonaws.com/1148-whitespace.txt – easier to read that way.
          Todd Lipcon made changes -
          Attachment hdfs-1148-trunk.txt [ 12481537 ]
          Hide
          Todd Lipcon added a comment -

          Here's a patch against trunk.

          When combined with HDFS-941, this yields another 15-25% or so speed-up for parts of TestParallelRead on my machine. The CPU usage of the datanode increases from ~400% to ~750% (this is a good thing - it's using available cores better)

          Show
          Todd Lipcon added a comment - Here's a patch against trunk. When combined with HDFS-941 , this yields another 15-25% or so speed-up for parts of TestParallelRead on my machine. The CPU usage of the datanode increases from ~400% to ~750% (this is a good thing - it's using available cores better)
          Hide
          Todd Lipcon added a comment -

          As I was updating HDFS-941 to trunk tonight, I took the opportunity to look into the blocking behavior again. While running TestParallelRead (with N_ITERATIONS bumped up 10x) I ran:

          $ while true ; do jstack 3378 | grep -A2 BLOCK >> /tmp/blocked ; done
          

          and then when it was done:

          $ grep 'at ' /tmp/blocked  | sort | uniq -c | sort -nk1
                1         at java.lang.Object.wait(Native Method)
                6         at org.apache.hadoop.hdfs.DFSInputStream.getBlockRange(DFSInputStream.java:313)
               27         at org.apache.hadoop.hdfs.TestParallelRead$ReadWorker.read(TestParallelRead.java:142)
              137         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockInputStream(FSDataset.java:1286)
              183         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:100)
              220         at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:206)
              251         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1267)
          

          Three of the top four contention points are on the FSDataset monitor lock. The client-side DFSInputStream.getFileLength one is surprising, but not related to this particular JIRA.

          Show
          Todd Lipcon added a comment - As I was updating HDFS-941 to trunk tonight, I took the opportunity to look into the blocking behavior again. While running TestParallelRead (with N_ITERATIONS bumped up 10x) I ran: $ while true ; do jstack 3378 | grep -A2 BLOCK >> /tmp/blocked ; done and then when it was done: $ grep 'at ' /tmp/blocked | sort | uniq -c | sort -nk1 1 at java.lang. Object .wait(Native Method) 6 at org.apache.hadoop.hdfs.DFSInputStream.getBlockRange(DFSInputStream.java:313) 27 at org.apache.hadoop.hdfs.TestParallelRead$ReadWorker.read(TestParallelRead.java:142) 137 at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockInputStream(FSDataset.java:1286) 183 at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:100) 220 at org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:206) 251 at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1267) Three of the top four contention points are on the FSDataset monitor lock. The client-side DFSInputStream.getFileLength one is surprising, but not related to this particular JIRA.
          Hide
          Todd Lipcon added a comment -

          Todd, I want to understand what methods have lot contention

          I actually don't remember anymore - this was a while back that I saw this, and only once I added HDFS-941. Since it was a read workload, it makes sense that it would be getBlockInputStream, metaFileExists, getVisibleLength, and getMetaDataInputStream.

          I am not sure what you mean mostly uncontended, because I understand the problem description as there are lot of contentions

          Sorry, what I meant here is that, once it's converted to read-write lock, there is very little contention for the exclusive (write) lock. It's very rare to write small blocks, whereas small frequent reads come often from applications like HBase. So, we mostly see lots of readers and only the occasional writer.

          Show
          Todd Lipcon added a comment - Todd, I want to understand what methods have lot contention I actually don't remember anymore - this was a while back that I saw this, and only once I added HDFS-941 . Since it was a read workload, it makes sense that it would be getBlockInputStream, metaFileExists, getVisibleLength, and getMetaDataInputStream. I am not sure what you mean mostly uncontended, because I understand the problem description as there are lot of contentions Sorry, what I meant here is that, once it's converted to read-write lock, there is very little contention for the exclusive (write) lock. It's very rare to write small blocks, whereas small frequent reads come often from applications like HBase. So, we mostly see lots of readers and only the occasional writer.
          Hide
          Suresh Srinivas added a comment -

          Todd, I want to understand what methods have lot contention. Also you say:
          > As I understood it, the non-fair mode is more efficient. Since the lock is mostly uncontended

          I am not sure what you mean mostly uncontended, because I understand the problem description as there are lot of contentions...

          Show
          Suresh Srinivas added a comment - Todd, I want to understand what methods have lot contention. Also you say: > As I understood it, the non-fair mode is more efficient. Since the lock is mostly uncontended I am not sure what you mean mostly uncontended, because I understand the problem description as there are lot of contentions...
          Dave Thompson made changes -
          Attachment patch-HDFS-1148-rel0.20.2.txt [ 12474809 ]
          Hide
          Dave Thompson added a comment -

          For convenience, here's a port that will patch to release 0.20.2 of Todd's 1/21/11 patch.

          Show
          Dave Thompson added a comment - For convenience, here's a port that will patch to release 0.20.2 of Todd's 1/21/11 patch.
          Hide
          dhruba borthakur added a comment -

          I too have figured out lately that for these type of locks, non-fair mode is much better performance.

          Show
          dhruba borthakur added a comment - I too have figured out lately that for these type of locks, non-fair mode is much better performance.
          Hide
          Todd Lipcon added a comment -

          As I understood it, the non-fair mode is more efficient. Since the lock is mostly uncontended, I figured strict fairness wasn't important. Do you have some results that indicate the non-fair mode is a problem?

          Show
          Todd Lipcon added a comment - As I understood it, the non-fair mode is more efficient. Since the lock is mostly uncontended, I figured strict fairness wasn't important. Do you have some results that indicate the non-fair mode is a problem?
          Hide
          Dmytro Molkov added a comment -

          Todd, in your patch you are using ReentrantReadWriteLock in a Non-Fair mode, is this intentional?

          Show
          Dmytro Molkov added a comment - Todd, in your patch you are using ReentrantReadWriteLock in a Non-Fair mode, is this intentional?
          Sanjay Radia made changes -
          Link This issue is related to HDFS-1599 [ HDFS-1599 ]
          Todd Lipcon made changes -
          Attachment hdfs-1148-old.txt [ 12468998 ]
          Hide
          Todd Lipcon added a comment -

          Here's the old patch - it's against some old branch of mine, probably will need rework either for branch-20-append or for trunk

          Show
          Todd Lipcon added a comment - Here's the old patch - it's against some old branch of mine, probably will need rework either for branch-20-append or for trunk
          Hide
          dhruba borthakur added a comment -

          hi todd, if you have this patch, may we have a look? thanks.

          Show
          dhruba borthakur added a comment - hi todd, if you have this patch, may we have a look? thanks.
          Hide
          Todd Lipcon added a comment -

          pretty small in 0.20, but FSDataset is substantially reworked in trunk, haven't taken a look there. In general I just took a conservative approach, and made anything that might possibly change something use the write lock - even without being fancy it completely dropped this class off my jstacks.

          Show
          Todd Lipcon added a comment - pretty small in 0.20, but FSDataset is substantially reworked in trunk, haven't taken a look there. In general I just took a conservative approach, and made anything that might possibly change something use the write lock - even without being fancy it completely dropped this class off my jstacks.
          Hide
          dhruba borthakur added a comment -

          sounds like a good idea to me. how complicated is the codechange?

          Show
          dhruba borthakur added a comment - sounds like a good idea to me. how complicated is the codechange?
          Jeff Hammerbacher made changes -
          Field Original Value New Value
          Link This issue is related to HDFS-941 [ HDFS-941 ]
          Hide
          Todd Lipcon added a comment -

          YCSB output (same test setup as described in HDFS-941). This test is run with the HDFS-941 improvements plus FSDataset being converted to a readwrite lock.

          [OVERALL],RunTime(ms), 94325
          [OVERALL],Throughput(ops/sec), 10601.643254704479
          [READ], Operations, 1000000
          [READ], AverageLatency(ms), 3.747273
          [READ], MinLatency(ms), 0
          [READ], MaxLatency(ms), 1360
          [READ], 95thPercentileLatency(ms), 10
          [READ], 99thPercentileLatency(ms), 15
          [READ], Return=0, 1000000

          Show
          Todd Lipcon added a comment - YCSB output (same test setup as described in HDFS-941 ). This test is run with the HDFS-941 improvements plus FSDataset being converted to a readwrite lock. [OVERALL] ,RunTime(ms), 94325 [OVERALL] ,Throughput(ops/sec), 10601.643254704479 [READ] , Operations, 1000000 [READ] , AverageLatency(ms), 3.747273 [READ] , MinLatency(ms), 0 [READ] , MaxLatency(ms), 1360 [READ] , 95thPercentileLatency(ms), 10 [READ] , 99thPercentileLatency(ms), 15 [READ] , Return=0, 1000000
          Todd Lipcon created issue -

            People

            • Assignee:
              Yong Zhang
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

              • Created:
                Updated:

                Development