Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.94.4, 0.95.0
    • Component/s: Performance
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Improves read concurrency by changing HFileBlock#readAtOffset from synchronized to a reentrant lock; contending scanners fall back to preads rather than synchronized seek+read. Keeps all cores busy rather than just one.

      Description

      HBase grinds to a halt when many threads scan along the same set of blocks and neither read short circuiting nor block caching is enabled for the DFS client ... disabling the block cache makes sense on very large scans.

      It turns out that synchronizing on istream in HFileBlock.readAtOffset is the culprit.
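      For context, before this change the read path serialized all readers on the store file's single input stream, roughly like this (a simplified sketch, not the exact pre-patch code):

          synchronized (istream) {
            istream.seek(fileOffset);
            IOUtils.readFully(istream, dest, destOffset, size);
          }

      With many scanners walking the same blocks, all but one of them sit blocked on that monitor.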

      1. 7336-0.94.txt
        3 kB
        Lars Hofhansl
      2. 7336-0.96.txt
        3 kB
        Lars Hofhansl

        Issue Links

          Activity

          lhofhansl Lars Hofhansl added a comment -

          Here's my test case: 20m rows, single column family, single column, blockcache disabled for the scan, no HDFS short circuiting, all data fits into the OS buffer cache. Full scan over all rows.
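          For reference, a scan like that is typically set up along these lines (a sketch using the standard client API of that era; the table name and caching value are made up, and conf is assumed to come from HBaseConfiguration.create()):

              import org.apache.hadoop.hbase.client.HTable;
              import org.apache.hadoop.hbase.client.Result;
              import org.apache.hadoop.hbase.client.ResultScanner;
              import org.apache.hadoop.hbase.client.Scan;

              Scan scan = new Scan();
              scan.setCacheBlocks(false);   // keep the full scan out of the block cache
              scan.setCaching(1000);        // rows fetched per RPC
              HTable table = new HTable(conf, "testTable");  // hypothetical table name
              ResultScanner scanner = table.getScanner(scan);
              for (Result r : scanner) {
                // consume rows
              }
              scanner.close();
              table.close();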

          One client scanning: 15s (regionserver keeps one core busy ~120%)
          Two clients scanning along the same set of blocks: They both time out. (regionserver is at 5-6% CPU, clearly just waiting)

          Then I hacked readAtOffset to always do preads. Now:
          One client scanning: 39s (regionserver at ~120%)
          Two clients scanning: 39s each (regionserver at ~210%)

          With short circuiting enabled:
          One client: 15s (120% CPU)
          Two clients: 41s each (160% CPU)

          with pread:
          One client: 18s (160% CPU)
          Two clients: 19s each (250% CPU)

          lhofhansl Lars Hofhansl added a comment -

          I have a patch that changes that synchronized to a ReentrantLock and reverts to pread if the lock cannot be acquired.
          That way we get seek + read when possible and pread when necessary or requested. Note that this is only for scans; Gets default to pread anyway.

          With that patch I get the following behavior:
          One client: 15s
          Two clients: 22s each

          Which seems reasonable given the above numbers.

          I wouldn't really consider this a proper fix, although I also wouldn't know what the proper fix should be.
          The limiting factor here is a single FSInputStream per store file, which can be devastating for performance if many threads read the same store file and it does not fit into the block cache (or the scans chose not to cache the blocks).
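          The core of the change, stripped down (a simplified sketch; the full readAtOffset as it ended up is quoted further down in this thread):

              private final ReentrantLock streamLock = new ReentrantLock();

              // One shared stream per store file: take the lock opportunistically.
              if (!pread && streamLock.tryLock()) {
                try {
                  // Seek + read: better for sequential scans.
                  istream.seek(fileOffset);
                  IOUtils.readFully(istream, dest, destOffset, size);
                } finally {
                  streamLock.unlock();
                }
              } else {
                // Lock held by another reader (or pread requested): do a positional
                // read, which does not touch the shared stream position.
                istream.read(fileOffset, dest, destOffset, size);
              }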

          lhofhansl Lars Hofhansl added a comment -

          0.94 version of said patch.

          lhofhansl Lars Hofhansl added a comment -

          And a trunk version

          stack stack added a comment -

          Should the tryLock have a timeout on it?

          Otherwise the patch is an improvement for sure, worth committing till we do better.

          I wonder what the difference would be with more than one Reader.

          Looking at hdfs, each open goes to the namenode so we can't open a reader per scanner... it would cost a bit on setup? If we could distinguish short from long scans we could do pread on the shorts and open a reader per Scanner for the long runners?

          lhofhansl Lars Hofhansl added a comment -

          tryLock without a timeout returns immediately. I think in this case it should; waiting here would potentially cost (at least a thread context switch) on every single next or get request.
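          For reference, java.util.concurrent.locks.ReentrantLock offers both forms; the patch uses the non-blocking one:

              if (lock.tryLock()) { ... }                           // returns false immediately if the lock is held
              if (lock.tryLock(10, TimeUnit.MILLISECONDS)) { ... }  // waits up to the timeout, may throw InterruptedException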

          I think this issue would be most prominent in long scans, where the scans typically do not want to mess up the block cache.

          Multiple readers is probably the right solution. How many, though? And I am not sure how this would interact with the block cache (would probably be OK).

          stack stack added a comment -

          tryLock without a timeout returns immediately.

          Sorry. Thought for some reason it waited.

          How many, though? And I am not sure how this would interact with the block cache (would probably be OK).

          One per scanner? Can we look at a scanner and figure out somehow if it's long or short? If start/stop row? Limit? Maybe allow someone to pass a hint? Compactions should go get their own Reader?

          lhofhansl Lars Hofhansl added a comment -

          Compactions should go get their own Reader?

          That sounds like a safe and important improvement.

          In other cases it actually seems best to try to get a stream and fall back to pread if that fails.

          Could drive the # of readers by the size of the store file, something like a reader per n GB (n = 1 or 2 maybe). Then we round-robin the readers.
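          A rough sketch of that idea (hypothetical names, not an actual patch): size the reader pool by store file size and pick a reader round-robin:

              long perReaderBytes = n * 1024L * 1024L * 1024L;                  // n = 1 or 2 GB per reader
              int numReaders = Math.max(1, (int) (storeFileSize / perReaderBytes));
              FSDataInputStream[] readers = new FSDataInputStream[numReaders];  // opened up front
              AtomicLong next = new AtomicLong();

              FSDataInputStream pickReader() {
                return readers[(int) (next.getAndIncrement() % readers.length)];
              }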

          Should I commit this for now (assuming it passes HadoopQA and no objections), and we investigate other options further? Or discuss a bit more to see if we find other options?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12560514/7336-0.96.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          -1 javadoc. The javadoc tool appears to have generated 104 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 23 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestMultiParallel

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3490//console

          This message is automatically generated.

          lhofhansl Lars Hofhansl added a comment -

          TestMultiParallel passed locally.

          lhofhansl Lars Hofhansl added a comment -

          Any objections to committing this (0.94 and 0.96)? I'm pretty sure it won't make things worse, and it provably improves some scenarios.

          lhofhansl Lars Hofhansl added a comment - - edited

          Numbers with patch and read short circuit enabled for completeness:
          One client: 15s (104%)
          Two clients: 22s (180% CPU)

          xieliang007 Liang Xie added a comment -

          +1, LGTM, a nice test case, Lars

          stack stack added a comment -

          +1 on committing this for now.

          On Reader per long-running scanner, it will complicate the swapping in of new files on compaction but is probably worth figuring out. Let's file issues for further improvement.

          Good stuff Lars.

          lhofhansl Lars Hofhansl added a comment -

          Thanks for the reviews.

          hudson Hudson added a comment -

          Integrated in HBase-0.94 #625 (See https://builds.apache.org/job/HBase-0.94/625/)
          HBASE-7336 HFileBlock.readAtOffset does not work well with multiple threads (Revision 1421439)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK #3618 (See https://builds.apache.org/job/HBase-TRUNK/3618/)
          HBASE-7336 HFileBlock.readAtOffset does not work well with multiple threads (Revision 1421440)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          hudson Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #295 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/295/)
          HBASE-7336 HFileBlock.readAtOffset does not work well with multiple threads (Revision 1421440)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          lhofhansl Lars Hofhansl added a comment -

          Note that the previous tests all were scans that did not return anything to the client.
          Did another verification test with scans that return data to the client in blocks. Even in that case the performance is better, because of the improved concurrency.
          (Three concurrent clients went from 74s to 65s)

          lhofhansl Lars Hofhansl added a comment -

          TestHFileBlock.testConcurrentReading[1] is failing in the 0.94 test runs now (with OOMs).

          I do not think this is because of this change, but at the same time I do not see any other related changes.
          It does not fail locally.

          lhofhansl Lars Hofhansl added a comment -

          Running the test locally also seems to consume less memory with the patch. So I really have no explanation for why this test is failing now.

          lhofhansl Lars Hofhansl added a comment -

          The latest test run passed.

          Looking at testConcurrentReading, it is a time-bounded test. The number of blocks that the concurrent readers manage to read within the time bound is actually larger without this patch (i.e. this patch slows this test down)... Presumably because more threads are now using pread.
          (This test reads random blocks from many threads, randomly choosing pread vs. not. And these blocks are very small too. So not sure how real-world that is.)

          On the other hand, without this patch scanners that scan large HFiles concurrently in a tight loop are useless (i.e. never make enough progress to not time out).

          lhofhansl Lars Hofhansl added a comment -

          Somewhat tempted to revert this patch.

          lhofhansl Lars Hofhansl added a comment -

          Actually that test is not performance representative. When I revert this change and then have this test only do preads it is quite slow. If I have this test only do seek+read it is much faster. And that is even though the reads of these blocks are random (each thread on each iteration reads a random block), which should favor preads.

          My change then makes this slightly slower, because the likelihood of a pread is slightly higher.

          lhofhansl Lars Hofhansl added a comment -

          I'll revert this in 0.94 to see whether the OOMs will go away.
          (No such problem in 0.96 interestingly)

          hudson Hudson added a comment -

          Integrated in HBase-0.94 #632 (See https://builds.apache.org/job/HBase-0.94/632/)
          HBASE-7336 Revert due to OOMs on TestHFileBlock potentially caused by this. (Revision 1422767)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          lhofhansl Lars Hofhansl added a comment -

          The 0.94 tests fail with the same OOM even without this patch, so I am going to reapply. Sorry for the noise, but I had to make sure.

          lhofhansl Lars Hofhansl added a comment -

          I reapplied the patch.
          (the other anomaly of this test is that all data fits into a single HDFS block, so of course seek+read will be favored here)

          hudson Hudson added a comment -

          Integrated in HBase-0.94 #635 (See https://builds.apache.org/job/HBase-0.94/635/)
          HBASE-7336 Reapply, the OOMs were not caused by this. (Revision 1423084)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          hudson Hudson added a comment -

          Integrated in HBase-0.94-security #87 (See https://builds.apache.org/job/HBase-0.94-security/87/)
          HBASE-7336 Reapply, the OOMs were not caused by this. (Revision 1423084)
          HBASE-7336 Revert due to OOMs on TestHFileBlock potentially caused by this. (Revision 1422767)
          HBASE-7336 HFileBlock.readAtOffset does not work well with multiple threads (Revision 1421439)

          Result = SUCCESS
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java

          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java

          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          hudson Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #10 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/10/)
          HBASE-7336 Reapply, the OOMs were not caused by this. (Revision 1423084)
          HBASE-7336 Revert due to OOMs on TestHFileBlock potentially caused by this. (Revision 1422767)
          HBASE-7336 HFileBlock.readAtOffset does not work well with multiple threads (Revision 1421439)

          Result = FAILURE
          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java

          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java

          larsh :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
          vrodionov Vladimir Rodionov added a comment -

          I am looking into this stuff now, trying to figure out how to make parallel scans on a region more efficient. The code in AbstractFSReader looks dangerous and does not provide any benefit in terms of MT performance.

              protected int readAtOffset(FSDataInputStream istream,
                  byte[] dest, int destOffset, int size,
                  boolean peekIntoNextBlock, long fileOffset, boolean pread)
                  throws IOException {
                if (peekIntoNextBlock &&
                    destOffset + size + hdrSize > dest.length) {
                  // We are asked to read the next block's header as well, but there is
                  // not enough room in the array.
                  throw new IOException("Attempted to read " + size + " bytes and " +
                      hdrSize + " bytes of next header into a " + dest.length +
                      "-byte array at offset " + destOffset);
                }
          
                if (!pread && streamLock.tryLock()) {
                  // Seek + read. Better for scanning.
                  try {
                    istream.seek(fileOffset);
          
                    long realOffset = istream.getPos();
                    if (realOffset != fileOffset) {
                      throw new IOException("Tried to seek to " + fileOffset + " to "
                          + "read " + size + " bytes, but pos=" + realOffset
                          + " after seek");
                    }
          
                    if (!peekIntoNextBlock) {
                      IOUtils.readFully(istream, dest, destOffset, size);
                      return -1;
                    }
          
                    // Try to read the next block header.
                    if (!readWithExtra(istream, dest, destOffset, size, hdrSize))
                      return -1;
                  } finally {
                    streamLock.unlock();
                  }
                } else {
                  // Positional read. Better for random reads; or when the streamLock is already locked.
                  int extraSize = peekIntoNextBlock ? hdrSize : 0;
          
                  int ret = istream.read(fileOffset, dest, destOffset, size + extraSize);
                  if (ret < size) {
                    throw new IOException("Positional read of " + size + " bytes " +
                        "failed at offset " + fileOffset + " (returned " + ret + ")");
                  }
          
                  if (ret == size || ret < size + extraSize) {
                    // Could not read the next block's header, or did not try.
                    return -1;
                  }
                }
          
                assert peekIntoNextBlock;
                return Bytes.toInt(dest, destOffset + size + BlockType.MAGIC_LENGTH) +
                    hdrSize;
              }
          

          Positional reads in FSInputStream (DFSInputStream) are heavily synchronized: it locks on the stream, then seeks and reads, then unlocks. Here is the code for FSInputStream:

            @Override
            public int read(long position, byte[] buffer, int offset, int length)
              throws IOException {
              synchronized (this) {
                long oldPos = getPos();
                int nread = -1;
                try {
                  seek(position);
                  nread = read(buffer, offset, length);
                } finally {
                  seek(oldPos);
                }
                return nread;
              }
            }
          

          DFSInputStream extends FSInputStream but does not override the above method. Taking into account that the code is synchronized, it is hard to explain the observed performance improvement published in this JIRA.

          vrodionov Vladimir Rodionov added a comment -

          Upd. The code is not dangerous - it just does not do what it was supposed to do.

          lhofhansl Lars Hofhansl added a comment -

          The trylock stuff is from the patch here:

          1. try to do seek + read (if requested, i.e. a scan)
          2. if that is not possible, rather than blocking on the lock, do a pread immediately.

          My tests showed a significant improvement; you can always verify yourself. Note that this was 1.5 years ago.

          Curious... If what you say is true there would be no difference at all between seek+read and pread. That would indeed be bad. Hmm.

          vrodionov Vladimir Rodionov added a comment -

          I was not right, Lars. DFSInputStream overrides positional read - no locks. But there is something else ...

          There is not much sense in allowing one random scanner to run in stream mode, since there is no guarantee that the next call to read an HFile block from the "lucky" scanner will use the same streaming API and that the pre-cached data will still be valid. Some other scanner might have dumped this data before. Correct?

          You may try all preads, for all scanners, and compare performance. I bet it will be close to what we have right now.

          lhofhansl Lars Hofhansl added a comment -

          DFSInputStream does override this method (checked Hadoop-trunk). The overridden method directly calculates the offset without locking as it should. All is good on this front.

          lhofhansl Lars Hofhansl added a comment -

          I agree with your assessment on seek+read vs pread. The current code should not be worse than doing all preads, though.

          I see you commented on HBASE-5979 as well, and that would be a correct way to fix this. Each scanner would have its own stream and hence seek+read should be better there.
          The issue there is invalidation after a compaction or flush, although that needs some fixing anyway - I tried (unsuccessfully) in HBASE-10060. The issue there is that even the memory barriers taken for a lock that is uncontended 99.9999% of the time are a significant performance problem, and I have not been able to remove them and still ensure correct behavior.

          I'm glad you're looking at this, this area needs some TLC.

          vrodionov Vladimir Rodionov added a comment -

          Effect of compaction on large scan operations and vice versa

          The compaction scanner and the user scanner will compete for the same input stream (DFSInputStream). This results in suboptimal performance for both of them, because there is no guarantee that the next call to read an HFile block from the "lucky" scanner will use the same streaming API and that the pre-cached data will still be valid. Yep? Both scanners periodically switch between stream/pread API calls, the HDFS cache cannot be used, and the performance of both is defined by positional read performance (which is low for scan-mode operation).

          Is this correct assessment?

          vrodionov Vladimir Rodionov added a comment -

          The DFSInputStream class is heavily synchronized (at least in HDFS 2.2), and regardless of the read op type (stream, positional) all readers will eventually be waiting on a single lock. This is what I see in my local tests.

          1 scanner - 14 sec
          2 scanners - 36 sec (!!!)
          4 scanners - too long to be true.

          I have no explanation yet, but something is wrong here.

          vrodionov Vladimir Rodionov added a comment -

          Forgot to mention:

          HBase 0.98.3 on hadoop2. All tests are in local mode with the HBase mini cluster.

          stack stack added a comment -

          Vladimir Rodionov Can you try on hdfs?

          vrodionov Vladimir Rodionov added a comment -

          I monitor thread stack traces during the test run. Usually just one thread (scanner) is running; all others are waiting on DFSInputStream in various places (as I said, too many synchronized methods). This is HDFS, not HBase.

          xieliang007 Liang Xie added a comment -

          Yes, I observed a similar problem. A long time ago I had a rough idea to implement a multi-stream/multi-reader prototype; maybe I can share the patch once it is ready.

          lhofhansl Lars Hofhansl added a comment -

          "Local mode" is not the same. It does not actually run HDFS, but just pretty ad hoc DFS wrapper.
          Tests are only valid when tested again real HDFS - even when HDFS is in single node mode.

          With actual HDFS I have not observed your numbers Vladimir Rodionov.

          vrodionov Vladimir Rodionov added a comment -

          The HBase mini cluster runs HDFS, and the code path to access files/data is the same as in real cluster mode, I think. At least PackageSender/PackageReceiver(s) and all the IPC stuff for the NN and DN are present in the stack traces.

          vrodionov Vladimir Rodionov added a comment -

          Liang Xie, I am working on efficient multiple-scanner support for a single store file as well. We need this for several purposes:

          • Improve single task granularity during query execution (currently, it is a single region)
          • Improve scanner performance during compaction(s)
          • Improve compaction performance during normal operations.

          The patch will be submitted soon when we get all testing done.

          lhofhansl Lars Hofhansl added a comment -

          Vladimir Rodionov, curious about how you will find good splitpoints inside a region. Regions can be assumed to be roughly of equal size (in terms of bytes, not rows), but inside a region the distribution of keys can be arbitrarily skewed, and hence unless you have more state you cannot find good splits inside a region.
          (the region split points actually are a very rough histogram for data distribution)

          vrodionov Vladimir Rodionov added a comment -

          Lars Hofhansl

          The major priority right now is to improve compaction and normal operations during compaction. Sure, we need to track region stats to make optimal inter-region splits, but even without such stats we can decrease data skew significantly.

          lhofhansl Lars Hofhansl added a comment -

          Vladimir Rodionov, I have a quite simple patch that would allow compactions to have their own, private readers. With what you've seen, would that be a benefit?

          vrodionov Vladimir Rodionov added a comment -

          Lars Hofhansl

          Yes that would be helpful for some use cases (compaction + 1 application scanner). There is a (not so simple) patch in HBASE-12031 that fixes the issue for all use cases (compaction + N application scanners) but it seems that nobody has tried it yet. Phoenix parallel intra region scanners would benefit from HBASE-12031 the most.

          lhofhansl Lars Hofhansl added a comment -

          See patch in HBASE-12411, relatively easy way to have compactions work with their own scanners.


            People

            • Assignee:
              lhofhansl Lars Hofhansl
              Reporter:
              lhofhansl Lars Hofhansl
            • Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development