HBase / HBASE-11368

Multi-column family BulkLoad fails if compactions go on too long

    Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Compactions take the region read lock. Before bulk loading into a multi-column-family region, we take a write lock on the region. If the compaction holds the read lock for too long, the bulk load cannot acquire the write lock in time and fails.

      Various recipes include:
      + Making smaller regions (lame)
      + Victor Xu suggests major compacting just before bulk loading, over in HBASE-10882, as a workaround.

      Does the compaction need a read lock for that long? Does the bulk load need a full write lock when there are multiple column families? Can we at least fail more gracefully?
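      For illustration only, here is a minimal, self-contained Java sketch of the contention pattern (not HBase code; the class and method names are invented): a long compaction holds the read lock, the multi-CF bulk load times out waiting for the write lock, and reads queue behind the waiting writer in the meantime.

        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Toy model of the region lock; illustrates why a long compaction (read lock)
        // makes a multi-CF bulk load (write lock) time out.
        class RegionLockContentionSketch {
          private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

          // Compaction holds the read lock for its whole, potentially very long, run.
          void compact() throws InterruptedException {
            lock.readLock().lock();
            try {
              Thread.sleep(TimeUnit.MINUTES.toMillis(10)); // stand-in for a long compaction
            } finally {
              lock.readLock().unlock();
            }
          }

          // A multi-CF bulk load wants the write lock; if the compaction outlasts the
          // timeout, the load fails (analogous to RegionTooBusyException).
          void bulkLoadAllFamilies() throws InterruptedException {
            if (!lock.writeLock().tryLock(60, TimeUnit.SECONDS)) {
              throw new IllegalStateException("failed to get a lock in 60000 ms");
            }
            try {
              // move HFiles into place for every column family atomically
            } finally {
              lock.writeLock().unlock();
            }
          }
        }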

      1. hbase11368-master.patch
        4 kB
        Qiang Tian
      2. key_stacktrace_hbase10882.TXT
        11 kB
        Qiang Tian
      3. performance_improvement_verification_98.5.patch
        7 kB
        Qiang Tian
      4. hbase-11368-0.98.5.patch
        4 kB
        Qiang Tian

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          -1 patch 0m 5s HBASE-11368 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/latest/precommit-patchnames for help.



          Subsystem Report/Notes
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12677543/hbase11368-master.patch
          JIRA Issue HBASE-11368
          Powered by Apache Yetus 0.1.0 http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-HBASE-Build/331/console

          This message was automatically generated.

          Jerry He added a comment -

          This ticket can be closed because the only sub-task would fix the problem. Right?

          Nick Dimiduk added a comment -

          Any current ongoing scan will not be able to see the bulk loaded hfiles which are loaded just after the current scan has started. I think that behaviour should be acceptable, right?

          I believe this should be correct behavior, yes.

          ramkrishna.s.vasudevan added a comment -

          This comment https://issues.apache.org/jira/browse/HBASE-11368?focusedCommentId=14693166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14693166 is getting addressed as part of HBASE-13082. So doing that JIRA would mean that any current ongoing scan will not be able to see the bulk loaded hfiles which are loaded just after the current scan has started. I think that behaviour should be acceptable, right?

          Nick Dimiduk added a comment -

          FYI, opened subtask HBASE-14575 to give Devaraj Das's idea a spin. Mind having a look?

          stack added a comment -

          Nick Dimiduk I like the idea of narrowing the lock scope, but I started to look and it's a bit of a rat's nest where locks are held (compactions checking on each row seems well dodgy...). Yeah, a review of the attempt at undoing scanner locks so there is only a region-level lock sounds like it would help.

          Lars Hofhansl added a comment - - edited

          See also discussion in HBASE-13082. (somewhat related, but talks about the locks in StoreScanner that we need to safely reset the scanner stack)

          Nick Dimiduk added a comment -

          Looks like HBASE-6028 wants to implement the meat of what I've proposed above. It also happens to be 2/3 of the work for HBASE-12446. Seems like good bang for the buck on this approach.

          Chatting with Enis Soztutar and Devaraj Das about this offline. Another idea is we can reduce the scope of when the read lock is held during compaction. In theory the compactor only needs a region read lock while deciding what files to compact and at the time of committing the compaction. We're protected from the case of region close events because compactions are checking (between every Cell!) if the store has been closed in order to abort in such a case. Is there another reason why we would want to hold the read lock for the entire duration of the compaction? stack Lars Hofhansl?

          Nick Dimiduk added a comment -

          Check my facts:

          1. the only uses of the write lock are region close and bulkload, two rare events.
          2. the only long-running read lock holders are compactions, frequent events.

          What if we allow write lock access requests to interrupt running compactions?
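          A hedged sketch of what that could look like (invented names, not existing HBase code): the compaction periodically checks whether an exclusive-lock requester is waiting and aborts early, so the rare write-lock operation does not starve behind it.

            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.atomic.AtomicBoolean;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Hypothetical sketch of "write lock requests interrupt running compactions".
            class InterruptibleCompactionSketch {
              private final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();
              private final AtomicBoolean writerWaiting = new AtomicBoolean(false);

              void compact(int totalCells) {
                regionLock.readLock().lock();       // held for the whole compaction today
                try {
                  for (int cell = 0; cell < totalCells; cell++) {
                    if (writerWaiting.get()) {
                      return;                       // abort: a close/bulk load wants the write lock
                    }
                    // ... copy one cell into the compacted file ...
                  }
                } finally {
                  regionLock.readLock().unlock();   // an early return releases the lock promptly
                }
              }

              void closeOrBulkLoad() throws InterruptedException {
                writerWaiting.set(true);            // ask long-running readers to back off
                try {
                  if (regionLock.writeLock().tryLock(60, TimeUnit.SECONDS)) {
                    try {
                      // ... the rare exclusive operation ...
                    } finally {
                      regionLock.writeLock().unlock();
                    }
                  }
                } finally {
                  writerWaiting.set(false);
                }
              }
            }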

          Esteban Gutierrez added a comment -

          Hey Qiang Tian are you still working on this?

          Also, I agree with Enis Soztutar that ref-counting might be an alternative, but I couldn't find the JIRA for that; any pointer, Lars Hofhansl?

          Enis Soztutar added a comment -

          How? Would the following case be true without the bulk load getting the region write lock?

          We do not do this now, but in theory it can be done similarly to regular writes.

          • obtain new seqId as a write transaction
          • bulk load all files across CFs with the seqId.
          • advance mvcc read point only when all bulk loads are complete.
            This way the scanners are guaranteed to observe the bulk loaded data atomically, without the region-write-lock.
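            For concreteness, a compilable toy sketch of those three steps; Mvcc and Store here are stand-ins, not the real HBase MultiVersionConcurrencyControl or HStore API.

              import java.util.List;
              import java.util.Map;
              import java.util.concurrent.atomic.AtomicLong;

              // Toy sketch of the seqId/mvcc commit described above.
              class MvccBulkLoadSketch {
                static class Mvcc {
                  private final AtomicLong nextSeqId = new AtomicLong(1);
                  private final AtomicLong readPoint = new AtomicLong(0);
                  long beginWrite() { return nextSeqId.getAndIncrement(); } // step 1: seqId as a write txn
                  void complete(long seqId) { readPoint.set(seqId); }       // step 3: advance the read point
                  long getReadPoint() { return readPoint.get(); }           // scanners only see <= this
                }

                interface Store { void bulkLoadHFiles(List<String> hfilePaths, long seqId); }

                private final Mvcc mvcc = new Mvcc();

                void bulkLoadAtomically(Map<Store, List<String>> hfilesByFamily) {
                  long seqId = mvcc.beginWrite();
                  for (Map.Entry<Store, List<String>> e : hfilesByFamily.entrySet()) {
                    e.getKey().bulkLoadHFiles(e.getValue(), seqId);         // step 2: every CF tagged with seqId
                  }
                  mvcc.complete(seqId);                                     // readers never observe a partial load
                }
              }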

            In the 0.98 code line, we don't have seqid, and the atomicity is still guaranteed there.

            Yes. Not worth changing 0.98 line.

            I think it is being propagated properly to the scanner. Think about the same notifyChangedReadersObservers is being used at the end of compaction and flushes as well. The reset of the readers should work.

            I am not sure about that. Agreed that the cells at the store level will actually get re-ordered, but the heap at the region level is never re-ordered. So, after a bulk load, the ordering of store scanners at the region level might change, but the scanner will miss it if I understand this correctly.

            Atomicity may be a false blanket considering HBASE-4652 is still unresolved.

            Very good point. We need a transactional commit for the BL files.

          Jerry He added a comment -

          You are right, Nick. That is the unsolved issue.

          Nick Dimiduk added a comment -

          Atomicity may be a false blanket considering HBASE-4652 is still unresolved.

          Jerry He added a comment -

          but that should be guaranteed with the seqId / mvcc combination and not via region write lock.

          How? Would the following case be true without the bulk load getting the region write lock?
          a. the bulk load obtains a seqId
          b. a read request comes in and gets that seqId as its mvcc read point.
          c. the read will be able to see the partially loaded data while the bulk load is still in progress

          In the 0.98 code line, we don't have seqid, and the atomicity is still guaranteed there.

          On bulk load, we call HStore.notifyChangedReadersObservers(), which resets the KVHeap, but we never reset the RegionScanner from my reading of code. Is this a bug?

          I think it is being propagated properly to the scanner. Think about the same notifyChangedReadersObservers is being used at the end of compaction and flushes as well. The reset of the readers should work.

          I think the region write lock is still the only guarantee for bulk load atomicity. At a high level, the region scan and next calls are within the region read lock, which is mutually exclusive with the bulk load process, which needs the region write lock. This is heavy.

          Enis Soztutar added a comment -

          I was reading HBASE-4552 and RegionScannerImpl code again to try to understand why we need the write lock for multi-CF bulk loads in the first place. It seems that it was put there to ensure atomicity, but that should be guaranteed with the seqId / mvcc combination and not via region write lock. However, the bulk load files obtain a seqId, and acquiring the region write lock will block all flushes which may be the reason. On bulk load, we call HStore.notifyChangedReadersObservers(), which resets the KVHeap, but we never reset the RegionScanner from my reading of code. Is this a bug? The current scanners should not see the new bulk loaded data (via mvcc) so maybe it is ok?

          Enis Soztutar added a comment -

          My concern with the patch is that it is acquiring yet another lock per get/scan on top of the already existing ones. Agreed that the region close lock is abused here for multi-CF bulkloads and has to be fixed.

          I believe the actual long term solution to this is to do ref-counting of store files in the store, and have the store file list per scan be immutable. Then we do not need the costly mechanism for keeping the store files updated between the KVHeap, scanner and store file list (notifyChangedReadersObservers). leveldb is doing ref counting for files, I believe. Lars Hofhansl you had a jira for this?
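          A rough illustration of the ref-counting idea, with invented names (these are not HBase classes): each scan pins an immutable snapshot of the store file list, writers publish a new list, and a file becomes deletable only when no snapshot references it.

            import java.util.ArrayList;
            import java.util.Collections;
            import java.util.List;
            import java.util.concurrent.atomic.AtomicInteger;

            // Hypothetical sketch of ref-counted store files with immutable per-scan snapshots.
            class RefCountedStoreFilesSketch {
              static class StoreFileRef {
                final String path;
                final AtomicInteger refs = new AtomicInteger(0);
                StoreFileRef(String path) { this.path = path; }
              }

              private volatile List<StoreFileRef> current = Collections.emptyList();

              synchronized List<StoreFileRef> openScanSnapshot() {   // scan start: pin the current set
                for (StoreFileRef f : current) {
                  f.refs.incrementAndGet();
                }
                return current;
              }

              void closeScanSnapshot(List<StoreFileRef> snapshot) {
                for (StoreFileRef f : snapshot) {
                  if (f.refs.decrementAndGet() == 0 && !current.contains(f)) {
                    // not in the live list and no scanner holds it: safe to archive/delete
                  }
                }
              }

              synchronized void commitBulkLoad(List<StoreFileRef> newFiles) {
                List<StoreFileRef> next = new ArrayList<>(current);  // publish a new immutable list;
                next.addAll(newFiles);                               // existing scans keep their old snapshot,
                current = Collections.unmodifiableList(next);        // so no notifyChangedReadersObservers needed
              }
            }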

          Stephen Yuan Jiang added a comment -

          Qiang Tian and stack, any update or concern on this patch? We have a customer seeing this issue recently.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12677543/hbase11368-master.patch
          against trunk revision .
          ATTACHMENT ID: 12677543

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 checkstyle. The applied patch generated 3781 checkstyle errors (more than the trunk's current 3780 errors).

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 lineLengths. The patch does not introduce lines longer than 100

          +1 site. The mvn site goal succeeds with this patch.

          -1 core tests. The patch failed these unit tests:
          org.apache.hadoop.hbase.client.TestHCM
          org.apache.hadoop.hbase.master.TestSplitLogManager

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
          Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//artifact/patchprocess/checkstyle-aggregate.html

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11489//console

          This message is automatically generated.

          Qiang Tian added a comment -

          patch for master branch

          Qiang Tian added a comment -

          the attachments:
          key_stacktrace_hbase10882.TXT : the problem stacktrace
          hbase-11368-0.98.5.patch : the fix
          performance_improvement_verification_98.5.patch: the testcase to verify performance improvement

          Qiang Tian added a comment -

          Hi stack,
          Sorry for the confusion. Let me explain from scratch:
          1) The root cause of the problem - HRegion#lock.
          From the stacktrace in HBASE-10882 (also see key_stacktrace_hbase10882.TXT attached), the event sequence is:
          1.1) the compaction acquires the readlock of HRegion#lock,
          1.2) the bulkload tries to acquire the writelock of HRegion#lock if there are multiple CFs; it has to wait for the compaction to release the readlock,
          1.3) scanners try to acquire the readlock of HRegion#lock; they have to wait for the bulkload to release the writelock.
          So both the bulkload and the scanners are blocked on HRegion#lock by the compaction.

          2) What is HRegion#lock used for?
          Investigation of HRegion#lock shows it was originally designed to protect region close ONLY: if someone, such as a region split, wants to close the region, it needs to wait for others to release the readlock.
          Then HBASE-4552 used the lock to solve the multi-CF bulkload consistency issue. Now we see it is too heavy.

          3) Can we avoid using HRegion#lock in bulkload?
          The answer is yes.
          Internally, HStore#DefaultStoreFileManager#storefiles keeps track of the on-disk HFiles for a CF. Bulkload takes the steps below:
          3.1) move the HFiles directly to the region directory,
          3.2) add them to the storefiles list,
          3.3) notify StoreScanner that the HFile list has changed, which is done by resetting StoreScanner#heap to null; this forces existing StoreScanner instances to reinitialize from the new HFiles seen on disk at the next scan/read request.
          Steps 3.2 and 3.3 are synchronized by HStore#lock, so we have CF-level scan-bulkload consistency.

          To achieve multi-CF scan-bulkload consistency without HRegion#lock, we still need another region-level lock: a RegionScanner is composed of multiple StoreScanners, and a StoreScanner (a CF scanner) is composed of a MemStoreScanner and multiple StoreFileScanners.

          RegionScannerImpl#storeHeap (and joinedHeap) is just the entry point into the multiple StoreScanners. To have multi-CF consistency we need synchronization here: a lock is needed, but it is only contended between scan and bulkload (see the sketch below).
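          For concreteness, a hedged sketch of such a region-level scan/bulkload lock (class and method names invented for illustration; the attached patch is the authoritative change): bulkload takes the write side around the per-CF loads, and the scanner takes the read side while rebuilding its stack of StoreScanners, so compactions never contend on it.

            import java.util.List;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Hypothetical sketch of a region-level scan/bulkload lock.
            class ScanBulkLoadLockSketch {
              private final ReentrantReadWriteLock scanBulkLoadLock = new ReentrantReadWriteLock();

              interface Store { void bulkLoadHFile(String hfilePath); }
              interface StoreScannerLike { void reset(); /* rebuild from current store files */ }

              // Multi-CF bulk load: publish all CFs' files before any scanner can rebuild.
              void bulkLoadHFiles(List<Store> stores, List<String> paths) {
                scanBulkLoadLock.writeLock().lock();
                try {
                  for (int i = 0; i < stores.size(); i++) {
                    stores.get(i).bulkLoadHFile(paths.get(i));   // per-CF move + storefiles update
                  }
                } finally {
                  scanBulkLoadLock.writeLock().unlock();
                }
              }

              // Scanner: rebuild its StoreScanners under the read side, so it sees either
              // none or all of a multi-CF bulk load. Compactions never touch this lock,
              // so a long compaction no longer blocks bulk load or reads on it.
              void resetScannerStack(List<StoreScannerLike> storeScanners) {
                scanBulkLoadLock.readLock().lock();
                try {
                  storeScanners.forEach(StoreScannerLike::reset);
                } finally {
                  scanBulkLoadLock.readLock().unlock();
                }
              }
            }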

          Regarding the code change you referenced, performance_improvement_verification_98.5.patch simulates the event sequence described in #1, for testing purposes only.

          Currently I use 0.98.5 for testing since it is stable and makes it easy to evaluate the effect of the change.
          Thanks.

          stack added a comment -

          Is this meant to be in the patch?

            LOG.info("###compaction get the closelock, sleep 20s to simulate slow compaction");
            try {
              Thread.sleep(20000);
            } catch (InterruptedException e) {
              LOG.info("###sleep interrupted");
            }

          What change did you make, Qiang Tian?

          Qiang Tian added a comment -

          Hi stack, Andrew Purtell,
          any comments?
          thanks!

          Qiang Tian added a comment -

          A simple comparison test using the updated TestHRegionServerBulkLoad.java. The numbers are just for reference; the real perf improvement depends on a combination of factors, such as compaction time, bulkload time, scan/read workload type, request concurrency, etc.

          98.5:
          -----------
          2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(345): loaded 16
          2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(346): compations 16

          2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(348): Scanners:
          //average # with 50 scanners
          2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(350): scanned 73
          2014-10-24 02:30:03,400 INFO [main] regionserver.TestHRegionServerBulkLoad(351): verified 18000 rows

          98.5+patch
          --------
          //since bulkload has smaller conflict with compaction, we get more bulkload/compaction request in fixed test cycle(5 minutes)
          2014-10-24 02:41:19,071 INFO [main] regionserver.TestHRegionServerBulkLoad(344): Loaders:
          2014-10-24 02:41:19,072 INFO [main] regionserver.TestHRegionServerBulkLoad(345): loaded 43
          2014-10-24 02:41:19,072 INFO [main] regionserver.TestHRegionServerBulkLoad(346): compations 43

          2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(348): Scanners:
          //since bulkload has smaller conflict with scan, we get more scans in fixed test cycle(5 minutes)
          //average # for 50 scanners
          2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(350): scanned 92
          2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(351): verified 25000 rows

          Qiang Tian added a comment -

          initial YCSB test:

          Env:

          hadoop 2.2.0
          YCSB 1.0.4(Andrew's branch)
          3 nodes, 1 master, 2 RS //ignore cluster details since just to evaluate the new lock

          Steps:

          Followed Andrew's steps(see http://search-hadoop.com/m/DHED4hl7pC/)
          the seed table has 3 CFs, pre-split to 20 regions
          load 1 million rows to CF 'f1', using workloada
          run 3 iterations for workloadc and workloada respectively. the parameter in each run:

          -p columnfamily=f1 -p operationcount=1000000 -s -threads 10

          Results:

          0.98.5:
          workload c:
          [READ], AverageLatency(us), 496.225811
          [READ], AverageLatency(us), 510.206831
          [READ], AverageLatency(us), 501.256123

          workload a:
          [READ], AverageLatency(us), 676.4527555821747
          [READ], AverageLatency(us), 622.5544771452717
          [READ], AverageLatency(us), 628.1365657163067

          0.98.5+patch:
          workload c:
          [READ], AverageLatency(us), 536.334437
          [READ], AverageLatency(us), 508.405555
          [READ], AverageLatency(us), 491.416182

          workload a:
          [READ], AverageLatency(us), 640.3625218319231
          [READ], AverageLatency(us), 642.9719823488798
          [READ], AverageLatency(us), 631.7491770928287

          Looks like little performance penalty.

          I also ran PE in the cluster; since the test table has only 1 CF, the new lock is actually not used. Interestingly, with the patch the performance is even a bit better...

          Qiang Tian added a comment -

          It looks to me like the patch only shows its value when there is a long compaction plus gets/scans; not sure if Victor Xu wants to try it in some test env?
          Thanks.

          Qiang Tian added a comment -

          I forgot StoreScanner is per CF... the earlier analysis is wrong:

          After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, notifyChangedReadersObservers is called to reset the StoreScanner#heap, so checkReseek->resetScannerStack will be triggered in the next scan/read to recreate store scanners based on the new storefiles.

          so we could introduce a new region-level rwlock multiCFLock: HRegion#bulkLoadHFiles acquires the writelock before the multi-CF HStore.bulkLoadHFile calls, and StoreScanner#resetScannerStack acquires the readlock. This way the scanners are recreated only after all CFs' store files are populated.

          Instead, the new lock should be put at the RegionScanner layer; see the patch attached.

          The "mvn test" and "TestHRegionServerBulkLoad" (large test for atomic bulkload) runs passed; still need to run the large tests and a performance test (any suggestions for it? YCSB?).

          The lock can be further limited to a smaller scope by splitting HStore#bulkLoadHFile into 2 parts: 1) rename the bulkload files and put the new files into the store files list, 2) notifyChangedReadersObservers. Only #2 needs the lock.
          If HDFS file rename is fast, the split may not be needed.

          Qiang Tian added a comment -

          Update:
          The idea would cause a deadlock, since bulkload and scanner follow different orders when acquiring the bulkload lock and StoreScanner.lock. Will look at whether we can lower the granularity of the StoreScanner lock.

          Jerry He added a comment -
            /**
             * Atomic bulk load.
             */
            @Test
            public void testAtomicBulkLoad() throws Exception {
              String TABLE_NAME = "atomicBulkLoad";
          
              int millisToRun = 30000;
          

          This test case is 30 sec.

            /**
             * Run test on an HBase instance for 5 minutes. This assumes that the table
             * under test only has a single region.
             */
            public static void main(String args[]) throws Exception {
          

          main is not invoked during JUnit run.

          Qiang Tian added a comment -

          Thanks Jerry He,
          Is this the right way to run the bulkload test: mvn test -Dtest=TestHRegionServerBulkLoad?
          The test is supposed to run for 5 minutes, but it exits after only about 1 minute. Is that expected?

          Jerry He added a comment -

          Other region operations: only acquire region read lock

          --> Other region operations: only acquire region read or write lock, no change from existing behavior.

          Jerry He added a comment -

          Hi, Qiang Tian

          The idea may be feasible:

          bulk load begins: acquire region read lock + new bulk load write lock.
          Scan/next begins: acquire region read lock + new bulk load read lock.
          Other region operations: only acquire region read lock

          This will save the compaction and bulk load from blocking each other (a rough sketch of this acquire pattern is below).

          Do you mind drafting a patch and run thru the test suite?
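          A minimal sketch of that acquire pattern, assuming two plain ReentrantReadWriteLocks (names invented for illustration; not the actual patch):

            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Hypothetical two-lock scheme: compaction takes only the region read lock, so it no
            // longer blocks bulk load; bulk load and scan/next serialize through the new lock.
            class TwoLockSketch {
              private final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();   // close vs. everything else
              private final ReentrantReadWriteLock bulkLoadLock = new ReentrantReadWriteLock(); // bulk load vs. scan/next

              void bulkLoadBegin() {
                regionLock.readLock().lock();     // still excluded by region close
                bulkLoadLock.writeLock().lock();  // new bulk load write lock: excludes scans, not compactions
              }

              void bulkLoadEnd() {
                bulkLoadLock.writeLock().unlock();
                regionLock.readLock().unlock();
              }

              void scanOrNextBegin() {
                regionLock.readLock().lock();
                bulkLoadLock.readLock().lock();   // new bulk load read lock
              }

              void scanOrNextEnd() {
                bulkLoadLock.readLock().unlock();
                regionLock.readLock().unlock();
              }

              void otherRegionOperationBegin() {
                regionLock.readLock().lock();     // e.g. compaction: only the region read lock, as before
              }

              void otherRegionOperationEnd() {
                regionLock.readLock().unlock();
              }
            }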

          Qiang Tian added a comment -

          Ideas for lowering the lock granularity (based on the 0.98.5 code base):
          1) read/scan
          Is this the primary goal of the atomic multi-CF bulkload in HBASE-4552?

          After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, notifyChangedReadersObservers is called to reset the StoreScanner#heap, so checkReseek->resetScannerStack will be triggered in next scan/read to recreate store scanners based on new storefiles.

          so we could introduce a new region-level rwlock multiCFLock: HRegion#bulkLoadHFiles acquires the writelock before the multi-CF HStore.bulkLoadHFile calls, and StoreScanner#resetScannerStack acquires the readlock. This way the scanners are recreated only after all CFs' store files are populated.

          2) split region
          The region will be closed in SplitTransaction#stepsBeforePONR, which falls into the HRegion#lock protection area. Bulk load still needs to acquire its readlock at the start.

          3) memstore flush
          We flush to a new file, which is not related to the loaded files.

          4) compaction
          Compaction is performed store by store. If bulkload inserts new files into storefiles during the selectCompaction process, the file list to be compacted might be affected, e.g. the compaction for some CFs may not include newly loaded files while others might. But this does not affect data integrity or read behavior?
          At the end of compaction, storefiles access is still protected by HStore#lock if there is a bulk load change to the same CF.

          comments?
          thanks

          Qiang Tian added a comment -

          As stack mentioned in http://search-hadoop.com/m/DHED4NR0wT, the HRegion#lock is there to protect region close. The comments in HRegion.java, and the fact that only HRegion#doClose takes the writelock (if we do not consider HRegion#startBulkRegionOperation), also show that.

          So using HRegion#lock to protect multi-CF bulkload in HBASE-4552 looks too heavy-weight?
          From the stacktrace of HBASE-10882, all the reads/scans are blocked because the bulkload is waiting for lock.writelock, while the compaction has already acquired lock.readlock and is reading data, a time-consuming operation.

          And a related topic is discussed again in http://search-hadoop.com/m/DHED4I11p31. Perhaps we need another region-level lock.

          stack added a comment -

          This is an old issue, http://search-hadoop.com/m/0AGoj1C1AXY/org.apache.hadoop.hbase.RegionTooBusyException%253A+failed+to+get+a+lock+in+60000ms&subj=Bulk+loading+HFiles+via+LoadIncrementalHFiles+fails+at+a+region+that+is+being+compacted+a+bug+

            People

            • Assignee: Qiang Tian
            • Reporter: stack
            • Votes: 0
            • Watchers: 16
