HBase
  1. HBase
  2. HBASE-4742

Split dead server's log in parallel

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Resolving this issue as has already been fixed in trunk.

      Description

      When one region server goes down, the master will shutdown the region server and split its log.
      However, splitting log is a blocking call and it would take some time.

      If more than one region server go down, the master will split its log one by one, which is not efficient.
      Since we have the distributed log split, we could split these logs from the dead servers in parallel.

      1. ASF.LICENSE.NOT.GRANTED--D237.10.patch
        30 kB
        Phabricator
      2. ASF.LICENSE.NOT.GRANTED--D237.9.patch
        24 kB
        Phabricator
      3. ASF.LICENSE.NOT.GRANTED--D237.8.patch
        13 kB
        Phabricator
      4. ASF.LICENSE.NOT.GRANTED--D237.7.patch
        13 kB
        Phabricator
      5. ASF.LICENSE.NOT.GRANTED--D237.6.patch
        12 kB
        Phabricator
      6. ASF.LICENSE.NOT.GRANTED--D237.5.patch
        13 kB
        Phabricator
      7. ASF.LICENSE.NOT.GRANTED--D237.4.patch
        9 kB
        Phabricator
      8. ASF.LICENSE.NOT.GRANTED--D237.3.patch
        10 kB
        Phabricator
      9. ASF.LICENSE.NOT.GRANTED--D237.2.patch
        10 kB
        Phabricator
      10. ASF.LICENSE.NOT.GRANTED--D237.1.patch
        10 kB
        Phabricator

        Activity

        Hide
        Phabricator added a comment -

        Karthik has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:65 Could we write a one-line comment about the state? I think ABORT means there is no need to split any logs, right?
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:331 Might be missing something here... this break will just cause the operation to return as if it succeeded, but log splitting is still in progress right? We would then proceed to open the regions without ensuring this is completed?

        Maybe we need:
        this.requeue();
        return false;

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Karthik has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:65 Could we write a one-line comment about the state? I think ABORT means there is no need to split any logs, right? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:331 Might be missing something here... this break will just cause the operation to return as if it succeeded, but log splitting is still in progress right? We would then proceed to open the regions without ensuring this is completed? Maybe we need: this.requeue(); return false; REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin requested code review of "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        When one region server goes down, the master will shutdown the region server and split its log.
        However, splitting log is a blocking call and it would take some time.

        If more than one region server go down, the master will split its log one by one, which is not efficient.
        Since we have the distributed log split, we could split these logs from the dead servers in parallel.

        TEST PLAN
        1) add one unit test for multiple region server shut down case
        2) running all the unit test
        3) will test it in dev cluster

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin requested code review of " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA When one region server goes down, the master will shutdown the region server and split its log. However, splitting log is a blocking call and it would take some time. If more than one region server go down, the master will split its log one by one, which is not efficient. Since we have the distributed log split, we could split these logs from the dead servers in parallel. TEST PLAN 1) add one unit test for multiple region server shut down case 2) running all the unit test 3) will test it in dev cluster REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        tedyu has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:351 Something seems to be missing after the log statement.

        Minor: missing space between double quote and 'and'
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:311 A return statement is missing, if I understand correctly.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - tedyu has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:351 Something seems to be missing after the log statement. Minor: missing space between double quote and 'and' src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:311 A return statement is missing, if I understand correctly. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:316 name the thread for better jstacks
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:311 I don't think there is a need for ABORT state. There is no aborting the log-splitting ... we should just keep retrying.

        In this particular case when deadServer doesn't exist then you can return success
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:295 You could put the retry loop with sleeps here? Then the main thread has a very simple state where it just requeues itself if the log-splitting is not done.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:316 name the thread for better jstacks src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:311 I don't think there is a need for ABORT state. There is no aborting the log-splitting ... we should just keep retrying. In this particular case when deadServer doesn't exist then you can return success src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:295 You could put the retry loop with sleeps here? Then the main thread has a very simple state where it just requeues itself if the log-splitting is not done. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Thanks Prakash, Ted and Karthik.
        I removed the ABORT status now.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Thanks Prakash, Ted and Karthik. I removed the ABORT status now. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Address Karthik's comments.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Address Karthik's comments. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:152 How can the master close? It should not under any of these circumstances.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:152 How can the master close? It should not under any of these circumstances. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        also, I think we should maintain the old single threaded behavior when distributed log splitting is not enabled. (You might not have to make any change becasue of splitLogLock in HMaster.splitLog() function ... but please verify)

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". also, I think we should maintain the old single threaded behavior when distributed log splitting is not enabled. (You might not have to make any change becasue of splitLogLock in HMaster.splitLog() function ... but please verify) REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Address Prakash's comments.
        Also I try to classify that if the distributed log splitting is true, then the master will launch multiple threads to split the log in parallel. If the distributed log splitting is not true, the master will launch multiple threads to split the log in sequentially.
        The motivation is to stop blocking the region server operation queue.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Address Prakash's comments. Also I try to classify that if the distributed log splitting is true, then the master will launch multiple threads to split the log in parallel. If the distributed log splitting is not true, the master will launch multiple threads to split the log in sequentially. The motivation is to stop blocking the region server operation queue. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        mbautin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Liyin, good stuff! Some comments inline:

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:56 Should this field be remove now that it is superseded by logSplitResult?
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 Succeed to split -> Succeeded splitting
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:82 Indentation
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:93 Space between ) and op
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:133 Space between try and {
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 1. Please don't ignore the exception. Log it at least. What is the expected exception here? Can we catch only one particular kind of exception and fail the unit test for all other exceptions?
        2. Indentation
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:52 This variable is never read
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:68 Insert an empty line here
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 Would it make sense to throw an IOException instead?
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:113 This variable is never read
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:143 all the listener -> all the listeners
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:49 Rename to numClosedRegionServers. Otherwise it is not clear whether this is a regionserver ID or the number of them.
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:72 Line longer than 80 characters (this is currently not reported by this Phabricator setup).
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Are you sure that it is OK to requeue this split request before the worker thread splitting this region server's logs completes?
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Instead of spawning threads in an uncontrolled way like this, it would be nice to use an ExecutorService. Also, if there are hundreds of logs to split, we would probably like to restrict the number of concurrent log-splitting threads, which can also be achieved using a thread-pool executor service.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - mbautin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Liyin, good stuff! Some comments inline: INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:56 Should this field be remove now that it is superseded by logSplitResult? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 Succeed to split -> Succeeded splitting src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:82 Indentation src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:93 Space between ) and op src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:133 Space between try and { src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 1. Please don't ignore the exception. Log it at least. What is the expected exception here? Can we catch only one particular kind of exception and fail the unit test for all other exceptions? 2. Indentation src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:52 This variable is never read src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:68 Insert an empty line here src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 Would it make sense to throw an IOException instead? src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:113 This variable is never read src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:143 all the listener -> all the listeners src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:49 Rename to numClosedRegionServers. Otherwise it is not clear whether this is a regionserver ID or the number of them. src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:72 Line longer than 80 characters (this is currently not reported by this Phabricator setup). src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Are you sure that it is OK to requeue this split request before the worker thread splitting this region server's logs completes? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Instead of spawning threads in an uncontrolled way like this, it would be nice to use an ExecutorService. Also, if there are hundreds of logs to split, we would probably like to restrict the number of concurrent log-splitting threads, which can also be achieved using a thread-pool executor service. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        tedyu has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Here the new thread is per dead server, right ?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - tedyu has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Here the new thread is per dead server, right ? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        mbautin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Ted: yes, that is correct. However, there could be a considerable number of unique servers to split logs for, and some throttling of the number of concurrent log splitting threads might still be good.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - mbautin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 Ted: yes, that is correct. However, there could be a considerable number of unique servers to split logs for, and some throttling of the number of concurrent log splitting threads might still be good. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Answer the comments inline.

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 I agree with Ted.
        The thread number is bounded by the number of dead servers at one time naturally.

        Prakash, what do you think?
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Why do you think it is NOT OK ???
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 What' wrong with "Succeed to split" ?
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 I believe it is more reasonable to throw RuntimeExcpetion.
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 I try to ignore this Exception here on purpose.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Answer the comments inline. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 I agree with Ted. The thread number is bounded by the number of dead servers at one time naturally. Prakash, what do you think? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Why do you think it is NOT OK ??? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 What' wrong with "Succeed to split" ? src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 I believe it is more reasonable to throw RuntimeExcpetion. src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 I try to ignore this Exception here on purpose. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        mbautin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 My concern is that there might be two threads at some point splitting logs for the same server.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 "Succeed" is infinitive and "succeeded" is past, so e.g. "succeeded to split" would be more appropriate. But this does not really matter.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 OK, this is fine. In general, unchecked exceptions should only be thrown when the program is not expected to recover from the error (http://bit.ly/RuntimeException). This is probably appropriate in case of an unknown enum element.
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 What do you mean by "this exception" and what's the purpose of ignoring it? This will ignore all exceptions and make the code hard to debug in case something goes wrong in the try block.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 OK, if we believe there will be no more than say 10% of servers going down at the same time, and the current cluster size is 100 nodes, this is fine. However, if we had 1000 servers and half of them went down, then the master might choke trying to split 500 logs at once.

        Prakash: do we use distributed log splitting only on master startup or during normal master operation as well?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - mbautin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 My concern is that there might be two threads at some point splitting logs for the same server. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 "Succeed" is infinitive and "succeeded" is past, so e.g. "succeeded to split" would be more appropriate. But this does not really matter. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 OK, this is fine. In general, unchecked exceptions should only be thrown when the program is not expected to recover from the error ( http://bit.ly/RuntimeException ). This is probably appropriate in case of an unknown enum element. src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 What do you mean by "this exception" and what's the purpose of ignoring it? This will ignore all exceptions and make the code hard to debug in case something goes wrong in the try block. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 OK, if we believe there will be no more than say 10% of servers going down at the same time, and the current cluster size is 100 nodes, this is fine. However, if we had 1000 servers and half of them went down, then the master might choke trying to split 500 logs at once. Prakash: do we use distributed log splitting only on master startup or during normal master operation as well? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Thank Mikhail for your quick response.
        We have agreed on most of the discussion here.

        The remaining discussion is focusing on the number of threads launched in master for splitting dead servers log, which has made me re-considering our motivation about parallel distributed log splitting here.

        Our basic motivation is splitting log should not block the region server process queue. Also the distributed log splitting itself is designed to split log for a large number region servers. So we could batch all the dead region servers together into a queue and launch single thread to do the distributed log splitting, instead of distribute log splitting for each dead server as a separate thread.

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Thank you for clarify your concern

        For each dead server, the master would receive the znode expire event for this dead server only once.
        So the master wouldn't have 2 threads split the same dead region server at same time.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 Fine. I would change it to "Succeeded in splitting".
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 Thanks for clarifying.
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 Actually, I don't have to catch the exceptions here explicitly.
        It won't affect the unit test results.
        Thanks for the discussion.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 1) We use distributed log splitting for the dead region server as well.

        2) Even though we use thread pool to execute, I would bound the max thread as the number of region server.
        What do you think of the max thread we should bound for the execute thread pool here?

        Also as your example mentioned here, 500 region server went down. The master would launch 500 threads to distributed log splitting in parallel. It won't choke the master too much since the split job is done on each region server side.

        3) But this discussion also leads us to another good point. Let's say if there are a large number region server dead for some reason. Shall we batch these dead region servers to split instead of splitting their log in parallel.

        Any ideas? Mikhail and Prakash ?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Thank Mikhail for your quick response. We have agreed on most of the discussion here. The remaining discussion is focusing on the number of threads launched in master for splitting dead servers log, which has made me re-considering our motivation about parallel distributed log splitting here. Our basic motivation is splitting log should not block the region server process queue. Also the distributed log splitting itself is designed to split log for a large number region servers. So we could batch all the dead region servers together into a queue and launch single thread to do the distributed log splitting, instead of distribute log splitting for each dead server as a separate thread. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 Thank you for clarify your concern For each dead server, the master would receive the znode expire event for this dead server only once. So the master wouldn't have 2 threads split the same dead region server at same time. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:337 Fine. I would change it to "Succeeded in splitting". src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:347-348 Thanks for clarifying. src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java:135 Actually, I don't have to catch the exceptions here explicitly. It won't affect the unit test results. Thanks for the discussion. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:312 1) We use distributed log splitting for the dead region server as well. 2) Even though we use thread pool to execute, I would bound the max thread as the number of region server. What do you think of the max thread we should bound for the execute thread pool here? Also as your example mentioned here, 500 region server went down. The master would launch 500 threads to distributed log splitting in parallel. It won't choke the master too much since the split job is done on each region server side. 3) But this discussion also leads us to another good point. Let's say if there are a large number region server dead for some reason. Shall we batch these dead region servers to split instead of splitting their log in parallel. Any ideas? Mikhail and Prakash ? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Kannan has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        In practice, the actual log splitting will happen in region server... so at the master level it is only going to consume as many threads as concurrently dead region servers... but the threads won't really be doing much expensive work. I would imagine it would to spawn say even 1000 threads without much adverse affect.

        Having said that, if region servers are repeatedly starting/dying etc. it could end in some degenerate cases where things could grow unbounded and take the master down. So limiting this seems the safer thing to do.

        Liyin: In practice, while I think what you have is just fine and unlikely to cause any serious issues... does it complicate the design too much to back this by a threadpool backed with a queue and set the limit pretty high to say 1000? If the changes are not too intrusive, I would suggest doing the threadpool. If more involved, we can layer that in in a second pass.

        regards,
        Kannan

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Kannan has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". In practice, the actual log splitting will happen in region server... so at the master level it is only going to consume as many threads as concurrently dead region servers... but the threads won't really be doing much expensive work. I would imagine it would to spawn say even 1000 threads without much adverse affect. Having said that, if region servers are repeatedly starting/dying etc. it could end in some degenerate cases where things could grow unbounded and take the master down. So limiting this seems the safer thing to do. Liyin: In practice, while I think what you have is just fine and unlikely to cause any serious issues... does it complicate the design too much to back this by a threadpool backed with a queue and set the limit pretty high to say 1000? If the changes are not too intrusive, I would suggest doing the threadpool. If more involved, we can layer that in in a second pass. regards, Kannan REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Kannan, Mikhail, Prakash and Ted,
        How about batch all the dead servers in a blocking queue and only launch one thread in HMaster to distributed log splitting these dead regions together.

        The interface of distributed log splitting is:
        public void splitLog(final List<String> serverNames) {}
        So it naturally designed to split a list region servers.

        The reason I proposed this idea is because the function to split log in HMaster is not well-tested in a multithread situations.
        It may involve some race condition when 2 threads are going to distributed log splitting for 2 different dead region servers.

        If the master split dead servers log in batch in a single thread, then we can avoid these potential problems and also matches our original motivations.

        What do you guys think?

        Thanks
        Liyin

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Kannan, Mikhail, Prakash and Ted, How about batch all the dead servers in a blocking queue and only launch one thread in HMaster to distributed log splitting these dead regions together. The interface of distributed log splitting is: public void splitLog(final List<String> serverNames) {} So it naturally designed to split a list region servers. The reason I proposed this idea is because the function to split log in HMaster is not well-tested in a multithread situations. It may involve some race condition when 2 threads are going to distributed log splitting for 2 different dead region servers. If the master split dead servers log in batch in a single thread, then we can avoid these potential problems and also matches our original motivations. What do you guys think? Thanks Liyin REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        mbautin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Liyin, could you please clarify how that single thread is going to control log splitting done remotely by multiple regionservers in parallel? Would that be some kind of an event loop? If that is easier than fixing the race conditions in HMaster you are concerned about, I'm fine with it. Or, as another alternative, we could test the existing solution more thoroughly.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - mbautin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Liyin, could you please clarify how that single thread is going to control log splitting done remotely by multiple regionservers in parallel? Would that be some kind of an event loop? If that is easier than fixing the race conditions in HMaster you are concerned about, I'm fine with it. Or, as another alternative, we could test the existing solution more thoroughly. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Ted Yu added a comment -

        I don't see ProcessServerShutdown.java in 0.90 or TRUNK. So I assume it is from 0.89 branch.
        I am not familiar with 0.89 branch. So my discussion below is tailored toward 0.92/TRUNK.

        From Liyin:

        instead of distribute log splitting for each dead server as a separate thread.

        I agree with the above statement.

        In TRUNK, the method Liyin mentioned:

          public void splitLog(final List<ServerName> serverNames) throws IOException {
        

        translates server names to log paths and uses the following to split the paths:

                splitLogSize = splitLogManager.splitLogDistributed(logDirs);
        

        We can introduce a short interval for master to group the malfunctioning region servers so that the above splitLog() API can be used.
        If region servers go down with non-trivial interval in between, it is not obvious how the above splitLog() API can be used.

        Show
        Ted Yu added a comment - I don't see ProcessServerShutdown.java in 0.90 or TRUNK. So I assume it is from 0.89 branch. I am not familiar with 0.89 branch. So my discussion below is tailored toward 0.92/TRUNK. From Liyin: instead of distribute log splitting for each dead server as a separate thread. I agree with the above statement. In TRUNK, the method Liyin mentioned: public void splitLog( final List<ServerName> serverNames) throws IOException { translates server names to log paths and uses the following to split the paths: splitLogSize = splitLogManager.splitLogDistributed(logDirs); We can introduce a short interval for master to group the malfunctioning region servers so that the above splitLog() API can be used. If region servers go down with non-trivial interval in between, it is not obvious how the above splitLog() API can be used.
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Address Mikhail and Kannan's comments for using thread pool to bound the total number thread.
        The default max thread is 1000.

        Also keep testing it on the dev cluster.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Address Mikhail and Kannan's comments for using thread pool to bound the total number thread. The default max thread is 1000. Also keep testing it on the dev cluster. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Mikhail, since the function to split log is a blocking function. So it will return unit the splitLogManager says all the logs has been spitted successfully.

        Anyway, I implemented the ThreadPool version, which is much easy to change.
        I will keep test it in the dev cluster and discuss with Prakash about the above concern.

        In addition, in apache trunk, it handle these region server shut down cases in parallel. So by design, the function to distributed log splitting is supposed to be run in parallel.
        Let's get it confirmed by Prakash and testing.

        Thank all of you guys for reviewing it in the weekend

        Liyin

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Mikhail, since the function to split log is a blocking function. So it will return unit the splitLogManager says all the logs has been spitted successfully. Anyway, I implemented the ThreadPool version, which is much easy to change. I will keep test it in the dev cluster and discuss with Prakash about the above concern. In addition, in apache trunk, it handle these region server shut down cases in parallel. So by design, the function to distributed log splitting is supposed to be run in parallel. Let's get it confirmed by Prakash and testing. Thank all of you guys for reviewing it in the weekend Liyin REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        tedyu has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:311 I think thread index starting from 1 would be more readable to user.
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 Please the following call before returning:
        (logSplitThreadPool).allowCoreThreadTimeOut(true);

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - tedyu has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/HMaster.java:311 I think thread index starting from 1 would be more readable to user. src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 Please the following call before returning: (logSplitThreadPool).allowCoreThreadTimeOut(true); REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        tedyu has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 Should read: 'Please add the following call before returning'

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - tedyu has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 Should read: 'Please add the following call before returning' REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Thanks for reviewing it in the weekend.

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:311 OK I can start with 1 instead of 0
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 I don't think it is necessary here.
        Because the coreThreadNumber is 0.
        Every thread launched by this thread pool is not from core pool.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Thanks for reviewing it in the weekend. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/HMaster.java:311 OK I can start with 1 instead of 0 src/main/java/org/apache/hadoop/hbase/master/HMaster.java:314 I don't think it is necessary here. Because the coreThreadNumber is 0. Every thread launched by this thread pool is not from core pool. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        If you splitLog(list-of-servernames) then the function won't return until all the servers' logs are split.

        If you call multiple splitLog(servername) then each one returns as soon as it is done. This is what we are doing now and we should stay with it.

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:305 managing threads through a thread pool is a good thing. But I doubt that there is any scenario where the number of threads will increase without limit. The number of threads will always be limited by the number of region servers. And a new incarnation of a region server is not instantiated by the master unless the old one's log splitting is done.

        but i think we should still use the thread pool. it is a better abstraction.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 Can you please do a new Runnable inline over here. That will make it easier for someone to read ProcessServerShutdown() code. Having two methods run() and process() kind of makes it confusing to understand ProcessServerShutdown.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 This should be LOG.debug. You might consider rate limiting this debug output.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 This should be LOG.warn
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:339 Runtime.halt()? Even if you throw an exception the retry mechanism will kick in.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 If master shutdown is called (abort or regular shutdown) then I am not sure that the pool is shutting down properly. Someone has to interrupt each of the threads that are have called splitLog(). Or those splitLog() threads have to be daemons.
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:316 Now that you have a thread pool you can look at the status of the submitted tasks instead of maintaining special states

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". If you splitLog(list-of-servernames) then the function won't return until all the servers' logs are split. If you call multiple splitLog(servername) then each one returns as soon as it is done. This is what we are doing now and we should stay with it. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/HMaster.java:305 managing threads through a thread pool is a good thing. But I doubt that there is any scenario where the number of threads will increase without limit. The number of threads will always be limited by the number of region servers. And a new incarnation of a region server is not instantiated by the master unless the old one's log splitting is done. but i think we should still use the thread pool. it is a better abstraction. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 Can you please do a new Runnable inline over here. That will make it easier for someone to read ProcessServerShutdown() code. Having two methods run() and process() kind of makes it confusing to understand ProcessServerShutdown. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 This should be LOG.debug. You might consider rate limiting this debug output. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:333 This should be LOG.warn src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:339 Runtime.halt()? Even if you throw an exception the retry mechanism will kick in. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 If master shutdown is called (abort or regular shutdown) then I am not sure that the pool is shutting down properly. Someone has to interrupt each of the threads that are have called splitLog(). Or those splitLog() threads have to be daemons. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:316 Now that you have a thread pool you can look at the status of the submitted tasks instead of maintaining special states REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Address Prakash and Ted's comments.
        Thanks for reviewing in the weekend.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Address Prakash and Ted's comments. Thanks for reviewing in the weekend. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        lhofhansl has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Very nice improvment!

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:285 This method is not used anywhere, correct?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - lhofhansl has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Very nice improvment! INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:285 This method is not used anywhere, correct? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:306 Why use a SynchronousQueue ... won't that cause the task.submitting() thread to block if the thread-pool is at capacity? (Not that we will hit this issue with default capacity set at 1000)
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:313 I am not very sure about the ThreadPoolExecutor's behavior ... is it possible that when waiting for a task the thread pool keeps its threads in daemon mode and before a thread is chosen to execute a task it's made into a non-daemon thread?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/HMaster.java:306 Why use a SynchronousQueue ... won't that cause the task.submitting() thread to block if the thread-pool is at capacity? (Not that we will hit this issue with default capacity set at 1000) src/main/java/org/apache/hadoop/hbase/master/HMaster.java:313 I am not very sure about the ThreadPoolExecutor's behavior ... is it possible that when waiting for a task the thread pool keeps its threads in daemon mode and before a thread is chosen to execute a task it's made into a non-daemon thread? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 I have made these threads as daemons threads.

        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:285 yes, stale code. I shall remove this function.
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:306 Good point, Prakash.
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java:313 I think all the threads created by this thread executor pool will run as daemon mode, which means the thread will either run until it completes or until all User Threads have completed.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:304 I have made these threads as daemons threads. src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:285 yes, stale code. I shall remove this function. src/main/java/org/apache/hadoop/hbase/master/HMaster.java:306 Good point, Prakash. src/main/java/org/apache/hadoop/hbase/master/HMaster.java:313 I think all the threads created by this thread executor pool will run as daemon mode, which means the thread will either run until it completes or until all User Threads have completed. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Address Prakash and lhofhansl 's comments.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Address Prakash and lhofhansl 's comments. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        Karthik has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Looks good to me... +1

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Karthik has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Looks good to me... +1 REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        khemani has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        +1

        Great work, Liyin!

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - khemani has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". +1 Great work, Liyin! REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
        src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java src/test/java/org/apache/hadoop/hbase/master/TestMultiRegionServerShutDown.java
        Hide
        Phabricator added a comment -

        Kannan has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Nice work Liyin. One inlined comment.

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 there are two queues, the toDoQueue and the delayedToDoQueue.

        From reading the code, it seems like "return false" will cause this to be put in the toDoQueue upstream in RegionServerOperationQueue.java.

        Should we instead use "this.requeue()" so that it is back in the delay queue, but then, return true, so that it doesn't also get put in the toDoQueue?

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Kannan has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". Nice work Liyin. One inlined comment. INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 there are two queues, the toDoQueue and the delayedToDoQueue. From reading the code, it seems like "return false" will cause this to be put in the toDoQueue upstream in RegionServerOperationQueue.java. Should we instead use "this.requeue()" so that it is back in the delay queue, but then, return true , so that it doesn't also get put in the toDoQueue? REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin has commented on the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        INLINE COMMENTS
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 Hi Kannan,
        When testing this revision internally, I have already made this change to add this operation back to delay queue but in a different way.

        There are 2 ways to add this operation back to the delay queue instead of to do queue.
        1) call this.requeue(), which is add the operation to the delay queue and return true. It means this operation has succeeded and all the post-process-listener will be called.

        2) Throw out some exceptions here, which will be caught in the RegionServerOperationQueue. Then this operation will be put back to the delay queue bu RegionServerOperationQueue.

        After discussion with Prakash and Mikhail, Prakash prefers the 2nd solutions and Mikhail suggests to refactor the code. So each operation should return OPERATION.SUCCESS, OPERATION.FAILED and OPERATION.DLAYED.

        I have already implemented the 2nd solutions internally and get it tested in big cluster. But now I try to refactor the code as Mikhail suggested.

        I will update this revision once every thing settled down during the test.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - Liyin has commented on the revision " [jira] HBASE-4742 Split dead server's log in parallel". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java:323 Hi Kannan, When testing this revision internally, I have already made this change to add this operation back to delay queue but in a different way. There are 2 ways to add this operation back to the delay queue instead of to do queue. 1) call this.requeue(), which is add the operation to the delay queue and return true. It means this operation has succeeded and all the post-process-listener will be called. 2) Throw out some exceptions here, which will be caught in the RegionServerOperationQueue. Then this operation will be put back to the delay queue bu RegionServerOperationQueue. After discussion with Prakash and Mikhail, Prakash prefers the 2nd solutions and Mikhail suggests to refactor the code. So each operation should return OPERATION.SUCCESS, OPERATION.FAILED and OPERATION.DLAYED. I have already implemented the 2nd solutions internally and get it tested in big cluster. But now I try to refactor the code as Mikhail suggested. I will update this revision once every thing settled down during the test. REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        Refactor the code of RegionServerOperationQueue.

        Each operation will return OPERATION.SUCCEEDED, OPERATION.FAILED and OPERATON.DELAYED every time when it is processed.

        Based on this return value, the RegionServerOperationQueue will put them back to todo queue, delay queue or call the post-process-listener.

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionStatusChange.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperation.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/test/java/org/apache/hadoop/hbase/client/TestScannerTimeout.java
        src/test/java/org/apache/hadoop/hbase/master/TestMasterTransitions.java
        src/test/resources/hbase-site.xml

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA Refactor the code of RegionServerOperationQueue. Each operation will return OPERATION.SUCCEEDED, OPERATION.FAILED and OPERATON.DELAYED every time when it is processed. Based on this return value, the RegionServerOperationQueue will put them back to todo queue, delay queue or call the post-process-listener. REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java src/main/java/org/apache/hadoop/hbase/master/ProcessRegionStatusChange.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperation.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/test/java/org/apache/hadoop/hbase/client/TestScannerTimeout.java src/test/java/org/apache/hadoop/hbase/master/TestMasterTransitions.java src/test/resources/hbase-site.xml
        Hide
        Phabricator added a comment -

        Liyin updated the revision "[jira] HBASE-4742 Split dead server's log in parallel".
        Reviewers: Kannan, khemani, Karthik, mbautin, JIRA

        REVISION DETAIL
        https://reviews.facebook.net/D237

        AFFECTED FILES
        src/main/java/org/apache/hadoop/hbase/master/HMaster.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessRegionStatusChange.java
        src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperation.java
        src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java
        src/test/java/org/apache/hadoop/hbase/client/TestScannerTimeout.java
        src/test/java/org/apache/hadoop/hbase/master/TestMasterTransitions.java
        src/test/resources/hbase-site.xml

        Show
        Phabricator added a comment - Liyin updated the revision " [jira] HBASE-4742 Split dead server's log in parallel". Reviewers: Kannan, khemani, Karthik, mbautin, JIRA REVISION DETAIL https://reviews.facebook.net/D237 AFFECTED FILES src/main/java/org/apache/hadoop/hbase/master/HMaster.java src/main/java/org/apache/hadoop/hbase/master/ProcessRegionClose.java src/main/java/org/apache/hadoop/hbase/master/ProcessRegionOpen.java src/main/java/org/apache/hadoop/hbase/master/ProcessRegionStatusChange.java src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperation.java src/main/java/org/apache/hadoop/hbase/master/RegionServerOperationQueue.java src/test/java/org/apache/hadoop/hbase/client/TestScannerTimeout.java src/test/java/org/apache/hadoop/hbase/master/TestMasterTransitions.java src/test/resources/hbase-site.xml
        Hide
        Phabricator added a comment -

        mbautin has accepted the revision "[jira] HBASE-4742 Split dead server's log in parallel".

        Looks good to me! Nice fix in TestMasterTransitions!

        REVISION DETAIL
        https://reviews.facebook.net/D237

        Show
        Phabricator added a comment - mbautin has accepted the revision " [jira] HBASE-4742 Split dead server's log in parallel". Looks good to me! Nice fix in TestMasterTransitions! REVISION DETAIL https://reviews.facebook.net/D237
        Hide
        Nicolas Spiegelberg added a comment -

        +1. I take it this is for 89? Will commit when ported to trunk

        Show
        Nicolas Spiegelberg added a comment - +1. I take it this is for 89? Will commit when ported to trunk
        Hide
        Liyin Tang added a comment -

        @Nicolas, this is only for 89-fb. Trunk has already splited the dead server's log in parallel in another way.

        Show
        Liyin Tang added a comment - @Nicolas, this is only for 89-fb. Trunk has already splited the dead server's log in parallel in another way.

          People

          • Assignee:
            Liyin Tang
            Reporter:
            Liyin Tang
          • Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development