Bookkeeper / BOOKKEEPER-199

Provide bookie readonly mode, when journal/ledgers flushing has failed with IOE

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Implemented
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: bookkeeper-server
    • Labels: None

      Description

      BookKeeper should change to read-only (r-o) mode when journal/ledger flushing fails with an IOException. The bookie should then reject write requests on the server side and accept only read requests from clients, because even if flushing fails, the data in the bookie that has already been flushed is still valid.
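
      A minimal sketch of the server-side behaviour described above; the class and method names here are illustrative assumptions, not the actual bookie code.

      import java.io.IOException;
      import java.util.concurrent.atomic.AtomicBoolean;

      public class ReadOnlyAwareBookie {
          private final AtomicBoolean readOnly = new AtomicBoolean(false);

          /** Called when journal/ledger flushing fails with an IOException. */
          public void transitionToReadOnly() {
              readOnly.set(true);
          }

          /** Server-side write path: rejected once the bookie is read-only. */
          public void addEntry(long ledgerId, long entryId, byte[] data) throws IOException {
              if (readOnly.get()) {
                  // Already-flushed data stays valid; only new writes are rejected.
                  throw new IOException("Bookie is in read-only mode, rejecting write to ledger " + ledgerId);
              }
              // ... normal write path ...
          }

          /** Reads are served whether or not the bookie is read-only. */
          public byte[] readEntry(long ledgerId, long entryId) throws IOException {
              // ... normal read path ...
              return new byte[0]; // placeholder
          }
      }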

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Gavin made changes -
          Link This issue is depended upon by BOOKKEEPER-126 [ BOOKKEEPER-126 ]
          Gavin made changes -
          Link This issue blocks BOOKKEEPER-126 [ BOOKKEEPER-126 ]
          Ivan Kelly made changes -
          Fix Version/s 4.2.0 [ 12320244 ]
          Ivan Kelly made changes -
          Link This issue blocks BOOKKEEPER-126 [ BOOKKEEPER-126 ]
          Ivan Kelly made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Implemented [ 10 ]
          Ivan Kelly made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Ivan Kelly added a comment -

          Shouldn't be resolved as 'Fixed', as there were no actual code changes in this specific JIRA. Will mark as 'Implemented'.

          Vinayakumar B added a comment -

          Thanks Uma, Ivan, Sijie, Flavio and Rakesh for discussions, reviews and comments.

          Vinayakumar B made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uma Maheswara Rao G made changes -
          Assignee Rakesh R [ rakeshr ] Vinay [ vinayrpet ]
          Uma Maheswara Rao G added a comment -

          Thanks a lot Vinay for your efforts in this JIRA; it's a nice addition.

          Vinayakumar B added a comment -

          Hi all,
          I have attached patches to BOOKKEEPER-345 and BOOKKEEPER-346 regarding the above-mentioned approach. Please take a look.
          Thanks.

          Rakesh R added a comment -

          Thanks a lot Sijie for the support and for helping us finalize the approach.

          Hi Vinay, could you please start preparing a patch with the new idea?

          Sijie Guo added a comment -

          Thanks Vinay and Rakesh. I got the idea of running a separate thread to check the disks. The solution seems OK to me.

          Rakesh R added a comment -

          Thanks Sijie for your inputs.
          @Sijie

          I don't see any benefit in running a separate thread over checking only when an exception happens. If I missed your idea, please correct me.

          Also, it would help us keep it simple: the IOE handling and disk-full responsibility will be assigned to one place rather than being scattered across many places like ledgerEntry and ledgerCache.

          Vinayakumar B added a comment -

          Following are the benefits we will get over the approach given in BOOKKEEPER-345:

          1) Fewer changes in the main code flow.
          2) No synchronization required between addEntry(..) and flush(..) as used in BOOKKEEPER-345, hence no impact on performance.
          3) No need to maintain any duplicate entries anywhere.
          4) Since we detect disk over-usage well ahead of reaching 100% disk usage, we will not lose any written data.

          Anyway, this new thread will not be doing any heavy operation (it only checks the disk usage periodically),
          so adding this thread will not impact the performance of the system.

          Sijie Guo added a comment -

          @Vinay,

          I don't see any benefit in running a separate thread over checking only when an exception happens. If I missed your idea, please correct me.

          Rakesh R added a comment -

          @Vinay
          I am thinking of introducing one daemon thread to check the disks periodically. We can configure the threshold. If any disk is fuller than the threshold, then that disk can be excluded from further writes.

          Yeah, I also feel we can follow a similar pattern to the one used in HDFS. If I understand your thoughts correctly, we will make the following changes:

          • When there is any IOException while adding an entry, keep it simple by shutting down the bk server (I think complex logic to classify disk full vs. disk failure on an IOE is not required)
          • 'threshold': checks the disk usage percentage rather than the plain disk size. The disk percentage will be a configurable item in bk_server.conf
          • 'ResourceMonitor': a daemon thread that periodically checks for full disks in the background. In order to control the concurrency between the scanner and the addEntry client, we shouldn't allow entries beyond 85% or 90% of the disk size. The scanner can also look for disk errors and initiate a bk shutdown (see the sketch after this comment).

          Would like to see others' thoughts?

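
          A minimal sketch of the periodic disk-usage check described above; the class name (ResourceMonitor), the threshold, and the check interval are illustrative assumptions, not the actual BookKeeper implementation.

          import java.io.File;
          import java.util.ArrayList;
          import java.util.Collections;
          import java.util.List;

          public class ResourceMonitor extends Thread {
              private final List<File> ledgerDirs;
              private final double usageThreshold;   // e.g. 0.90 for 90%
              private final long checkIntervalMs;
              private volatile List<File> writableDirs;
              private volatile boolean running = true;

              public ResourceMonitor(List<File> ledgerDirs, double usageThreshold, long checkIntervalMs) {
                  super("ResourceMonitor");
                  setDaemon(true);
                  this.ledgerDirs = ledgerDirs;
                  this.usageThreshold = usageThreshold;
                  this.checkIntervalMs = checkIntervalMs;
                  this.writableDirs = Collections.unmodifiableList(new ArrayList<>(ledgerDirs));
              }

              @Override
              public void run() {
                  while (running) {
                      List<File> stillWritable = new ArrayList<>();
                      for (File dir : ledgerDirs) {
                          long total = dir.getTotalSpace();
                          long used = total - dir.getUsableSpace();
                          // Exclude a directory from further writes once it crosses the threshold.
                          if (total > 0 && (double) used / total < usageThreshold) {
                              stillWritable.add(dir);
                          }
                      }
                      writableDirs = Collections.unmodifiableList(stillWritable);
                      try {
                          Thread.sleep(checkIntervalMs);
                      } catch (InterruptedException ie) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                  }
              }

              /** Directories currently considered safe for writes; an empty list means go read-only. */
              public List<File> getWritableDirs() {
                  return writableDirs;
              }

              public void shutdown() {
                  running = false;
                  interrupt();
              }
          }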
          Vinayakumar B added a comment -

          Hi all,

          I am thinking of introducing one daemon thread to check the disks periodically. We can configure the threshold. If any disk is fuller than the threshold, then that disk can be excluded from further writes.

          What are your thoughts about this?

          Uma Maheswara Rao G added a comment -

          Got it. I think we don't need to separate it into a task; otherwise, I don't know which method is used and which method is not used. You could add it in BOOKKEEPER-345, and BOOKKEEPER-346 is based on BOOKKEEPER-345. How about doing it this way?

          +1

          @Rakesh,

          Yes, true. r-o is a subset of the available bookies; it makes sense to keep it under the available znode and have the exclusion logic.

          Makes sense to me. On another thought, how about creating something like an 'available-readonly' node, so that we can avoid the exclusion logic?

          Rakesh R added a comment -

          Got it. I think we don't need to separate it into a task.

          It seems Vinay has updated the patch in BOOKKEEPER-344. I'll inform him.

          If we add a 'readonly' znode under the 'available' znode, we need to exclude it from the available list

          Yes, true. r-o is a subset of the available bookies; it makes sense to keep it under the available znode and have the exclusion logic.

          Hi all, I'd appreciate thoughts from others as well.

          Sijie Guo added a comment - edited

          I've just thought of separating the disk-checking logic. This has some useful APIs which EntryLogger and LedgerCache will be using. (If we have a use case, we will expose it in admin also.)

          Got it. I think we don't need to separate it into a task; otherwise, I don't know which method is used and which method is not used. You could add it in BOOKKEEPER-345, and BOOKKEEPER-346 is based on BOOKKEEPER-345. How about doing it this way?

          Here, I just thought of bkclient behaviour: when reading, it should have logic to find the r-o bookies and be able to serve the read request. While writing, these bookies will be masked and will not be available.

          The read request is sent by reading the LedgerMetadata to find bookies; read requests don't look into the available bookies in BookieWatcher, while write requests do. So moving the bookie znode out of the available list would make a bookie r-o, and no write requests would be sent to it again. That makes things easier.

          How about providing a separate znode like
          '/ledgerrootpath/available/readonly/'

          If we add a 'readonly' znode under the 'available' znode, we need to exclude it from the available list. I would suggest not adding it under the 'available' znode. I have no strong feeling about where to put the znode; either is OK for me.

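
          A minimal sketch of the registration move being discussed, assuming the '/ledgerrootpath/available/readonly' layout mentioned above; the paths, bookie id format, and helper names are illustrative, and the final znode layout was still being debated in this thread.

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooDefs.Ids;
          import org.apache.zookeeper.ZooKeeper;

          public class ReadOnlyRegistration {
              private static final String AVAILABLE = "/ledgerrootpath/available";
              private static final String READONLY = "/ledgerrootpath/available/readonly";

              /** Switch this bookie's registration from writable to read-only. */
              public static void registerReadOnly(ZooKeeper zk, String bookieId)
                      throws KeeperException, InterruptedException {
                  // Ensure the persistent parent for read-only bookies exists.
                  if (zk.exists(READONLY, false) == null) {
                      try {
                          zk.create(READONLY, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                      } catch (KeeperException.NodeExistsException ignore) {
                          // Another bookie created it concurrently; that's fine.
                      }
                  }
                  // Remove the writable registration so clients stop sending writes (-1 matches any version).
                  zk.delete(AVAILABLE + "/" + bookieId, -1);
                  // Advertise read-only presence with an ephemeral znode named IP:PORT.
                  zk.create(READONLY + "/" + bookieId, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
              }
          }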
          Rakesh R added a comment -

          Thanks Sijie.

          Is the utility a separate tool that admin guys could run to check the disk? In which scenarios would it be used?

          I've just thought of separating the disk-checking logic. This has some useful APIs which EntryLogger and LedgerCache will be using. (If we have a use case, we will expose it in admin also.)

          So if we do it this way, I think there are no changes in the bookie client; we just need to change the bookie server to take action when encountering too many IOExceptions:

          Here, I just thought of bkclient behaviour: when reading, it should have logic to find the r-o bookies and be able to serve the read request. While writing, these bookies will be masked and will not be available.

          1) and 2) could be set as a configuration setting to let the user decide what to do, which I called a strategy in a previous comment.

          Yeah, exactly; it's configurable. How about exposing a config item 'readonlyModeEnabled=false' in bk_server.conf?

          So we need a separate znode to hold all r-o bookies for tracking.

          Yes, I agree with Flavio's suggestion to delete the ephemeral znode so that the admin will be able to know about the faulty bookie and act upon it.

          How about providing a separate znode like
          '/ledgerrootpath/available/readonly/'? This 'readonly' persistent znode contains the list of bookies, represented as (IP:PORT) ephemeral znodes, once a bookie is marked as read-only.

          Sijie Guo added a comment -

          @Rakesh,

          1. Utility functions for checking disk problem

          Is the utility a separate tool that admin guys could run to check the disk? In which scenarios would it be used?

          4. Provide mechanism to detect r-o bookie by the bookie clients

          as per Flavio's suggestion,

          One option I see is to have the bookie removing itself from the list of available bookies in zookeeper so that the administrator is able to determine that the bookie has declared itself faulty. The administrator can be notified of the change through a zookeeper watcher. In the meanwhile, the bookie is able to respond to read requests for which it has data. How does it sound?

          If the bookie server encounters an IOException and is not able to write to disks, it removes itself from the available bookies list and doesn't shut down, so the bookie server can still serve read requests. The only minor drawback of this solution is that there is no place to track all the r-o bookies. So we need a separate znode to hold all r-o bookies for tracking.

          So turning a bookie to r-o mode would mean removing itself from the available bookies list and putting itself in the r-o bookies list.

          So if we do it this way, I think there are no changes in the bookie client; we just need to change the bookie server to take action when encountering too many IOExceptions:

          1) remove the bookie znode but don't shut down (turning to r-o)
          2) shut down the bookie server

          1) and 2) could be set as a configuration setting to let the user decide what to do, which I called a strategy in a previous comment.

          5. Provide admin command to start the bookie in read-only mode
          

          Agreed to provide the command to start a bookie in read-only mode. I think for read-only mode, we just start the bookie server, and it adds itself into the r-o bookies list.

          What is your opinion?

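
          A minimal sketch of the configurable strategy discussed above, assuming a hypothetical 'readonlyModeEnabled' flag read from bk_server.conf; the class and method names are illustrative, not the actual bookie code.

          public class WriteFailureStrategy {
              private final boolean readOnlyModeEnabled; // e.g. parsed from bk_server.conf
              private volatile boolean readOnly = false;

              public WriteFailureStrategy(boolean readOnlyModeEnabled) {
                  this.readOnlyModeEnabled = readOnlyModeEnabled;
              }

              /** Invoked when the bookie can no longer write to any journal/ledger directory. */
              public void onFatalWriteFailure() {
                  if (readOnlyModeEnabled) {
                      // Option 1): keep serving reads and stop advertising as writable (e.g. move the ZK znode).
                      readOnly = true;
                  } else {
                      // Option 2), the default: fail loudly so the admin notices and can replace the node.
                      shutdownBookie();
                  }
              }

              public boolean isReadOnly() {
                  return readOnly;
              }

              private void shutdownBookie() {
                  // Placeholder for the bookie shutdown sequence.
              }
          }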
          Rakesh R added a comment -

          Hi all, should we provide an admin command to start the bookie in read-only mode?
          I'd like to know others' opinions as well.

          Thanks,
          Rakesh

          Rakesh R made changes -
          Link This issue is related to BOOKKEEPER-126 [ BOOKKEEPER-126 ]
          Rakesh R added a comment -

          I'm just changing this sub-task into a separate issue so that sub-tasks can be created under it.

          Rakesh R made changes -
          Parent BOOKKEEPER-126 [ 12533151 ]
          Issue Type Sub-task [ 7 ] Bug [ 1 ]
          Rakesh R added a comment -

          Thanks all for your time and inputs. Based on the discussion we had in this JIRA, I have identified the following sub-tasks to implement the r-o mode incrementally.

          1. Utility functions for checking disk problem
          2. Detect IOExceptions on entrylogger and bookie should consider next ledger dir(if any)
          3. Detect IOExceptions on ledger cache and bookie should consider next ledger dir(if any)
          4. Provide mechanism to detect r-o bookie by the bookie clients
          5. Provide admin command to start the bookie in read-only mode (Raised as a suggestion)

          -Rakesh

          Rakesh R added a comment -

          Here is one more scenario that needs to be handled, which was discussed on BOOKKEEPER-180:
          Adding a new ledger and flushing fails in the SyncThread due to disk full, but the server did not shut down here.

          2012-06-11 14:00:14,696 - ERROR [SyncThread:InterleavedLedgerStorage@156] - Exception flushing Ledger

          java.io.IOException: No space left on device
          	at sun.nio.ch.FileDispatcher.write0(Native Method)
          	at sun.nio.ch.FileDispatcher.write(FileDispatcher.java:39)
          	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)
          	at sun.nio.ch.IOUtil.write(IOUtil.java:26)
          	at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:198)
          	at org.apache.bookkeeper.bookie.BufferedChannel.flush(BufferedChannel.java:109)
          	at org.apache.bookkeeper.bookie.EntryLogger.flush(EntryLogger.java:280)
          	at org.apache.bookkeeper.bookie.InterleavedLedgerStorage.flush(InterleavedLedgerStorage.java:154)
          	at org.apache.bookkeeper.bookie.Bookie$SyncThread.run(Bookie.java:200)
          
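
          A minimal sketch of the SyncThread scenario above, assuming hypothetical LedgerStorage and FailureHandler interfaces: the flush failure is escalated so the bookie can turn read-only (or shut down, per the chosen strategy) instead of only being logged.

          import java.io.IOException;

          public class SyncLoop implements Runnable {
              public interface LedgerStorage { void flush() throws IOException; }
              public interface FailureHandler { void onFlushFailure(IOException cause); }

              private final LedgerStorage storage;
              private final FailureHandler handler;
              private final long flushIntervalMs;
              private volatile boolean running = true;

              public SyncLoop(LedgerStorage storage, FailureHandler handler, long flushIntervalMs) {
                  this.storage = storage;
                  this.handler = handler;
                  this.flushIntervalMs = flushIntervalMs;
              }

              @Override
              public void run() {
                  while (running) {
                      try {
                          storage.flush();
                      } catch (IOException ioe) {
                          // Don't swallow the failure: hand it to the bookie so it can go
                          // read-only or shut down rather than silently dropping flushes.
                          handler.onFlushFailure(ioe);
                          return;
                      }
                      try {
                          Thread.sleep(flushIntervalMs);
                      } catch (InterruptedException ie) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                  }
              }

              public void shutdown() {
                  running = false;
              }
          }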
          Uma Maheswara Rao G added a comment -

          Thanks a lot, Flavio and Sijie, for your opinions.
          @Flavio

          One option I see is to have the bookie removing itself from the list of available bookies in zookeeper so that the administrator is able to determine that the bookie has declared itself faulty. The administrator can be notified of the change through a zookeeper watcher. In the meanwhile, the bookie is able to respond to read requests for which it has data. How does it sound?

          Agreed.

          So how about making it a strategy, where the user can decide between shutting down or turning to read-only when encountering a faulty bookie?

          Sounds good to me.
          Just adding a config parameter for this option should also be OK. If we enable it, the bookie will turn automatically to read-only mode. If we don't enable it, it will shut down by default. Admins can also start the bookie in read-only mode explicitly.

          Sijie Guo added a comment -

          Shutting down a bookie or turning it read-only depends on the application. For the HDFS NameNode, it might be critical to shut down a faulty bookie immediately. For a pub/sub case, turning a bookie server read-only could keep the bookie serving read requests, which doesn't affect message traffic. So how about making it a strategy, where the user can decide between shutting down or turning to read-only when encountering a faulty bookie?

          I agree with Flavio's point about making the failure noticeable. Besides that, another option is JMX monitoring: we could expose the bookie status through JMX, so if a bookie is turned read-only, the admin could be alerted by the JMX value changing.

          Flavio Junqueira added a comment -

          One option I see is to have the bookie removing itself from the list of available bookies in zookeeper so that the administrator is able to determine that the bookie has declared itself faulty. The administrator can be notified of the change through a zookeeper watcher. In the meanwhile, the bookie is able to respond to read requests for which it has data. How does it sound?

          Let me say that I don't feel strongly about making the bookie read-only, but I'm ok with it if others feel it is important.

          Uma Maheswara Rao G added a comment -

          Yes, my point is that when we try to save the edits in the NameNode dirs (let's say 4 volumes), if all 4 volumes fail, we shut down the node, because this is a serious problem. Here also we can make it read-only, but it is important to make it noticeable to admins.

          In the same way, why can't we take the bookie down when all ledger disks fail? That way the admin can take action immediately, because one node being down in the cluster is very noticeable. We can provide read-only mode as an option to start the node in that mode, instead of converting automatically. If the admin realizes that there is no other option to replace the node or correct it, he can start the node with the read-only option.

          If we silently convert to read-only mode, the user may not get any failures, so he may think that the cluster is fine, but one node might have been converted to read-only mode and is not able to store new data any more.

          What I am trying to say is: make the serious failures noticeable and make them fail loudly.

          Am I missing something here?

          Flavio Junqueira added a comment -

          When flush fails (assuming the disk failed here), why do we need to turn it read-only automatically?

          Upon such a failure, we shouldn't let the bookie continue processing updates because its state is broken. There are two choices I see in this case: the bookie stops processing requests altogether or it serves only read requests. Making it read-only sort of assumes that the existing bookie state is correct, but it may not be. Given that we use digests, we could perhaps assume that they are sufficient to detect corrupt data. But, the safest choice is really to stop processing requests and drop all connections.

          I think the admin should get the failure notice, instead of silently turning to read-only mode.

          Agreed.

          I did not get the actual use case of read-only mode. When we have only one disk and it fails, we may have to shut down the node, right?

          If we lose the transaction log device, then we can't keep writing, but we can read from the ledger devices. If we have multiple ledger devices and one is gone, then we could still serve reads from the other disks. In this last case, the bookie can only serve read requests partially, since the ledgers stored in the broken device won't be served.

          It might be complex to handle the different scenarios, and perhaps the best course of action would be to reject new requests altogether for simplicity. Perhaps this is the point you're getting at, Uma.

          Uma Maheswara Rao G added a comment -

          BookKeeper should change to read-only (r-o) mode when journal/ledger flushing fails with an IOException. The bookie should then reject write requests on the server side and accept only read requests from clients, because even if flushing fails, the data in the bookie that has already been flushed is still valid.

          When flush fails (assuming the disk failed here), why do we need to turn it read-only automatically?
          I think the admin should get the failure notice, instead of silently turning to read-only mode.
          See the cases of the NameNode and DataNode: we shut down the nodes when all disks in the node have failed, so that the admin can take action to replace it with a good node or correct the existing one. I did not get the actual use case of read-only mode. When we have only one disk and it fails, we may have to shut down the node, right?
          If we have other cases for it, it would be great to have this feature as optional. Could you please clarify more from a requirements perspective? Is it like safemode? How can this node recover in the future?

          Ivan Kelly made changes -
          Fix Version/s 4.2.0 [ 12320244 ]
          Fix Version/s 4.1.0 [ 12319145 ]
          Sijie Guo made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Sijie Guo added a comment -

          Canceling it until a new patch is attached.

          Rakesh R added a comment -

          Yeah Sijie, you're right, BookieWatcher should have the logic to refresh the bk list. Thanks for the clarifications. I will consider the other scenarios also.

          Sijie Guo added a comment -

          I am keeping the r-o status in the bookie's presence znode (which is ephemeral); when restarting, it creates a new presence znode, like a normal r-w bookie.

          3 bookies, b1, b2, b3.

          at time t1, BookieWatcher has
          availableBookies: b1, b2, b3,
          roBookies: empty

          at time t2, bookie b3 is turned to R-O mode. BookieWatcher changes the lists to:
          availableBookies: b1, b2, b3
          roBookies: b3

          #getAdditionalBookie will get a bookie from (b1, b2)

          at time t3, bookie b3 is restarted from R-O mode (as a normal r-w bookie). BookieWatcher's lists are still the same as at t2,
          since b3 will not be removed from the roBookies list.
          So #getAdditionalBookie can still only get bookies from (b1, b2), right?

          I hope you are pointing me to consider the ledger flush failures in the 'SyncThread'?

          Also, I missed dealing with the ledgerCache failures; it would be great to see your thoughts/opinion.

          SyncThread is one place.
          The other place is in LedgerCache, which calls #flushLedger when evicting pages.

          BTW, from the title of this JIRA, it also needs to handle journal flushing failures, right? Maybe you need to change the code in Bookie#run() to turn the bookie into R-O mode instead of shutting it down directly.

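
          A minimal sketch of the BookieWatcher-style selection being discussed: writable candidates are the available bookies minus the read-only ones and those already in the ensemble. The class and method names are illustrative, not the actual BookieWatcher code; it also shows why a bookie must be dropped from the r-o set when it comes back writable.

          import java.util.HashSet;
          import java.util.Set;

          public class BookieSelector {
              private final Set<String> availableBookies = new HashSet<>();
              private final Set<String> roBookies = new HashSet<>();

              /** Called from ZK watchers when either children list changes. */
              public synchronized void update(Set<String> available, Set<String> readOnly) {
                  availableBookies.clear();
                  availableBookies.addAll(available);
                  roBookies.clear();
                  roBookies.addAll(readOnly);
              }

              /** Pick an additional bookie that is writable and not already in the ensemble. */
              public synchronized String getAdditionalBookie(Set<String> currentEnsemble) {
                  for (String bookie : availableBookies) {
                      if (!roBookies.contains(bookie) && !currentEnsemble.contains(bookie)) {
                          return bookie;
                      }
                  }
                  return null; // no writable bookie left
              }
          }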
          Rakesh R added a comment -

          Thanks Sijie for the comments and time. Yes, I will put the re-worked patch on the review board.

          1) In BookieWatcher, if a bookie server is restarted (maybe by admin guys) from R-O mode, there seems to be no logic to move a bookie from R-O mode back to write mode. Do I miss something?

          I am keeping the r-o status in the bookie's presence znode (which is ephemeral); when restarting, it creates a new presence znode, like a normal r-w bookie.

          One thing that might be missed is that when a forced flush (on ledger page eviction) fails, we might need to consider turning the bookie into R-O mode too.

          I hope you are pointing me to consider the ledger flush failures in the 'SyncThread'?

          Also, I missed dealing with the ledgerCache failures; it would be great to see your thoughts/opinion.

          Thanks,
          Rakesh

          Sijie Guo added a comment -

          One thing that might be missed is that when a forced flush (on ledger page eviction) fails, we might need to consider turning the bookie into R-O mode too.

          Sijie Guo added a comment -

          Thanks Rakesh, it is good work.

          Just some comments on your patch.

          1) In BookieWatcher, if a bookie server is restarted (maybe by admin guys) from R-O mode, there seems to be no logic to move a bookie from R-O mode back to write mode. Do I miss something?

          2) It would be better to define "R-O" as a constant, maybe in BookieProtocol.

          3) I don't see anyone using "roEnsembles" defined in LedgerMetadata.java. So who will use it?

          4) The exception is thrown when no writable dirs are found, so the error message should be improved so that admin guys know what happened. BTW, I am not sure BadLedgerDirsException is good enough to describe this kind of issue; how about NoWritableLedgerDirException?

          throw new Bookie.BadLedgerDirsException("Failed to create new log in the ledger dirs : " + list);
          
          +        ArrayList<File> listOfDirs = new ArrayList<File>(list);
          +        listOfDirs.removeAll(writeFailedDirs);
          

          How about keeping a list of writable dirs, so we don't need to init a list and remove the failed dirs each time?

          5) In Bookie#addEntryWithRo,

          +                zk.setData(myPresenceInZK, "R-O".getBytes(), 0);
          

          This seems to be a conditional set; it might fail if someone else changed that znode and incremented the znode version. I am not sure whether there is any consideration behind using version 0. Why not use -1?
          From my side, in the future we could provide a tool for admin guys to turn a bookie server into R-O mode to fix some issue and then turn it back to R-W mode. Since the znode version would then be changed outside the bookie server, it would be better to use -1 to match any version of the znode.

          BTW, it would be better to put the patch on the review board (https://reviews.apache.org) if the patch is not simple, which is more convenient for reviewing.

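
          A minimal sketch of the version argument discussed in point 5): using -1 makes setData unconditional (it matches any znode version), whereas 0 only succeeds while the znode has never been updated. The path and helper name here are illustrative.

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;

          public class PresenceUpdater {
              /** Mark the bookie's presence znode as read-only regardless of its current version. */
              public static void markReadOnly(ZooKeeper zk, String presencePath)
                      throws KeeperException, InterruptedException {
                  // -1 means "any version"; an external admin tool may already have bumped the version.
                  zk.setData(presencePath, "R-O".getBytes(), -1);
              }
          }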
          Rakesh R made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 4.1.0 [ 12319145 ]
          Rakesh R added a comment -

          Hi Sijie/Ivan,

          I have attached the initial patch.
          Here I have tried to put the server into r-o mode only on ledger failures (leaving the existing logic of bookie shutdown on journal failures untouched), and it simultaneously updates the r-o status on its corresponding ephemeral entry in ZK. This status will be lost on server restart. Otherwise I could use a persistent node, but there is a chance of orphan entries if I restart the server with a different ip/port. Please give your opinion on this or any thoughts.

          I will also try to add more test cases.

          Thanks,
          Rakesh

          Rakesh R made changes -
          Attachment BOOKKEEPER-199.patch [ 12521947 ]
          Flavio Junqueira added a comment -

          Hi Rakesh, The idea we had for BOOKKEEPER-201 was very simple. If we observe that the latency of writing to a given disk is high, then an operator can manually tell the bookie to stop creating ledgers on that disk. In the case that the bookie has multiple ledger devices available, it can slowly shift the traffic to the other devices.

          The case of a broken disk is a bit more complicated. In this case, the best course of action I can see right now is to have an operator removing the bookie from the pool and executing a bookie recovery procedure for the ledger fragments on the dead disk.

          The cases you're mentioning are a bit different, since you can observe problems by catching exceptions, so you can automate the decision of removing the bookie from the pool. It sounds OK to me to do it.

          Rakesh R added a comment -

          Yup, as per our discussion in BOOKKEEPER-126, here I was trying to tick off the failed (IOE) ledger dir. I feel this is similar to the 'continuously underperforming' idea in BOOKKEEPER-201. Could you give more details on the underperforming logic?

          Ivan Kelly added a comment -

          This probably has a lot of crossover with BOOKKEEPER-201

          Rakesh R added a comment -

          Following are just some thoughts to take the first step:

          1. Ledger flushing throws IOException:
            Say the bookie has multiple ledger dirs (/tmp/bk1-data, /tmp/bk2-data).
            On an IOException, the bookie should consider the next ledger dir and probably mark the tried dirs as BAD_FOR_WRITE, so they will not be selected next time. Finally, if there is no success, switch the bookie to r-o mode (see the sketch after this comment).
          2. Journal flushing throws IOException:
            Immediately switch to r-o mode. AFAIK, there are no multiple dirs for the journal.

          Thanks,
          Rakesh

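
          A minimal sketch of the ledger-dir failover idea in item 1 above: on an IOException, mark the current dir BAD_FOR_WRITE and try the next one; when none is left, signal that the bookie should go read-only. The class and method names are illustrative, not the actual bookie code.

          import java.io.File;
          import java.io.IOException;
          import java.util.LinkedHashSet;
          import java.util.List;
          import java.util.Set;

          public class LedgerDirSelector {
              private final List<File> ledgerDirs;
              private final Set<File> badForWrite = new LinkedHashSet<>();

              public LedgerDirSelector(List<File> ledgerDirs) {
                  this.ledgerDirs = ledgerDirs;
              }

              /** Record a write failure so this dir is skipped on subsequent selections. */
              public synchronized void markBadForWrite(File dir) {
                  badForWrite.add(dir);
              }

              /**
               * Pick the next writable ledger dir, or throw if all dirs have failed,
               * which is the point where the bookie should switch to read-only mode.
               */
              public synchronized File pickWritableDir() throws IOException {
                  for (File dir : ledgerDirs) {
                      if (!badForWrite.contains(dir)) {
                          return dir;
                      }
                  }
                  throw new IOException("No writable ledger dirs left; switch bookie to read-only mode");
              }
          }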
          Rakesh R made changes -
          Summary Support ReadOnly mode when IOE flushing journal/ledgers Provide bookie readonly mode, when journal/ledgers flushing has failed with IOE
          Fix Version/s 4.1.0 [ 12319145 ]
          Component/s bookkeeper-server [ 12314394 ]
          Component/s bookkeeper-client [ 12314393 ]
          Rakesh R made changes -
          Field Original Value New Value
          Fix Version/s 4.1.0 [ 12319145 ]
          Rakesh R created issue -

            People

            • Assignee: Vinayakumar B
            • Reporter: Rakesh R
            • Votes: 0
            • Watchers: 7

