Bookkeeper / BOOKKEEPER-126

EntryLogger doesn't detect when one of its logfiles is corrupt

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If an entry log is corrupt, the bookie will ignore any entries past the corruption. Quorum writes stop this from being a problem at the moment, but we should detect corruptions like this and re-replicate if necessary.


          Activity

          Ivan Kelly added a comment -

          Closing as dupe as discussed.

          Uma Maheswara Rao G added a comment -

          JIRA description:

          > but we should detect corruptions like this and rereplicate if necessary.

          Yes, once BK-237 and BK-199 are completely closed, this JIRA can be closed.

          But in BK-237, the actual sub-task (BOOKKEEPER-293) that solves this problem has not been addressed yet, so I don't think we can close this as resolved/fixed/implemented now.

          I am OK with closing this JIRA as a duplicate of them (199, 293), since they will cover this functionality when they are closed.

          Sijie Guo added a comment -

          As Rakesh R mentioned, this jira was used to initiate BOOKKEEPER-199 and BOOKKEEPER-237, so we can link them and mark this jira as resolved. If we want new functionality to improve disk failure handling, a new jira with new discussions would be better.

          Flavio Junqueira added a comment -

          Hi Rakesh,

          Thanks for chiming in. My question comes mostly from the fact that this jira is marked as a blocker for 4.2.0, and we are trying to cut a release.

          It sounds fine to incorporate other mechanisms that allow us to recover more efficiently from various faults, but I'm not sure we should have them for 4.2.0. Is it OK if we have those for 4.3.0, or do you think they are blockers for 4.2.0? If we move this new functionality to 4.3.0, it would be great to create new jiras or update existing ones to reflect it.

          Rakesh R added a comment -

          Hi Flavio,

          I'd like to add a few more points, and I hope the following helps give us more clarity and a conclusion.

          This JIRA has helped us initiate discussion and covered the following cases:
          1) ledger entry failures during flushing/addEntry, as part of BOOKKEEPER-199.
          2) bookie failures initiating re-replication, with the ledger entries recovered by the autorecovery process, BOOKKEEPER-237.

          There could be room for improvement in handling "disk failures" (this is also another kind of data corruption). Say we have a few closed ledgers present on disk1 and this disk fails. Presently, if this happens, the admin has the option to shut down the bookie, and in turn autorecovery would start re-replication. But can we think of an improvement: (bookie) self disk scanning and a healing mechanism?

          What's your opinion?

          -Rakesh

          Flavio Junqueira added a comment -

          My understanding is that this jira has been fixed in BOOKKEEPER-199. Can anyone confirm just so that we resolve this jira?

          Sijie Guo added a comment -

          Since the fsck-like tool is planned to be added in 4.2.0, how about moving this task (including its sub-task BOOKKEEPER-199) from 4.1.0 to 4.2.0?

          Sijie Guo added a comment -

          > One corner case: what if this re-write operation fails in all the ledger dirs? Probably it is forced to choose another bookie and form a new ensemble?

          Maybe. A simple way is to choose a bookie to replicate all entries of the ensemble that contains the corrupted/missing ones.

          Say the entries 0-100 ledger metadata mapping is
          0 (A, B, C)
          50 (B, C, D)
          End Ledger: 100

          As in your example, 30-39 is corrupted and B hits the corner case you stated, so it is forced to choose another bookie E. E replicates the entries between 0~50 that belong to B, and we replace (A, B, C) with (A, E, C), as BookKeeperAdmin does.

          > Secondly, for an opened/in-recovery ledger, how about the idea of checking all the brothers and, if no read succeeds for entry x, considering (x - 1) as the last entry? Packets in flight would be considered in the next pass; is that OK?

          Basically seems OK.

          Rakesh R added a comment -

          Oh, I think I understood: for the corrupted/missing entries, the bookie should read from a brother bookie and try a write operation to itself (no ensemble change required).
          One corner case: what if this re-write operation fails in all the ledger dirs? Probably it is forced to choose another bookie and form a new ensemble?

          Secondly, for an opened/in-recovery ledger, how about the idea of checking all the brothers and, if no read succeeds for entry x, considering (x - 1) as the last entry? Packets in flight would be considered in the next pass; is that OK?

          Sijie Guo added a comment -

          Thanks, Rakesh.

          +1 for opening a new jira to discuss r-o mode when an IOE occurs while flushing the journal/ledgers.

          > b) Read the entries and identify missing entries, if any?
          > Yeah, the DistributionScheduling happens on the client side, and batch reading is also good.
          > I am thinking that since the ledgers are local to the server, how about reading them directly instead of using PerChannelBookieClient?

          Oh, it seems I didn't explain clearly in my previous comment. My thought is that the bookie server would just find the corrupted/missing entries that it should own, then schedule a re-replication procedure itself to read the corrupted/missing entries from its brother bookie servers (in the same quorum). So the read is a remote read from another server.

          In this way, we don't even need to change the metadata in ZooKeeper.

          As in the example you explained:

          Say the entries 0-100 ledger metadata mapping is
          0 (A, B, C)
          50 (B, C, D)
          End Ledger: 100

          B runs a scanner itself and finds that 30-39 is corrupted/missing. It schedules a re-replication of (30-39); the re-replication would be a remote read of (30-39) from C or D. We don't need to change ledger metadata; changing ledger metadata would introduce a distributed consensus issue (you can refer to the discussion in BOOKKEEPER-112). A sketch of this self-healing pass follows below.

          > @Sijie
          > another tough thing is that we need to tell a closed ledger from an opened/in-recovery ledger when handling the last ensemble of an opened/in-recovery ledger.
          >
          > Am I missing something? Could you give more details on this?

          For a closed ledger, we know the entry range of an ensemble, but for an opened/in-recovery ledger, we have no idea about the end entry of the last ensemble.
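
          A minimal sketch of the self-healing pass described above, in plain Java. RemoteReader and LocalStorage are hypothetical interfaces, not the actual BookKeeper API; only the shape of the procedure is the point:

            import java.util.List;

            // Illustrative stand-ins for a remote read client and the local entry store.
            interface RemoteReader {
                byte[] readEntry(String bookie, long ledgerId, long entryId) throws Exception;
            }
            interface LocalStorage {
                void addEntry(long ledgerId, long entryId, byte[] data) throws Exception;
            }

            final class SelfHealer {
                // For each corrupted/missing entry, read it from a brother bookie in
                // the same quorum and re-write it locally; ledger metadata in ZK is
                // never changed, avoiding the consensus issue mentioned above.
                static void heal(RemoteReader reader, LocalStorage storage, long ledgerId,
                                 List<Long> missingEntries, List<String> quorumPeers) {
                    for (long entryId : missingEntries) {
                        for (String peer : quorumPeers) {
                            try {
                                storage.addEntry(ledgerId, entryId,
                                                 reader.readEntry(peer, ledgerId, entryId));
                                break; // healed from this peer; move to the next entry
                            } catch (Exception e) {
                                // read or local write failed; try the next peer
                            }
                        }
                    }
                }
            }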

          Rakesh R added a comment -

          So if I understand the conclusion correctly, we have discussed and identified two cases to be implemented as part of this jira:

          1. When ledger flushing fails with an IOException
            Solution: r-o mode (see the sketch after this comment).
            >> On an IOE, the bookie (say, with multiple ledger dirs -> /tmp/bk1-data, /tmp/bk2-data, etc.) should try the next ledger dir for writing and mark the tried dirs as BAD_FOR_WRITE. Finally, if there is no success, switch to r-o mode.
            >> Also, if the journal fails with an IOE, immediately switch to r-o mode.
            Shall I open a subtask for the impl?
          2. Ledger entries corrupted due to disk failures or bad sectors
            Solution: scanner approach.
            IMHO, the following is the sequence of the healing procedure:
          • a) Perform a scan and prepare the map of owned entries:
            >> On startup the bookie would contact ZK for the ledger metadata, and on every write it would update the ledger metadata map.
            >> A special data structure <ledgerDirId, <entryId, replica bookies>> needs to be designed for this, containing ledgerId, owned entries, ledger dirs, etc.?
          • b) Read the entries and identify missing entries, if any?
            Yeah, the DistributionScheduling happens on the client side, and batch reading is also good.
            I am thinking that since the ledgers are local to the server, how about reading them directly instead of using PerChannelBookieClient?
          • c) Initiate re-replication:
            The corrupted bookie first identifies the peer bookie which has the copy and sends it a notification for re-replication. Here it could use ZK watchers for sending the notification; for this, each bookie should listen on a specific persistent znode, say 'underreplicaEntries'. The corrupted bookie should write the data <ledgerId, missingEntryIds> to the 'underreplicaEntries' of the corresponding bookie which has the copy. On notification, the peer bookie should use the same DistributionScheduling logic that is present on the client side.
            Is it legal for the server to depend on the client? Otherwise, the server could randomly select a re-replica bookie and update the ZK ledger metadata?

          What would the ZK ledger metadata ('nextReplicaIndexToReadFrom') look like after re-replication?
          For example:
          Say the entries 0-100 ledger metadata mapping is
          0 (A, B, C)
          50 (B, C, D)
          End Ledger: 100

          Assume entries 30 to 39 got corrupted in B and, say, were re-replicated to E. Is it like this?
          0 (A, B, C)
          30 (E, B, C)
          40 (B, C, D)
          50 (B, C, D)

          If you agree with the above approaches, we can probably do a detailed write-up.

          > @Sijie
          > another tough thing is that we need to tell a closed ledger from an opened/in-recovery ledger, when handling the last ensemble of an opened/in-recovery ledger.

          Am I missing something? Could you give more details on this?
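
          A minimal sketch of the r-o failover in case 1, with illustrative names throughout (none of this is the actual BookKeeper API):

            import java.io.File;
            import java.util.ArrayList;
            import java.util.HashSet;
            import java.util.List;
            import java.util.Set;

            // Rotate through ledger dirs on write failure, marking failed dirs
            // BAD_FOR_WRITE; if none remain (or the journal fails), go read-only.
            class LedgerDirFailover {
                private final List<File> ledgerDirs;
                private final Set<File> badForWrite = new HashSet<>();
                private volatile boolean readOnly = false;

                LedgerDirFailover(List<File> ledgerDirs) {
                    this.ledgerDirs = new ArrayList<>(ledgerDirs);
                }

                // Next dir that has not failed yet; null means none are left.
                File pickWritableDir() {
                    for (File dir : ledgerDirs) {
                        if (!badForWrite.contains(dir)) {
                            return dir;
                        }
                    }
                    return null;
                }

                // Called when a write/flush to 'dir' throws an IOException.
                void onWriteError(File dir) {
                    badForWrite.add(dir);
                    if (pickWritableDir() == null) {
                        readOnly = true; // no writable dir left: serve reads only
                    }
                }

                // A journal IOE switches to r-o mode immediately, as proposed above.
                void onJournalError() {
                    readOnly = true;
                }
            }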

          Sijie Guo added a comment -

          Yup, a scanner is a possible way to handle under-replicated blocks. Following are just some thoughts of mine.

          An entry might be placed multiple times in different entry log files, due to journal replaying. The only referenced entry position is recorded in the ledger index, so the scanner should scan ledger by ledger. As for the place to run the scanner, I guess it would be better in GarbageCollectorThread, after the gc actions (we don't need to care about the gc'd ledgers).

          When scanning a ledger, the bookie server should know which entries it should own, which means the bookie server needs the distribution info of a ledger. Maybe we can record which DistributionSchedule a ledger used in the ledger metadata.

          For inter-bookie communication, why not consider using PerChannelBookieClient? And it may be better to add a batch read op for performance reasons.

          Another tough thing is that we need to tell a closed ledger from an opened/in-recovery ledger, when handling the last ensemble of an opened/in-recovery ledger.
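
          To make the "which entries should this bookie own" point above concrete: a minimal sketch, assuming a round-robin distribution schedule in which entry e is written to the writeQuorumSize bookies starting at ensemble index e % ensembleSize (an assumption for illustration, not a statement about the actual BookKeeper code):

            // Under the round-robin striping assumed above, the write set of
            // entry e is writeQuorumSize bookies starting at index e % ensembleSize.
            final class OwnershipCheck {
                // True if the bookie at myIndex in the ensemble should hold entryId.
                static boolean shouldOwnEntry(long entryId, int myIndex,
                                              int ensembleSize, int writeQuorumSize) {
                    int offset = (int) ((myIndex - entryId) % ensembleSize);
                    if (offset < 0) {
                        offset += ensembleSize; // normalize Java's signed modulo
                    }
                    return offset < writeQuorumSize;
                }
            }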

          Rakesh R added a comment -

          Yeah. When writing, having the bookkeeper client not choose the r/o bookie, irrespective of whether the ledger and journal directories are on the same or different disks, seems more feasible to me.

          @Sijie
          > A flushing failure will not cause any entry to be under-replicated (journal replay will recover it). The case we need to consider is entries before lastLogMark: if corruption happened to these entries, they are under-replicated.

          Regarding this point: corruption due to disk failure will be a corner case, but I feel it would be good to consider it as well, since bookkeeper is intended for very sensitive metadata (either in this jira or a separate jira task). Here it might be required to have a periodic scanner that handles under-replicated blocks.

          Sijie Guo added a comment -

          Yeah. When I was working on BOOKKEEPER-180, I had considered letting the bookie go into readonly mode. One thing to do is to reject write requests on the server side; the other is to have the bookkeeper client not choose the readonly bookie, since otherwise it would increase latency due to changing the ensemble.

          Ivan Kelly added a comment -

          Hmm, I'm not sure about shutting down now, actually, because even if flushing fails, the data in the bookie that has already been flushed is still valid. It might make more sense to make the bookie readonly.

          Sijie Guo added a comment -

          > Can we narrow this down to cases where an IOException occurs while flushing ledger entries and the bookie is still running? Only those entries would be selected as under-replicated,

          I think a flushing failure will not cause any entry to be under-replicated (journal replay will recover it). The case we need to consider is entries before lastLogMark: if corruption happened to these entries, they are under-replicated. Your proposal-2 and proposal-3 could be used for detecting/re-replicating these entries.

          The only side-effect of a flushing failure is that all following writes may fail, but reads could still succeed: the data that failed to flush is still buffered in EntryLogger, so it can be read.

          If we don't shut down the bookie server, it would still be in the available list. Write requests can still be sent to this bookie, but they would fail; the client would choose a new ensemble to write to, which increases write latency (as we found in BOOKKEEPER-180).

          For some IOExceptions, such as 'not enough disk space', we should shut down the bookie server immediately to exclude it from the available list. I am not sure whether there is any other recoverable IO exception (meaning the first flush fails with an IOException but the second succeeds)? If not, I think we could shut down the bookie server when encountering an IOException while flushing data.

          Rakesh R added a comment -

          @Sijie
          I agree with you; the simple logic is to shut down the bookie when the threshold is reached and give control to the bookie recovery admin tool, or restart the bookie. But I just added the alternative idea of handling replicas to dig more...

          Rakesh R added a comment -

          Thanks Sijie and Ivan for the suggestions.

          I think a journal IOException would immediately reach the client as an addEntry failure, so the client would be able to act upon it.

          IMO, can we narrow this down to cases where an IOException occurs while flushing ledger entries and the bookie is still running? Only those entries would be selected as under-replicated; either an external tool can be triggered, or the bookie can be shut down.

          Also, I am a bit confused about when to shut down the bookie, and how to define the threshold value for the number of IOExceptions. I feel that, instead of shutting down the bookie on an IOException, we could make use of the ZK metadata (the entry-bookie mappings) and find a way, using ZK watchers, to notify the peer bookies which have a replica of that entry (i.e., build an inter-bookie protocol through ZK).

          Sijie Guo added a comment -

          @Ivan

          A good point. Currently we don't kill the bookie.

          If the ledger directory and journal directory are on the same disk, journal flushing would fail, and then the bookie would be killed.

          I think it would be great to add logic to shut down the bookie when it encounters too many IOExceptions during flushing.
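
          A minimal sketch of such a guard, with an assumed threshold and an illustrative shutdown hook (not the actual BookKeeper shutdown path):

            import java.util.concurrent.atomic.AtomicInteger;

            // Count flush IOExceptions; past a threshold, shut the bookie down so
            // it drops out of the available list instead of failing writes slowly.
            class FlushFailureGuard {
                private static final int MAX_FLUSH_FAILURES = 5; // assumed; would be configurable
                private final AtomicInteger failures = new AtomicInteger();
                private final Runnable shutdownHook; // e.g. triggers bookie shutdown

                FlushFailureGuard(Runnable shutdownHook) {
                    this.shutdownHook = shutdownHook;
                }

                // Called from the flush path whenever an IOException is caught.
                void onFlushFailure() {
                    if (failures.incrementAndGet() >= MAX_FLUSH_FAILURES) {
                        shutdownHook.run();
                    }
                }
            }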

          Ivan Kelly added a comment -

          @Sijie
          Do we kill the bookie, though? I think if many errors occur on flushing, we should take the bookie out of rotation, as it indicates a failing disk.

          @Rakesh
          Proposal-2 is interesting, but it would only run on read, by which time it could be too late. I think Proposal-3 is something we need in any case, though to start it wouldn't have to be a daemon, but a tool that an admin could run to verify the filesystem is in order.

          Sijie Guo added a comment -

          Thanks, Rakesh R.

          For Proposal-1, a flush error in SyncThread would not cause data loss. If a flush error happens, we don't roll the log marker, so all the entries are still in the journal files. These journal files can be replayed when the bookie server is restarted.

          Rakesh R added a comment -

          Yes, I agree with you. It would be good if we were able to handle under-replication in bookkeeper.

          Following are the thoughts that come to my mind; please go through them.

          Proposal-1) As per my observation, apart from the 'bookie down' scenario (where the admin tool can be automated), the failure of any one of the following 'flush()' operations can lead to data loss. Since it is an async operation, the client will be unaware of these failures, and further entries will override the data; so only these entries need to be considered 'under-replicated' and have the under-replication action initiated.
          Bookie.java

          // SyncThread flushes the ledger index pages and then the entry log;
          // if either flush throws, flushFailed is set and the log mark is not rolled.
          try {
              ledgerCache.flushLedger(true);
          } catch (IOException e) {
              LOG.error("Exception flushing Ledger", e);
              flushFailed = true;
          }
          try {
              entryLogger.flush();
          } catch (IOException e) {
              LOG.error("Exception flushing entry logger", e);
              flushFailed = true;
          }

          Proposal-2) Initiate recovery whenever the client finds a missing entry and then successfully gets it from the next bookie.
          There is still a window for data loss: say some data gets lost/corrupted and there is no read operation in the near future.

          Proposal-3) A daemon thread can be associated with every bookie to periodically scan its own ledgers and entries; if it finds any errors, it can contact ZK and try to initiate replication of those entries.
          In this case, a mechanism for bookies to communicate needs to be built; as per my understanding, no inter-bookie protocol exists. Also, the cost of scanning will be very high if there are many ledgers/entries.

          -Rakesh

          Ivan Kelly added a comment -

          I haven't seen this issue occur in the wild, but it's something we've reasoned is possible. So imagine that a logfile becomes corrupt, be it by truncation or by being full of junk. When the bookie tries to read an entry that was contained within the corrupted section, the read will fail as if that entry did not exist. This is safe within the system, because the entry will be read from another bookie. However, we've lost a replica, so the data is under-replicated and we don't know it. For this reason, each bookie should run some sort of fsck process at an interval to ensure that everything is replicated sufficiently.
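
          A minimal sketch of the shape such a periodic fsck pass could take (LocalReader and LedgerIndex are hypothetical interfaces, not real BookKeeper classes):

            import java.util.List;

            // Illustrative stand-ins for local entry verification and the ledger index.
            interface LocalReader {
                // Reads one local entry and verifies it (length/offset/checksum).
                void verifyEntry(long ledgerId, long entryId) throws Exception;
            }
            interface LedgerIndex {
                List<Long> localLedgers();
                List<Long> expectedEntries(long ledgerId);
                void markUnderReplicated(long ledgerId, long entryId);
            }

            final class Fsck {
                // Walk every entry this bookie should hold; anything unreadable has
                // silently lost its local replica and is flagged for re-replication.
                static void fsckPass(LocalReader reader, LedgerIndex index) {
                    for (long ledgerId : index.localLedgers()) {
                        for (long entryId : index.expectedEntries(ledgerId)) {
                            try {
                                reader.verifyEntry(ledgerId, entryId);
                            } catch (Exception e) {
                                index.markUnderReplicated(ledgerId, entryId);
                            }
                        }
                    }
                }
            }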

          Rakesh R added a comment -

          Hi Ivan,

          Yeah, it's pretty interesting. Could you please give more details on entry log corruption and the possible cases?

          -Rakesh


            People

            • Assignee: Unassigned
            • Reporter: Ivan Kelly
            • Votes: 0
            • Watchers: 6