Description
We encountered the following "Unexpected gap in segments" error when manually synchronizing the OM DB on an OM follower that had been stopped for a few hours.
2024-11-07 21:49:32,940 [om4@group-13A745F1EB59-StateMachineUpdater] ERROR org.apache.ratis.server.impl.StateMachineUpdater: om4@group-13A745F1EB59-StateMachineUpdater caught a Throwable. java.lang.IllegalStateException: Unexpected gap in segments: binarySearch(88354999707) returns -1, segments=[log-88363996241_88364000257, log-88364000258_88364004199, log-88364004200_88364008231, log-88364008232_88364012246, log-88364012247_88364016452, log-88364016453_88364020483, log-88364020484_88364024600, log-88364024601_88364028704, log-88364028705_88364032801, log-88364032802_88364036811, log-88364036812_88364040811, log-88364040812_88364044806, log-88364044807_88364048845, log-88364048846_88364053013, log-88364053014_88364057206, log-88364057207_88364061416, log-88364061417_88364065583, log-88364065584_88364069652, log-88364069653_88364073908, log-88364073909_88364078037, log-88364078038_88364082338, log-88364082339_88364086503, log-88364086504_88364090669, log-88364090670_88364094827, log-88364094828_88364099047, log-88364099048_88364103228, log-88364103229_88364107373, log-88364107374_88364111564, log-88364111565_88364115651, log-88364115652_88364119684, log-88364119685_88364123867, log-88364123868_88364124644, log-88364124645_88364128703, log-88364128704_88364132765, log-88364132766_88364136825, log-88364136826_88364140811, log-88364140812_88364144887, log-88364144888_88364149042, log-88364149043_88364153379, log-88364153380_88364157732, log-88364157733_88364161937, log-88364161938_88364166039, log-88364166040_88364170087, log-88364170088_88364174135, log-88364174136_88364178144, log-88364178145_88364182260, log-88364182261_88364186208, log-88364186209_88364190136, log-88364190137_88364194445, log-88364194446_88364198500, log-88364198501_88364202507, log-88364202508_88364206398, log-88364206399_88364210433, log-88364210434_88364214441, log-88364214442_88364218538, log-88364218539_88364222548, log-88364222549_88364226618, log-88364226619_88364230699, 
log-88364230700_88364234762, log-88364234763_88364238784, log-88364238785_88364242687, log-88364242688_88364246625, log-88364246626_88364250581, log-88364250582_88364254520, log-88364254521_88364258544, log-88364258545_88364262687, log-88364262688_88364266687, log-88364266688_88364270677, log-88364270678_88364274675, log-88364274676_88364278687, log-88364278688_88364282796, log-88364282797_88364287134, log-88364287135_88364291229, log-88364291230_88364295199, log-88364295200_88364299138, log-88364299139_88364303033, log-88364303034_88364307192, log-88364307193_88364311099, log-88364311100_88364315135, log-88364315136_88364319072, log-88364319073_88364322884, log-88364322885_88364326897, log-88364326898_88364330876, log-88364330877_88364334809, log-88364334810_88364338728, log-88364338729_88364342864, log-88364342865_88364346842, log-88364346843_88364350811, log-88364350812_88364354727, log-88364354728_88364358758, log-88364358759_88364359500, log-88364359501_88364363662, log-88364363663_88364367743, log-88364367744_88364371709, log-88364371710_88364375763, log-88364375764_88364379715, log-88364379716_88364383734, log-88364383735_88364387563, log-88364387564_88364391573, log-88364391574_88364395627, log-88364395628_88364399634, log-88364399635_88364403770, log-88364403771_88364408068, log-88364408069_88364412129, log-88364412130_88364416145, log-88364416146_88364420177, log-88364420178_88364424190, log-88364424191_88364428162, log-88364428163_88364432284, log-88364432285_88364436218, log-88364436219_88364440288, log-88364440289_88364444352, log-88364444353_88364448196, log-88364448197_88364452189, log-88364452190_88364456120, log-88364456121_88364460132, log-88364460133_88364463990, log-88364463991_88364468111, log-88364468112_88364472158, log-88364472159_88364476323, log-88364476324_88364480303, log-88364480304_88364484414, log-88364484415_88364488460, log-88364488461_88364492577, log-88364492578_88364496658, log-88364496659_88364500681, 
log-88364500682_88364504681, log-88364504682_88364508692, log-88364508693_88364512735, log-88364512736_88364516709, log-88364516710_88364520628, log-88364520629_88364524444, log-88364524445_88364528459, log-88364528460_88364532564, log-88364532565_88364536546, log-88364536547_88364540655, log-88364540656_88364544713, log-88364544714_88364548738, log-88364548739_88364552734, log-88364552735_88364556745, log-88364556746_88364560570, log-88364560571_88364564711, log-88364564712_88364568778, log-88364568779_88364572855, log-88364572856_88364577025, log-88364577026_88364580991, log-88364580992_88364585005, log-88364585006_88364589177, log-88364589178_88364593117, log-88364593118_88364596544, log-88364596545_88364600628, log-88364600629_88364604666, log-88364604667_88364608788, log-88364608789_88364612623, log-88364612624_88364616469, log-88364616470_88364620418, log-88364620419_88364624447, log-88364624448_88364628364, log-88364628365_88364632583, log-88364632584_88364636690, log-88364636691_88364640840, log-88364640841_88364645154, log-88364645155_88364649391, log-88364649392_88364653616, log-88364653617_88364657719, log-88364657720_88364662007, log-88364662008_88364666323, log-88364666324_88364670449, log-88364670450_88364674849, log-88364674850_88364679290, log-88364679291_88364683748, log-88364683749_88364688166, log-88364688167_88364692147, log-88364692148_88364696480, log-88364696481_88364700948, log-88364700949_88364705067, log-88364705068_88364709420, log-88364709421_88364713675, log-88364713676_88364718120, log-88364718121_88364722375, log-88364722376_88364726870, log-88364726871_88364731208, log-88364731209_88364735403, log-88364735404_88364739660, log-88364739661_88364744079, log-88364744080_88364748313, log-88364748314_88364752767, log-88364752768_88364756923, log-88364756924_88364761130, log-88364761131_88364765458, log-88364765459_88364769659, log-88364769660_88364773864, log-88364773865_88364778029, log-88364778030_88364782373, 
log-88364782374_88364786843, log-88364786844_88364791187, log-88364791188_88364795576, log-88364795577_88364799757, log-88364799758_88364804091, log-88364804092_88364808438, log-88364808439_88364812735, log-88364812736_88364817053, log-88364817054_88364821337, log-88364821338_88364825482, log-88364825483_88364829678, log-88364829679_88364833850, log-88364833851_88364838114, log-88364838115_88364842299, log-88364842300_88364846583, log-88364846584_88364849925, log-88364849926_88364854127, log-88364854128_88364858268, log-88364858269_88364862345, log-88364862346_88364866641, log-88364866642_88364870877, log-88364870878_88364875147, log-88364875148_88364879433, log-88364879434_88364883886, log-88364883887_88364888223, log-88364888224_88364892556, log-88364892557_88364896921, log-88364896922_88364901295, log-88364901296_88364905640, log-88364905641_88364909861, log-88364909862_88364914097, log-88364914098_88364918297, log-88364918298_88364922609, log-88364922610_88364926902, log-88364926903_88364931383, log-88364931384_88364935609, log-88364935610_88364940046, log-88364940047_88364944407, log-88364944408_88364948542, log-88364948543_88364952764, log-88364952765_88364956959, log-88364956960_88364961303, log-88364961304_88364965492, log-88364965493_88364969682, log-88364969683_88364973850, log-88364973851_88364978007, log-88364978008_88364982280, log-88364982281_88364986516, log-88364986517_88364990776, log-88364990777_88364995029, log-88364995030_88364999288]
When synchronizing the OM follower with the OM leader, we cleaned the OM ratis and ratis-snapshot directories and used rsync to sync the OM DB (which contains the last applied index). Afterwards, we restarted the slow OM follower, which received AppendEntries from the leader instead of notifyInstallSnapshot because of the leader's purge preservation configuration. However, since the follower does not have some of the earlier log segments, the first purge triggers the "Unexpected gap in segments" error, because the purge index is earlier than the first Raft log index in the Ratis log directory.
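The failure mode can be illustrated with a minimal, self-contained sketch. It is not the Ratis implementation: the `Segment` class and this `binarySearch` are hypothetical stand-ins, and the sketch assumes the usual Java convention of returning `-(insertionPoint) - 1` for a miss. The segment boundaries and the purge index are taken from the log above.

```java
import java.util.Arrays;
import java.util.List;

public class SegmentGapDemo {
  /** Hypothetical stand-in for a closed log segment covering [start, end]. */
  static final class Segment {
    final long start;
    final long end;

    Segment(long start, long end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Returns the position of the segment containing index, or
   * -(insertionPoint) - 1 if no segment contains it (assumed convention).
   */
  static int binarySearch(List<Segment> segments, long index) {
    int lo = 0;
    int hi = segments.size() - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      Segment s = segments.get(mid);
      if (index < s.start) {
        hi = mid - 1;
      } else if (index > s.end) {
        lo = mid + 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);
  }

  public static void main(String[] args) {
    // First three segments reported in the error message.
    List<Segment> segments = Arrays.asList(
        new Segment(88363996241L, 88364000257L),
        new Segment(88364000258L, 88364004199L),
        new Segment(88364004200L, 88364008231L));

    // The purge index from the log lies before the first segment,
    // so the search misses with insertion point 0.
    System.out.println(binarySearch(segments, 88354999707L));
  }
}
```

Under these assumptions, an index that precedes every segment on disk always yields -1, which matches the value reported in the exception message.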
I suspect that this might also happen in the general case for a new Raft server when raft.server.snapshot.auto.trigger.threshold and raft.server.log.purge.gap are too small but raft.server.log.purge.preservation.log.num is very large, provided raft.server.log.purge.upto.snapshot.index is true.
A possible solution is to skip the purge when the suggested index is lower than the first segmented log index, instead of throwing an exception.
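The proposed guard could look roughly like the sketch below. It only illustrates the idea and is not the actual Ratis patch: `effectivePurgeIndex`, `suggestedIndex`, and `firstLogIndex` are hypothetical names, and the real purge path may need additional checks.

```java
import java.util.OptionalLong;

public class PurgeGuardDemo {
  /**
   * Hypothetical guard: returns the index to purge up to, or empty when the
   * suggested index is lower than the first index present in the local log,
   * in which case there is nothing to purge and no exception is needed.
   */
  static OptionalLong effectivePurgeIndex(long suggestedIndex, long firstLogIndex) {
    if (suggestedIndex < firstLogIndex) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(suggestedIndex);
  }

  public static void main(String[] args) {
    // Values from the reported error: the purge index precedes the first segment.
    OptionalLong skipped = effectivePurgeIndex(88354999707L, 88363996241L);
    System.out.println(skipped.isPresent());

    // A suggested index inside the local log is passed through unchanged.
    OptionalLong kept = effectivePurgeIndex(88364000300L, 88363996241L);
    System.out.println(kept.isPresent());
  }
}
```

Returning an empty result rather than throwing lets the caller treat the request as a no-op, so a follower whose log starts after the suggested purge index would keep running.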
Attachments
Issue Links
- relates to RATIS-2056 IllegalStateException: Unexpected gap in segments: binarySearch (Resolved)