  1. Ratis
  2. RATIS-2186

Raft log purge preservation might purge log index that does not exist


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0
    • Component/s: server
    • Labels: None

    Description

      We encountered the following "Unexpected gap in segments" error when manually synchronizing the OM DB on an OM follower that had been stopped for a few hours.

      2024-11-07 21:49:32,940 [om4@group-13A745F1EB59-StateMachineUpdater] ERROR org.apache.ratis.server.impl.StateMachineUpdater: om4@group-13A745F1EB59-StateMachineUpdater caught a Throwable.
      java.lang.IllegalStateException: Unexpected gap in segments: binarySearch(88354999707) returns -1, segments=[log-88363996241_88364000257, log-88364000258_88364004199, log-88364004200_88364008231, log-88364008232_88364012246, log-88364012247_88364016452, log-88364016453_88364020483, log-88364020484_88364024600, log-88364024601_88364028704, log-88364028705_88364032801, log-88364032802_88364036811, log-88364036812_88364040811, log-88364040812_88364044806, log-88364044807_88364048845, log-88364048846_88364053013, log-88364053014_88364057206, log-88364057207_88364061416, log-88364061417_88364065583, log-88364065584_88364069652, log-88364069653_88364073908, log-88364073909_88364078037, log-88364078038_88364082338, log-88364082339_88364086503, log-88364086504_88364090669, log-88364090670_88364094827, log-88364094828_88364099047, log-88364099048_88364103228, log-88364103229_88364107373, log-88364107374_88364111564, log-88364111565_88364115651, log-88364115652_88364119684, log-88364119685_88364123867, log-88364123868_88364124644, log-88364124645_88364128703, log-88364128704_88364132765, log-88364132766_88364136825, log-88364136826_88364140811, log-88364140812_88364144887, log-88364144888_88364149042, log-88364149043_88364153379, log-88364153380_88364157732, log-88364157733_88364161937, log-88364161938_88364166039, log-88364166040_88364170087, log-88364170088_88364174135, log-88364174136_88364178144, log-88364178145_88364182260, log-88364182261_88364186208, log-88364186209_88364190136, log-88364190137_88364194445, log-88364194446_88364198500, log-88364198501_88364202507, log-88364202508_88364206398, log-88364206399_88364210433, log-88364210434_88364214441, log-88364214442_88364218538, log-88364218539_88364222548, log-88364222549_88364226618, log-88364226619_88364230699, log-88364230700_88364234762, log-88364234763_88364238784, log-88364238785_88364242687, log-88364242688_88364246625, log-88364246626_88364250581, log-88364250582_88364254520, log-88364254521_88364258544, 
log-88364258545_88364262687, log-88364262688_88364266687, log-88364266688_88364270677, log-88364270678_88364274675, log-88364274676_88364278687, log-88364278688_88364282796, log-88364282797_88364287134, log-88364287135_88364291229, log-88364291230_88364295199, log-88364295200_88364299138, log-88364299139_88364303033, log-88364303034_88364307192, log-88364307193_88364311099, log-88364311100_88364315135, log-88364315136_88364319072, log-88364319073_88364322884, log-88364322885_88364326897, log-88364326898_88364330876, log-88364330877_88364334809, log-88364334810_88364338728, log-88364338729_88364342864, log-88364342865_88364346842, log-88364346843_88364350811, log-88364350812_88364354727, log-88364354728_88364358758, log-88364358759_88364359500, log-88364359501_88364363662, log-88364363663_88364367743, log-88364367744_88364371709, log-88364371710_88364375763, log-88364375764_88364379715, log-88364379716_88364383734, log-88364383735_88364387563, log-88364387564_88364391573, log-88364391574_88364395627, log-88364395628_88364399634, log-88364399635_88364403770, log-88364403771_88364408068, log-88364408069_88364412129, log-88364412130_88364416145, log-88364416146_88364420177, log-88364420178_88364424190, log-88364424191_88364428162, log-88364428163_88364432284, log-88364432285_88364436218, log-88364436219_88364440288, log-88364440289_88364444352, log-88364444353_88364448196, log-88364448197_88364452189, log-88364452190_88364456120, log-88364456121_88364460132, log-88364460133_88364463990, log-88364463991_88364468111, log-88364468112_88364472158, log-88364472159_88364476323, log-88364476324_88364480303, log-88364480304_88364484414, log-88364484415_88364488460, log-88364488461_88364492577, log-88364492578_88364496658, log-88364496659_88364500681, log-88364500682_88364504681, log-88364504682_88364508692, log-88364508693_88364512735, log-88364512736_88364516709, log-88364516710_88364520628, log-88364520629_88364524444, log-88364524445_88364528459, 
log-88364528460_88364532564, log-88364532565_88364536546, log-88364536547_88364540655, log-88364540656_88364544713, log-88364544714_88364548738, log-88364548739_88364552734, log-88364552735_88364556745, log-88364556746_88364560570, log-88364560571_88364564711, log-88364564712_88364568778, log-88364568779_88364572855, log-88364572856_88364577025, log-88364577026_88364580991, log-88364580992_88364585005, log-88364585006_88364589177, log-88364589178_88364593117, log-88364593118_88364596544, log-88364596545_88364600628, log-88364600629_88364604666, log-88364604667_88364608788, log-88364608789_88364612623, log-88364612624_88364616469, log-88364616470_88364620418, log-88364620419_88364624447, log-88364624448_88364628364, log-88364628365_88364632583, log-88364632584_88364636690, log-88364636691_88364640840, log-88364640841_88364645154, log-88364645155_88364649391, log-88364649392_88364653616, log-88364653617_88364657719, log-88364657720_88364662007, log-88364662008_88364666323, log-88364666324_88364670449, log-88364670450_88364674849, log-88364674850_88364679290, log-88364679291_88364683748, log-88364683749_88364688166, log-88364688167_88364692147, log-88364692148_88364696480, log-88364696481_88364700948, log-88364700949_88364705067, log-88364705068_88364709420, log-88364709421_88364713675, log-88364713676_88364718120, log-88364718121_88364722375, log-88364722376_88364726870, log-88364726871_88364731208, log-88364731209_88364735403, log-88364735404_88364739660, log-88364739661_88364744079, log-88364744080_88364748313, log-88364748314_88364752767, log-88364752768_88364756923, log-88364756924_88364761130, log-88364761131_88364765458, log-88364765459_88364769659, log-88364769660_88364773864, log-88364773865_88364778029, log-88364778030_88364782373, log-88364782374_88364786843, log-88364786844_88364791187, log-88364791188_88364795576, log-88364795577_88364799757, log-88364799758_88364804091, log-88364804092_88364808438, log-88364808439_88364812735, 
log-88364812736_88364817053, log-88364817054_88364821337, log-88364821338_88364825482, log-88364825483_88364829678, log-88364829679_88364833850, log-88364833851_88364838114, log-88364838115_88364842299, log-88364842300_88364846583, log-88364846584_88364849925, log-88364849926_88364854127, log-88364854128_88364858268, log-88364858269_88364862345, log-88364862346_88364866641, log-88364866642_88364870877, log-88364870878_88364875147, log-88364875148_88364879433, log-88364879434_88364883886, log-88364883887_88364888223, log-88364888224_88364892556, log-88364892557_88364896921, log-88364896922_88364901295, log-88364901296_88364905640, log-88364905641_88364909861, log-88364909862_88364914097, log-88364914098_88364918297, log-88364918298_88364922609, log-88364922610_88364926902, log-88364926903_88364931383, log-88364931384_88364935609, log-88364935610_88364940046, log-88364940047_88364944407, log-88364944408_88364948542, log-88364948543_88364952764, log-88364952765_88364956959, log-88364956960_88364961303, log-88364961304_88364965492, log-88364965493_88364969682, log-88364969683_88364973850, log-88364973851_88364978007, log-88364978008_88364982280, log-88364982281_88364986516, log-88364986517_88364990776, log-88364990777_88364995029, log-88364995030_88364999288] 

      When synchronizing the OM follower with the OM leader, we cleaned the OM ratis and ratis-snapshot directories and used rsync to sync the OM DB (which contains the last applied index). Afterwards, we restarted the slow OM follower, which then received AppendEntries from the leader instead of notifyInstallSnapshot because of the leader's purge preservation configuration. However, since the follower did not have some of the earlier log segments, the first purge triggered the "Unexpected gap in segments" error, because the purge index was lower than the first Raft log index in the Ratis log directory.
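
      To illustrate why the lookup in the stack trace returns -1, here is a minimal sketch that is independent of the Ratis code base. The class SegmentGapDemo, the Segment type, and the binarySearch helper are hypothetical names; only the index values are taken from the log above. It assumes the usual convention of returning -(insertionPoint) - 1 when no segment contains the index, so an index below the first segment yields -1.

```java
import java.util.List;

public class SegmentGapDemo {
    /** Closed index range [start, end] covered by one log segment file (hypothetical type). */
    static final class Segment {
        final long start;
        final long end;

        Segment(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Binary search for the segment that contains the given index.
     * Returns its position if found, otherwise -(insertionPoint) - 1,
     * the same convention as java.util.Collections.binarySearch.
     */
    static int binarySearch(List<Segment> segments, long index) {
        int lo = 0;
        int hi = segments.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Segment s = segments.get(mid);
            if (index < s.start) {
                hi = mid - 1;
            } else if (index > s.end) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }

    public static void main(String[] args) {
        // First three segments from the error message above.
        List<Segment> segments = List.of(
            new Segment(88363996241L, 88364000257L),
            new Segment(88364000258L, 88364004199L),
            new Segment(88364004200L, 88364008231L));

        // The purge index precedes every segment: insertion point 0, result -1.
        System.out.println(binarySearch(segments, 88354999707L));
        // An index inside the second segment is found at position 1.
        System.out.println(binarySearch(segments, 88364000300L));
    }
}
```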

      I suspect that this might also happen in the general case for a new Raft server when raft.server.snapshot.auto.trigger.threshold and raft.server.log.purge.gap are too small while raft.server.log.purge.preservation.log.num is very large, provided raft.server.log.purge.upto.snapshot.index is true.
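
      For concreteness, a configuration of the shape described above might look like the following sketch. The property keys are the ones named in this report; every value is a made-up placeholder, and java.util.Properties stands in for the server's own configuration object. Whether such a combination actually reproduces the error is only a suspicion, not something verified here.

```java
import java.util.Properties;

public class PurgeConfigSketch {
    /** Builds the suspected problematic combination; all values are illustrative assumptions. */
    static Properties suspectedConfig() {
        Properties p = new Properties();
        // Small thresholds: snapshots and purges are triggered frequently.
        p.setProperty("raft.server.snapshot.auto.trigger.threshold", "1000");
        p.setProperty("raft.server.log.purge.gap", "100");
        // Very large preservation window relative to the thresholds above.
        p.setProperty("raft.server.log.purge.preservation.log.num", "10000000");
        // Precondition named in the report.
        p.setProperty("raft.server.log.purge.upto.snapshot.index", "true");
        return p;
    }

    public static void main(String[] args) {
        suspectedConfig().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```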

      A possible solution is to skip the purge when the suggested index is lower than the first segmented log index, instead of throwing an exception.
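
      The proposed guard can be sketched as follows. This is not the attached patch or the actual Ratis implementation; PurgeGuardSketch, effectivePurgeIndex, and the NO_PURGE sentinel are hypothetical names used only to show the idea of turning the out-of-range case into a no-op.

```java
public class PurgeGuardSketch {
    /** Hypothetical sentinel meaning "nothing to purge". */
    static final long NO_PURGE = -1L;

    /**
     * Returns the index to purge up to, or NO_PURGE when the suggested index
     * is lower than the start index of the first segment on disk.
     */
    static long effectivePurgeIndex(long suggestedIndex, long firstSegmentStartIndex) {
        if (suggestedIndex < firstSegmentStartIndex) {
            // The suggested index does not exist locally: skip instead of throwing.
            return NO_PURGE;
        }
        return suggestedIndex;
    }

    public static void main(String[] args) {
        long firstStart = 88363996241L; // start of the first segment in the log above

        // The purge index from the error message precedes the first segment.
        System.out.println(effectivePurgeIndex(88354999707L, firstStart));
        // An index that exists locally is passed through unchanged.
        System.out.println(effectivePurgeIndex(88364000300L, firstStart));
    }
}
```

      With the values from the error above, the first call yields the sentinel, so the caller would skip the purge rather than fail in the StateMachineUpdater thread.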

      Attachments

        1. 1175_review.patch
          6 kB
          Tsz-wo Sze

            People

              Assignee: Ivan Andika (ivanandika)
              Reporter: Ivan Andika (ivanandika)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m