Kudu / KUDU-2792

Automatically retry failed bootstrap on tablets that failed to start due to disk space

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.0
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None

      Description

      If a tablet replica fails to bootstrap because there is not enough disk space to replay the WAL, it stays in the FAILED state even after the user frees up disk space. In ksck, the replica looks like this:

       

      5edf82f0516b4897b3a7991a7e67d71c (host1.example.com:7050): not running [LEADER]
       State: FAILED
       Data state: TABLET_DATA_READY
       Last status: IO error: Failed log replay. Reason: Failed to open new log: Insufficient disk space to allocate 8388608 bytes under path /data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/.kudutmp.newsegmentzGFKEg (7939936256 bytes available vs 19993874923 bytes reserved) (error 28)
      

      Today, the only way to recover from this state is to restart the tablet server.
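
The Last status line above carries enough information to tell that the failure is a disk space problem and to estimate how much space must be freed. The following is a minimal sketch in Python, not Kudu code. The regular expression and the function names are hypothetical. It also assumes that the check passes once `available - requested >= reserved`, which is inferred from the wording of the message rather than confirmed here.

```python
import errno
import re

# Status text copied verbatim from the ksck output above.
STATUS = (
    "IO error: Failed log replay. Reason: Failed to open new log: "
    "Insufficient disk space to allocate 8388608 bytes under path "
    "/data/1/kudu/tablet/wal/wals/5807c5100e0d4522a66e32efbb29d57e/"
    ".kudutmp.newsegmentzGFKEg "
    "(7939936256 bytes available vs 19993874923 bytes reserved) (error 28)"
)

# Hypothetical pattern written against the message shown in this ticket.
PATTERN = re.compile(
    r"Insufficient disk space to allocate (?P<requested>\d+) bytes "
    r"under path (?P<path>\S+) "
    r"\((?P<available>\d+) bytes available vs (?P<reserved>\d+) bytes reserved\) "
    r"\(error (?P<errno>\d+)\)"
)


def parse_disk_space_failure(status):
    """Return the numeric fields of the message, or None if it does not match."""
    match = PATTERN.search(status)
    if match is None:
        return None
    fields = {k: int(match.group(k))
              for k in ("requested", "available", "reserved", "errno")}
    fields["path"] = match.group("path")
    return fields


def is_out_of_space(failure):
    """True if the reported error code is ENOSPC (28 on Linux)."""
    return failure is not None and failure["errno"] == errno.ENOSPC


def bytes_to_free(failure):
    """Shortfall under the ASSUMED rule: available - requested >= reserved."""
    return max(0, failure["reserved"] + failure["requested"] - failure["available"])


failure = parse_disk_space_failure(STATUS)
print(is_out_of_space(failure))  # True on Linux
print(bytes_to_free(failure))    # 12062327275
```

Under that assumption, roughly 12 GB would have to be freed on the WAL volume before a retried bootstrap could allocate the new 8 MiB segment.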

      It should be possible for the tablet server (i.e. the TSTabletManager) to detect that the failure was temporary rather than permanent, and to retry the failed bootstrap later, once additional disk space has been freed. Implementing this may require dealing with some object lifecycle issues (i.e. not reusing the Tablet object from the failed bootstrap).
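
The proposal above could take roughly the following shape. This is a language-neutral sketch written in Python, not Kudu's C++ implementation. `Replica`, `Tablet`, `bootstrap_once`, and `retry_failed` are hypothetical names, and treating only ENOSPC as temporary is an assumption made for illustration.

```python
import errno
from dataclasses import dataclass
from typing import Callable, List, Optional


class Tablet:
    """Hypothetical stand-in for the in-memory object built by one bootstrap attempt."""

    def __init__(self, tablet_id: str) -> None:
        self.tablet_id = tablet_id


@dataclass
class Replica:
    """Hypothetical per-replica bookkeeping, loosely modeled on the ksck fields."""
    tablet_id: str
    state: str = "INITIALIZED"
    last_status: str = ""
    retryable: bool = False
    tablet: Optional[Tablet] = None
    attempts: int = 0


def bootstrap_once(replica: Replica, bootstrap: Callable[[Tablet], None]) -> None:
    """Run a single bootstrap attempt against a brand-new Tablet object."""
    replica.attempts += 1
    tablet = Tablet(replica.tablet_id)  # never reuse the object from a failed attempt
    try:
        bootstrap(tablet)
    except OSError as err:
        replica.state = "FAILED"
        replica.last_status = f"IO error: {err.strerror} (error {err.errno})"
        # Assumption: only out-of-space failures count as temporary.
        replica.retryable = err.errno == errno.ENOSPC
        replica.tablet = None  # drop the half-initialized object
        return
    replica.state = "RUNNING"
    replica.last_status = ""
    replica.retryable = False
    replica.tablet = tablet


def retry_failed(replicas: List[Replica], bootstrap: Callable[[Tablet], None]) -> int:
    """One pass of a periodic task; returns how many replicas were retried."""
    retried = 0
    for replica in replicas:
        if replica.state == "FAILED" and replica.retryable:
            bootstrap_once(replica, bootstrap)
            retried += 1
    return retried
```

Because each attempt builds its own `Tablet`, a failed attempt leaves nothing behind for the next one to inherit. Replicas that failed for any other reason are not marked retryable and stay FAILED until an operator intervenes.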

            People

            • Assignee: Unassigned
            • Reporter: Mike Percy (mpercy)