KUDU-2040: Coordinate data dir lifecycle with DataDirGroups



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fs, tserver
    • Labels:


      When a tablet is created, its DataDirGroup avoids directories that are full and directories that have failed. This can produce groups with fewer directories than the flag-specified target. This isn't necessarily an error, but if the disks later return to a healthy state, there is no way to resize an undersized group.
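
The selection behavior described above can be illustrated with a minimal sketch. This is not Kudu's actual code: `DirState`, `DataDir`, `create_group`, and `target_size` are hypothetical names, and taking the first healthy directories is an arbitrary choice made for brevity. It only shows how skipping full and failed directories can leave a group below its target size.

```python
from dataclasses import dataclass
from enum import Enum


class DirState(Enum):
    # Hypothetical states; the ticket only distinguishes healthy, full, and failed.
    HEALTHY = "healthy"
    FULL = "full"
    FAILED = "failed"


@dataclass(frozen=True)
class DataDir:
    path: str
    state: DirState


def create_group(dirs, target_size):
    """Pick up to target_size healthy dirs, skipping full and failed ones."""
    usable = [d for d in dirs if d.state is DirState.HEALTHY]
    # If fewer than target_size dirs are usable, the group is simply smaller.
    return usable[:target_size]


dirs = [
    DataDir("/data/1", DirState.HEALTHY),
    DataDir("/data/2", DirState.FULL),
    DataDir("/data/3", DirState.FAILED),
    DataDir("/data/4", DirState.HEALTHY),
]
group = create_group(dirs, target_size=3)
print([d.path for d in group])  # two dirs qualify, so the group is undersized
```

With a target of three and only two healthy directories, the resulting group has two members and stays that size even if `/data/2` later frees up space.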

      The implementation assumes that these states are permanent, which isn't necessarily the case. A full disk may have tablets removed, freeing space; once Kudu supports disk refreshes, disk failure will also become transient. As such, it's worth considering whether, when, and how undersized DataDirGroups should be resized.

      A few notes on this:

      • once a disk group has been created, the tablet's data will be spread across the disks in that group, so completely changing the group will require that the tablet's data is rewritten
      • another approach might be to replicate the understriped tablet (either on the same server or elsewhere) in hopes that more disks are available
      • as of this writing, recovery from a disk failure is not implemented, so disk failure is not currently considered transient (this will change once recovery is implemented)
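
One way to approach the "when" part of the question is a periodic check that finds undersized groups for which healthy directories are now available. The sketch below is only an illustration under stated assumptions: `resize_candidates`, the string states, and the dictionary layout are hypothetical and do not correspond to any existing Kudu API. It identifies candidates only; as noted above, actually changing a group would require rewriting the tablet's data.

```python
def resize_candidates(groups, dir_states, target_size):
    """Return {tablet_id: [dirs that could be added]} for undersized groups.

    groups:     hypothetical mapping of tablet id -> list of dir paths in its group
    dir_states: hypothetical mapping of dir path -> "healthy", "full", or "failed"
    """
    healthy = [path for path, state in dir_states.items() if state == "healthy"]
    candidates = {}
    for tablet_id, group in groups.items():
        missing = target_size - len(group)
        if missing <= 0:
            continue  # group already meets the target
        # Only healthy dirs that are not already in the group can be added.
        spare = [path for path in healthy if path not in group]
        if spare:
            candidates[tablet_id] = spare[:missing]
    return candidates


# /data/2 was full when tablet-a was created but has since freed space.
dir_states = {
    "/data/1": "healthy",
    "/data/2": "healthy",
    "/data/3": "failed",
    "/data/4": "healthy",
}
groups = {
    "tablet-a": ["/data/1", "/data/4"],
    "tablet-b": ["/data/1", "/data/2", "/data/4"],
}
candidates = resize_candidates(groups, dir_states, target_size=3)
print(candidates)
```

Here only `tablet-a` is reported, with `/data/2` as the directory that could be added. The failed directory stays excluded, which matches the current treatment of disk failure as non-transient.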




            • Assignee:
              awong Andrew Wong


              • Created: