HBASE-2575

Fault scenario of dead root drive on RS causes cluster lockup

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

      Description

      We performed a fault test where we physically pulled the root drive out of a machine while it was running. The regionserver continued to run fine with existing clients, but any new clients that tried to connect to it for RPC would not work correctly. So when I started a new client, that client made no progress. Despite this, the RS continued to happily heartbeat to the master, so the master did not remove it from the cluster. Note that in this case we were logging to NFS, and the logs continued to be written, but no exceptions were shown.

        Activity

        Todd Lipcon created issue
        Benoit Sigoure added a comment

        I believe we ran into a similar problem at StumbleUpon, where the filesystem of one of the region servers sort of got into a wedged state. Any idea as to what could be causing this or how to fix it? Any idea on how to reproduce the problem easily (that is, without physically pulling hard drives out)?

        Todd Lipcon added a comment

        My thought to reproduce is something like this:

        1. dd if=/dev/zero of=myimage bs=1M count=1000
        2. losetup -f myimage
        3. mdadm --create /dev/md0 --level=faulty --raid-devices=1 /dev/loop1
        4. mkfs.ext3 /dev/md0
        5. mkdir /myhbase-disk
        6. mount /dev/md0 /myhbase-disk
        7. cp -a $HBASE_HOME /myhbase-disk
        8. start regionserver over there
        9. mdadm --grow /dev/md0 -l faulty -p read-persistent
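
        The steps above can be sketched as one script. This is a hypothetical consolidation: it must run as root, it destroys data on the loop device, and it assumes losetup attaches the image to /dev/loop1 as in the steps; DRY_RUN=1 (the default here) only prints each command.

        ```shell
        #!/bin/sh
        # Sketch of the repro: build a filesystem on an md "faulty" device,
        # run the regionserver from it, then inject persistent read errors.
        run() {
          if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi
        }

        repro() {
          run dd if=/dev/zero of=myimage bs=1M count=1000
          run losetup -f myimage     # assumed to land on /dev/loop1
          run mdadm --create /dev/md0 --level=faulty --raid-devices=1 /dev/loop1
          run mkfs.ext3 /dev/md0
          run mkdir -p /myhbase-disk
          run mount /dev/md0 /myhbase-disk
          run cp -a "${HBASE_HOME:-/opt/hbase}" /myhbase-disk
          # ...start the regionserver from /myhbase-disk, then flip the md
          # faulty personality so every read from the "root drive" fails:
          run mdadm --grow /dev/md0 -l faulty -p read-persistent
        }

        out=$(repro)
        echo "$out"
        ```

        The md faulty personality is what makes this reproducible in software: read-persistent makes reads fail at the block layer without touching hardware, which approximates a dead root drive.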

          People

          • Assignee: Unassigned
          • Reporter: Todd Lipcon
          • Votes: 1
          • Watchers: 5
