  1. HBase
  2. HBASE-3604

Two region servers think that they own the same region: data loss

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: regionserver
    • Labels: None

    Description

      I observed this on a 100 node cluster that is constantly doing about 500K ops/second.

      The region server on machine A was servicing IOs for a particular region. Then the machine went into a bad state in which it was pingable but could not be reached over ssh. The master detected that there was a problem with machine A and reassigned the region to machine B. The regionserver on machine B opened the region and opened all the required HFiles for this region. Two hours later, the NameNode received a delete request from machine A for one of those HFiles and happily renamed the file into the HDFS Trash. About 3 hours after that, the regionserver on machine B tried to read from that HFile but failed because the file had been renamed earlier. The region server on B is now stuck, and data may have been lost.

      The problem stems from the fact that although the master and ZooKeeper reassigned the region, the old regionserver was possibly not dead.
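
      The sequence above can be reproduced in a toy model. This is only an illustrative sketch, not HBase or HDFS code: `ToyFileSystem`, `ToyRegionServer`, `run_scenario`, and the `epoch` counter are all hypothetical names invented here. The `enforce_fencing` option shows the general fencing-token idea (reject requests from an owner whose epoch is stale); it is an assumption for illustration and is not claimed to be how this issue was resolved.

```python
class ToyFileSystem:
    """Stand-in for the NameNode: tracks files and, optionally, a fencing epoch."""

    def __init__(self, enforce_fencing=False):
        self.files = {"hfile-1": b"rows"}
        self.trash = {}
        self.enforce_fencing = enforce_fencing
        self.current_epoch = 0  # bumped on every region reassignment

    def delete(self, name, epoch):
        # With fencing enabled, requests carrying an older epoch are rejected.
        if self.enforce_fencing and epoch < self.current_epoch:
            return False
        self.trash[name] = self.files.pop(name)  # models "rename into Trash"
        return True

    def read(self, name):
        if name not in self.files:
            raise FileNotFoundError(name)
        return self.files[name]


class ToyRegionServer:
    def __init__(self, name, fs):
        self.name = name
        self.fs = fs
        self.epoch = None  # epoch at which this server was given the region

    def open_region(self, epoch):
        self.epoch = epoch

    def delete_hfile(self, name):
        return self.fs.delete(name, self.epoch)

    def read_hfile(self, name):
        return self.fs.read(name)


def run_scenario(enforce_fencing):
    fs = ToyFileSystem(enforce_fencing)
    a = ToyRegionServer("A", fs)
    b = ToyRegionServer("B", fs)

    a.open_region(fs.current_epoch)  # A owns the region
    fs.current_epoch += 1            # master reassigns; A is hung, not dead
    b.open_region(fs.current_epoch)  # B now owns the region

    a.delete_hfile("hfile-1")        # stale delete from A arrives later
    try:
        b.read_hfile("hfile-1")
        return "read ok"
    except FileNotFoundError:
        return "read failed"


print(run_scenario(enforce_fencing=False))  # read failed
print(run_scenario(enforce_fencing=True))   # read ok
```

      Without the epoch check, the late delete from A succeeds and B's read fails, which matches the report. With the check, the stale request is refused and B can still read the file.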

            People

              Assignee: Unassigned
              Reporter: Dhruba Borthakur (dhruba)
              Votes: 0
              Watchers: 3
