  1. HBase
  2. HBASE-27871

Meta replication gets stuck forever if the WAL it's still reading gets rolled and deleted


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 2.4.16, 2.4.17, 2.5.4
    • Fix Version/s: 2.6.0, 2.4.18, 2.5.6
    • Component/s: meta replicas
    • Labels: None

    Description

      This affects branch-2 based releases only (in master, HBASE-26416 refactored region replication to not rely on the replication framework anymore).

      Per the original meta region replicas design, we use most of the replication framework to communicate changes in the primary replica to the secondary ones, but we skip storing the queue state in ZK. In the event of a region replication crash, we should let the related replication source thread be interrupted, so that RegionReplicaReplicationEndpoint sets up a new source from scratch and makes sure the secondary replicas get updated.
       
      We have run into a situation in one of our customers' clusters where the region replica source faced a long lag (probably because the RSes hosting the secondary replicas were busy and slow in processing the region replication entries), so the current WAL got rolled and eventually deleted while the replication source reader was still referring to it. In such cases, ReplicationSourceReader only sees the IOException and keeps retrying the read indefinitely, but since the file is now gone, it just gets stuck there forever. In the particular case of a FileNotFoundException (FNFE), which I believe would only happen for region replication, we should just raise an exception and let RegionReplicaReplicationEndpoint handle it to reset the region replication source.
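To illustrate the proposed behaviour, here is a minimal, self-contained sketch. It is not the actual HBase patch: `WalReader`, `WalGoneException` and `readWithRetry` are hypothetical names made up for this example, and the retry loop is bounded only so that the sketch terminates (the real reader retries indefinitely). The point is the split between transient I/O errors, which are worth retrying, and a missing file, which is not.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

class WalReadRetrySketch {

  /** Hypothetical signal that the source must be reset; not an HBase class. */
  static class WalGoneException extends RuntimeException {
    WalGoneException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Hypothetical stand-in for whatever reads the next batch of WAL entries. */
  interface WalReader {
    byte[] readNext() throws IOException;
  }

  /**
   * Retries transient I/O failures, but escalates immediately when the file
   * is missing, since no amount of retrying brings a deleted WAL back.
   */
  static byte[] readWithRetry(WalReader reader, int maxAttempts, long sleepMs)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return reader.readNext();
      } catch (FileNotFoundException e) {
        // The WAL was rolled and deleted: let the caller reset the source.
        throw new WalGoneException("WAL no longer exists", e);
      } catch (IOException e) {
        // Possibly transient: remember it, back off, and try again.
        last = e;
        Thread.sleep(sleepMs);
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    WalReader deleted = () -> {
      calls[0]++;
      throw new FileNotFoundException("wal file (hypothetical) was deleted");
    };
    try {
      readWithRetry(deleted, 5, 1L);
    } catch (WalGoneException e) {
      // In the scenario described above, the endpoint would reset the source here.
      System.out.println("escalated after " + calls[0] + " attempt(s)");
    }
  }
}
```

In this sketch the caller that catches `WalGoneException` plays the role the description assigns to RegionReplicaReplicationEndpoint: it is the place where a fresh source would be set up.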
       
       


          People

            Assignee: wchevreuil Wellington Chevreuil
            Reporter: wchevreuil Wellington Chevreuil