HBase / HBASE-25596

Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

Details

    • Reviewed

    Description

      There seems to be a major issue with how we handle an EOFException from WALEntryStream.

      Problem:

      When we see an EOFException, we handle it by removing the affected WAL from the log queue, but we never try to ship the existing batch of entries. Those entries are never replicated, which is permanent data loss in replication.
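      The failure mode can be illustrated with a toy model. This is a minimal sketch, not HBase code: `FakeWal`, `readInto`, `replicate`, `sampleQueue`, and the `shipBatchOnEof` flag are all hypothetical names invented for illustration, and the real reader and shipper are considerably more involved.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class EofBatchSketch {

    /** Hypothetical stand-in for a WAL file: readable entries, optionally truncated at the end. */
    static final class FakeWal {
        final List<String> entries;
        final boolean truncated;

        FakeWal(List<String> entries, boolean truncated) {
            this.entries = entries;
            this.truncated = truncated;
        }
    }

    /** Appends the readable entries to the batch, then fails if the WAL is truncated. */
    static void readInto(FakeWal wal, List<String> batch) throws EOFException {
        batch.addAll(wal.entries);
        if (wal.truncated) {
            throw new EOFException("unexpected end of WAL");
        }
    }

    /** Drains the queue and returns every entry that was shipped. */
    static List<String> replicate(Deque<FakeWal> queue, boolean shipBatchOnEof) {
        List<String> shipped = new ArrayList<>();
        while (!queue.isEmpty()) {
            FakeWal wal = queue.peek();
            List<String> batch = new ArrayList<>();
            try {
                readInto(wal, batch);
                shipped.addAll(batch);
            } catch (EOFException e) {
                if (shipBatchOnEof) {
                    // Ship what was read before giving up on this WAL.
                    shipped.addAll(batch);
                }
                // With shipBatchOnEof == false the batch is silently dropped here.
            }
            // The WAL leaves the queue in both cases.
            queue.remove();
        }
        return shipped;
    }

    /** Builds a queue with one truncated WAL followed by one healthy WAL. */
    static Deque<FakeWal> sampleQueue() {
        Deque<FakeWal> queue = new ArrayDeque<>();
        queue.add(new FakeWal(Arrays.asList("a", "b"), true));
        queue.add(new FakeWal(Arrays.asList("c"), false));
        return queue;
    }

    public static void main(String[] args) {
        // Entries "a" and "b" are lost when the batch is not shipped on EOF.
        System.out.println(replicate(sampleQueue(), false));
        // All three entries are shipped when the batch is flushed first.
        System.out.println(replicate(sampleQueue(), true));
    }
}
```

      In this model, dropping the WAL from the queue without flushing the batch loses the entries for good, because nothing will ever read that WAL again.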

       

      Secondly, we do not stop the reader on encountering the EOFException. If the EOFException occurred on the last WAL, we still try to process the WALEntryStream and ship an empty batch with lastWALPath set to null. This causes the NPE below, which crashes the region server.

      2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
      java.lang.NullPointerException
          at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
          at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
          at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
          at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
      2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
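      The null-path crash can be modeled in a few lines. This is a sketch under stated assumptions, not the actual change made for this issue: `Batch`, `recordPosition`, `shipUnguarded`, and `shipGuarded` are hypothetical names, and a null check is only one possible way to avoid updating the log position for an empty batch.

```java
public class ShipGuardSketch {

    /** Hypothetical batch: number of entries plus the last WAL path read (null if none). */
    static final class Batch {
        final int entryCount;
        final String lastWalPath;

        Batch(int entryCount, String lastWalPath) {
            this.entryCount = entryCount;
            this.lastWalPath = lastWalPath;
        }
    }

    /** Stand-in for position bookkeeping; it dereferences the path, so null fails here. */
    static int recordPosition(String walPath) {
        return walPath.length();
    }

    /** Unguarded variant: always records the position, even for an empty batch. */
    static boolean shipUnguarded(Batch batch) {
        recordPosition(batch.lastWalPath);
        return true;
    }

    /** Guarded variant: skips batches that carry no WAL path; returns true if shipped. */
    static boolean shipGuarded(Batch batch) {
        if (batch.lastWalPath == null) {
            return false;
        }
        recordPosition(batch.lastWalPath);
        return true;
    }

    public static void main(String[] args) {
        Batch empty = new Batch(0, null);
        // The guarded path declines to ship instead of throwing.
        System.out.println("guarded shipped: " + shipGuarded(empty));
        try {
            shipUnguarded(empty);
        } catch (NullPointerException e) {
            // Mirrors the crash in the log above, in miniature.
            System.out.println("unguarded threw NullPointerException");
        }
    }
}
```

      Stopping the reader once the last WAL hits the EOFException would avoid producing the empty batch in the first place; the guard above only shows why a null path must not reach the position update.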
      

       

       

          People

            sandeep.pal Sandeep Pal