HBASE-18137 (HBase)

Replication gets stuck for empty WALs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.4.0, 1.3.2, 2.0.0, 1.2.7
    • Component/s: Replication
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      0-length WAL files can potentially cause the replication queue to get stuck. A new config "replication.source.eof.autorecovery" has been added: if set to true (default is false), the 0-length WAL file will be skipped after 1) the max number of retries has been hit, and 2) there are more WAL files in the queue. The risk of enabling this is that there is a chance the 0-length WAL file actually has some data (e.g. block went missing and will come back once a datanode is recovered).
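
The two skip conditions in the release note can be sketched as a single predicate. This is a minimal illustration in plain Java, not the HBase implementation: the class, method, and parameter names are hypothetical, and only the config key "replication.source.eof.autorecovery" and the two conditions come from the note above.

```java
/** Illustrative sketch only; every name here is hypothetical, not HBase API. */
final class EofAutoRecoverySketch {

    /**
     * Decides whether a WAL may be skipped, following the release note:
     * the flag must be enabled, the file must be 0-length, the retry
     * budget must be used up, and other WALs must be waiting in the queue.
     *
     * @param autoRecovery value of "replication.source.eof.autorecovery" (default false)
     * @param walLength    reported length of the current WAL in bytes
     * @param attempts     number of failed open attempts so far
     * @param maxRetries   assumed retry limit (hypothetical parameter)
     * @param queueSize    number of WALs in the queue, including the current one
     */
    static boolean shouldSkipEmptyWal(boolean autoRecovery, long walLength,
                                      int attempts, int maxRetries, int queueSize) {
        return autoRecovery
            && walLength == 0
            && attempts >= maxRetries
            && queueSize > 1; // more WAL files remain behind the current one
    }

    public static void main(String[] args) {
        // Flag on, empty file, retries exhausted, two more WALs queued.
        System.out.println(shouldSkipEmptyWal(true, 0L, 3, 3, 3));
        // Same situation with the default setting (flag off): never skip.
        System.out.println(shouldSkipEmptyWal(false, 0L, 3, 3, 3));
        // Empty file is the last one in the queue: keep retrying.
        System.out.println(shouldSkipEmptyWal(true, 0L, 3, 3, 1));
    }
}
```

Requiring both conditions keeps the skip conservative, but it does not remove the risk stated in the note: a file that reports 0 length because a block is temporarily missing would still be skipped.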

    Description

      Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen without a regionserver crash. This leaves recovered queues with empty WALs in the middle, which causes replication to get stuck:

      TRACE regionserver.ReplicationSource: Opening log <wal_file>
      WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got: 
      java.io.EOFException
      	at java.io.DataInputStream.readFully(DataInputStream.java:197)
      	at java.io.DataInputStream.readFully(DataInputStream.java:169)
      	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
      	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
      	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
      

      The WAL in question was completely empty but there were other WALs in the recovered queue which were newer and non-empty.
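
The trace above bottoms out in java.io.DataInputStream.readFully, which throws EOFException when the stream ends before the requested number of bytes has been read. The following is a minimal sketch of that behavior using only the JDK. The class name, the helper, and the 4-byte header length are assumptions chosen for illustration; it does not use the HBase or Hadoop readers.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names are hypothetical, not HBase API. */
final class EmptyWalReadSketch {

    /**
     * Tries to read a fixed-size header from the given file content and
     * reports whether the read hit end-of-stream before it completed.
     */
    static boolean headerReadHitsEof(byte[] content, int headerLength) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
            in.readFully(new byte[headerLength]); // throws EOFException on a short stream
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // A completely empty file: the header read fails immediately.
        System.out.println(headerReadHitsEof(new byte[0], 4));
        // A file that at least contains a 4-byte magic: the header read succeeds.
        System.out.println(headerReadHitsEof("PWAL".getBytes(StandardCharsets.US_ASCII), 4));
    }
}
```

Because an empty file fails the same way on every attempt, retrying the open cannot succeed unless the file's content changes, which is why the queue stays blocked behind it.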

      Attachments

        1. HBASE-18137.branch-1.3.v1.patch
          8 kB
          Vincent Poon
        2. HBASE-18137.branch-1.3.v2.patch
          9 kB
          Vincent Poon
        3. HBASE-18137.branch-1.3.v3.patch
          11 kB
          Vincent Poon
        4. HBASE-18137.branch-1.v1.patch
          10 kB
          Vincent Poon
        5. HBASE-18137.branch-1.v2.patch
          10 kB
          Vincent Poon
        6. HBASE-18137.master.v1.patch
          10 kB
          Vincent Poon

            People

              Assignee: Vincent Poon (vincentpoon)
              Reporter: Ashu Pachauri (ashu210890)