HBase / HBASE-18137

Replication gets stuck for empty WALs


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.4.0, 1.3.2, 2.0.0, 1.2.7
    • Component/s: Replication
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      0-length WAL files can cause the replication queue to get stuck. A new config "replication.source.eof.autorecovery" has been added: if set to true (default is false), a 0-length WAL file is skipped once 1) the max number of retries has been hit, and 2) there are more WAL files in the queue. The risk of enabling this is that the 0-length WAL file may actually contain data (e.g. a block went missing and will come back once a datanode is recovered).
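The skip rule in the release note can be sketched as a single predicate. This is only an illustration of the three stated conditions, not the code from the patch: the class name, method name, and parameters below are hypothetical, and only the config key string comes from the release note.

```java
public class EofAutoRecoverySketch {
    // Config key quoted from the release note; its default is false.
    static final String EOF_AUTORECOVERY_KEY = "replication.source.eof.autorecovery";

    /**
     * Hypothetical helper: decides whether a WAL that keeps failing with
     * EOFException may be skipped, following the release note's conditions.
     *
     * @param autoRecoveryEnabled value of the config key above
     * @param walLength           length of the WAL file in bytes
     * @param retries             failed open attempts so far
     * @param maxRetries          retry limit (assumed to be configured elsewhere)
     * @param walsQueuedAfterThis number of WALs queued behind this one
     */
    static boolean shouldSkipEmptyWal(boolean autoRecoveryEnabled, long walLength,
                                      int retries, int maxRetries,
                                      int walsQueuedAfterThis) {
        return autoRecoveryEnabled          // feature is opt-in
            && walLength == 0               // only 0-length files qualify
            && retries >= maxRetries        // 1) retry limit has been hit
            && walsQueuedAfterThis > 0;     // 2) more WALs remain in the queue
    }

    public static void main(String[] args) {
        // Empty WAL, retries exhausted, two newer WALs waiting behind it.
        System.out.println(shouldSkipEmptyWal(true, 0L, 3, 3, 2));
        // Same situation with the config left at its default.
        System.out.println(shouldSkipEmptyWal(false, 0L, 3, 3, 2));
    }
}
```

Requiring that more WALs be queued keeps the last file of a queue from being skipped, since an empty tail may simply be a WAL that has not been written yet.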

      Description

      Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen without a regionserver crash. As a result, recovered queues can have empty WALs in the middle, which causes replication to get stuck:

      TRACE regionserver.ReplicationSource: Opening log <wal_file>
      WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got: 
      java.io.EOFException
      	at java.io.DataInputStream.readFully(DataInputStream.java:197)
      	at java.io.DataInputStream.readFully(DataInputStream.java:169)
      	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
      	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
      	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
      

      The WAL in question was completely empty but there were other WALs in the recovered queue which were newer and non-empty.
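To make the broken assumption concrete, the following sketch scans a queue's file lengths and reports empty WALs that are not the last entry, which is the situation described above. The class, the method, and the use of a plain array of lengths are assumptions for illustration; a real check would read the lengths from the filesystem.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyWalScanSketch {
    /**
     * Hypothetical helper: returns the positions of 0-length WALs that sit
     * before the tail of the queue. An empty last WAL is tolerated, matching
     * the assumption replication already makes for recovered queues.
     *
     * @param walLengths file lengths in bytes, oldest WAL first
     */
    static List<Integer> emptyWalsBeforeTail(long[] walLengths) {
        List<Integer> positions = new ArrayList<>();
        // Stop one short of the end so the last WAL is never reported.
        for (int i = 0; i < walLengths.length - 1; i++) {
            if (walLengths[i] == 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Made-up lengths: an empty WAL in the middle and an empty tail.
        long[] queue = {120L, 0L, 340L, 0L};
        System.out.println(emptyWalsBeforeTail(queue));
    }
}
```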

        Attachments

        1. HBASE-18137.branch-1.3.v1.patch
          8 kB
          Vincent Poon
        2. HBASE-18137.branch-1.3.v2.patch
          9 kB
          Vincent Poon
        3. HBASE-18137.branch-1.3.v3.patch
          11 kB
          Vincent Poon
        4. HBASE-18137.branch-1.v1.patch
          10 kB
          Vincent Poon
        5. HBASE-18137.branch-1.v2.patch
          10 kB
          Vincent Poon
        6. HBASE-18137.master.v1.patch
          10 kB
          Vincent Poon

              People

              • Assignee:
                vincentpoon Vincent Poon
                Reporter:
                ashu210890 Ashu Pachauri