HBASE-26482

HMaster may clean WALs that are still being replicated in rare cases

Details

    • Reviewed

    Description

      In our cluster, I found FileNotFoundException errors while ReplicationSourceWALReader was running for a replication recovery queue.

      I guess the WAL was most likely removed by HMaster, and I found something that supports this.

      The method getAllWALs (https://github.com/apache/hbase/blob/master/hbase-replication/src/main/java/org/apache/hadoop/hbase/replication/ZKReplicationQueueStorage.java#L509) uses the ZooKeeper cversion of /hbase/replication/rs as an optimistic lock to control concurrent operations.

      However, the cversion only reflects changes to a znode's direct children, not changes to its grandchildren.
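
      To make this limitation concrete, here is a minimal in-memory sketch of the cversion behaviour. The names Znode, createChild and deleteChild are hypothetical and are not the ZooKeeper client API; the only thing modeled is that a znode's cversion is bumped when one of its direct children is created or deleted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a znode; NOT the ZooKeeper client API.
final class CversionSketch {
  static final class Znode {
    final Map<String, Znode> children = new HashMap<>();
    int cversion = 0; // bumped only when a DIRECT child is created or deleted

    Znode createChild(String name) {
      Znode child = new Znode();
      if (children.putIfAbsent(name, child) != null) {
        throw new IllegalStateException("node exists: " + name);
      }
      cversion++;
      return child;
    }

    void deleteChild(String name) {
      if (children.remove(name) == null) {
        throw new IllegalStateException("no such node: " + name);
      }
      cversion++;
    }
  }

  public static void main(String[] args) {
    Znode rs = new Znode();                 // stands in for /hbase/replication/rs
    Znode rs1 = rs.createChild("RS1");
    Znode rs2 = rs.createChild("RS2");
    rs2.createChild("peerid");              // a queue owned by RS2
    int v0 = rs.cversion;

    // A queue moves from RS2 to RS1: only grandchildren of rs change.
    rs2.deleteChild("peerid");
    rs1.createChild("peerid-RS2");
    System.out.println("cversion changed after queue move: " + (rs.cversion != v0));

    // Only removing the RS2 node itself changes the cversion of rs.
    rs.deleteChild("RS2");
    System.out.println("cversion changed after RS2 removed: " + (rs.cversion != v0));
  }
}
```

      In this model the first check reports no change and the second reports a change, which is the gap the optimistic lock cannot see.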

      So HMaster may miss some WALs returned by this method in the following situation:

      1. HMaster runs log cleaning and invokes getAllWALs to filter out the logs that must not be deleted.
      2. HMaster caches the current cversion of /hbase/replication/rs as v0.
      3. HMaster caches all RS server names and traverses them, collecting the WALs in each queue.
      4. RS2 dies after HMaster has traversed RS1 but before it traverses RS2.
      5. RS1 claims one queue of RS2, which is now named peerid-RS2.
      6. The cversion of /hbase/replication/rs does not change until all of RS2's queues have been removed, because the children of /hbase/replication/rs have not changed.
      7. So HMaster misses the WALs in peerid-RS2: RS1 has already been traversed, and this queue no longer exists under RS2.
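
      The steps above can be sketched as a small single-threaded simulation. This is a simplified, hypothetical model and not the HBase implementation: the class GetAllWalsRaceSketch, the WAL names, the second queue peerid2 and the afterServerScanned hook (used to inject the claim at the right moment) are all assumptions made for illustration. Like the rest of this report, it illustrates the suspected race rather than confirming it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Simplified, hypothetical model of the replication queue layout in ZooKeeper.
final class GetAllWalsRaceSketch {
  // server -> queue id -> WAL names; insertion order is the traversal order
  final Map<String, Map<String, Set<String>>> servers = new LinkedHashMap<>();
  int cversion = 0; // changes only when a server node is added or removed

  void addWal(String rs, String queue, String wal) {
    if (!servers.containsKey(rs)) {
      servers.put(rs, new LinkedHashMap<>());
      cversion++;
    }
    servers.get(rs).computeIfAbsent(queue, q -> new HashSet<>()).add(wal);
  }

  // "to" claims one queue of the dead server "from". The node of "from" is
  // removed (and cversion bumped) only once its last queue is gone.
  void claimQueue(String from, String queue, String to) {
    Set<String> wals = servers.get(from).remove(queue);
    servers.get(to).put(queue + "-" + from, wals);
    if (servers.get(from).isEmpty()) {
      servers.remove(from);
      cversion++;
    }
  }

  // Optimistic scan: retry only if cversion changed during the traversal.
  Set<String> getAllWals(Consumer<String> afterServerScanned) {
    while (true) {
      int v0 = cversion;
      Set<String> wals = new HashSet<>();
      for (String rs : new ArrayList<>(servers.keySet())) {
        Map<String, Set<String>> queues = servers.get(rs);
        if (queues != null) {
          for (Set<String> q : queues.values()) {
            wals.addAll(q);
          }
        }
        afterServerScanned.accept(rs); // hook to inject a concurrent event
      }
      if (cversion == v0) {
        return wals;
      }
    }
  }

  static Set<String> runScenario() {
    GetAllWalsRaceSketch zk = new GetAllWalsRaceSketch();
    zk.addWal("RS1", "peerid", "rs1-wal");
    zk.addWal("RS2", "peerid", "rs2-wal-a");
    zk.addWal("RS2", "peerid2", "rs2-wal-b"); // RS2 keeps one queue, so its node stays
    boolean[] claimed = {false};
    return zk.getAllWals(rs -> {
      // RS2 dies right after RS1 was traversed; RS1 claims one of its queues.
      if (rs.equals("RS1") && !claimed[0]) {
        claimed[0] = true;
        zk.claimQueue("RS2", "peerid", "RS1");
      }
    });
  }

  public static void main(String[] args) {
    Set<String> wals = runScenario();
    System.out.println("rs2-wal-a protected from cleaning: " + wals.contains("rs2-wal-a"));
  }
}
```

      In this model the scan returns without retrying and the WAL that moved into peerid-RS2 is absent from the result, so a cleaner relying on that result would treat it as deletable.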

      The above is currently only speculation and has not been confirmed.

      File not found log:

       

      2021-11-22 15:18:39,593 ERROR [ReplicationExecutor-0.replicationSource,peer_id-hostname,60020,1636802867348.replicationSource.wal-reader.hostname%2C60020%2C1636802867348,peer_id-hostname,60020,1636802867348] regionserver.WALEntryStream: Couldn't locate log: hdfs://namenode/hbase/oldWALs/hostname%2C60020%2C1636802867348.1636944748704
      2021-11-22 15:18:39,593 ERROR [ReplicationExecutor-0.replicationSource,peer_id-hostname,60020,1636802867348.replicationSource.wal-reader.hostname%2C60020%2C1636802867348,peer_id-hostname,60020,1636802867348] regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
      java.io.FileNotFoundException: File does not exist: hdfs://namenode/hbase/oldWALs/hostname%2C60020%2C1636802867348.1636944748704
              at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1612)
              at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1605)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1620)
              at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:64)
              at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
              at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321)
              at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
              at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291)
              at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427)
              at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:355)
              at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303)
              at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294)
              at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175)
              at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
              at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138) 

       

       


            People

              zhengzhuobinzzb zhuobin zheng