HBase / HBASE-26120

New replication gets stuck or loses data when multiwal has more than 10 groups

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.7.1, 2.4.5
    • Fix Version/s: 2.5.0, 2.3.6, 2.4.5, 1.7.2
    • Component/s: Replication
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      void preLogRoll(Path newLog) throws IOException {
        recordLog(newLog);
        String logName = newLog.getName();
        String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
        synchronized (latestPaths) {
          Iterator<Path> iterator = latestPaths.iterator();
          while (iterator.hasNext()) {
            Path path = iterator.next();
            if (path.getName().contains(logPrefix)) {
              iterator.remove();
              break;
            }
          }
          this.latestPaths.add(newLog);
        }
      }
      

      ReplicationSourceManager uses latestPaths to track the latest WAL of each WAL group, and all of them are enqueued for replication when a new replication peer is added.
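
The invariant described above is "exactly one latest WAL per group". As a minimal sketch of that invariant only (not the HBase implementation and not the committed patch), the tracking can be modelled as a map keyed by the group prefix. The class name `LatestWalPerGroup` and the helper `groupOf` are hypothetical; `groupOf` assumes the group prefix is everything before the last `.` in the WAL name.

```java
import java.util.HashMap;
import java.util.Map;

class LatestWalPerGroup {
  // Hypothetical stand-in for the prefix lookup: everything before the last '.'.
  static String groupOf(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  private final Map<String, String> latestByGroup = new HashMap<>();

  // Record a newly rolled WAL; it replaces only the entry of its own group.
  synchronized void onLogRoll(String newWal) {
    latestByGroup.put(groupOf(newWal), newWal);
  }

  // Copy of the current state, e.g. what would be enqueued for a new peer.
  synchronized Map<String, String> snapshot() {
    return new HashMap<>(latestByGroup);
  }

  public static void main(String[] args) {
    LatestWalPerGroup tracker = new LatestWalPerGroup();
    tracker.onLogRoll("regionserver.null1.100");
    tracker.onLogRoll("regionserver.null11.100");
    tracker.onLogRoll("regionserver.null1.200");
    System.out.println(tracker.snapshot());
  }
}
```

Because the lookup is by exact key, a roll in one group cannot touch the entry of another group, however similar the names are.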

      If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group names range from regionserver.null0.timestamp to regionserver.null11.timestamp. String.contains is used in preLogRoll to replace the old log of the same group, so when regionserver.null1.ts arrives, regionserver.null11.ts may be removed instead, because its name contains the prefix regionserver.null1. The stale log of group null1 stays in latestPaths, which keeps growing with wrong logs.
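
The collision can be reproduced without HBase. The sketch below mirrors the loop in preLogRoll using plain strings instead of Path. The class name and the helper `prefixOf` are hypothetical; `prefixOf` assumes the prefix is everything before the last `.`, and the example assumes the null11 entry is iterated before the null1 entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ContainsCollisionDemo {
  // Hypothetical stand-in for the prefix lookup: everything before the last '.'.
  static String prefixOf(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  // Same shape as the loop in preLogRoll: remove the first name that
  // contains the new log's prefix, then add the new log.
  static void roll(List<String> latest, String newLog) {
    String prefix = prefixOf(newLog);
    Iterator<String> it = latest.iterator();
    while (it.hasNext()) {
      if (it.next().contains(prefix)) {
        it.remove();
        break;
      }
    }
    latest.add(newLog);
  }

  public static void main(String[] args) {
    List<String> latest = new ArrayList<>(
        Arrays.asList("regionserver.null11.100", "regionserver.null1.100"));
    roll(latest, "regionserver.null1.200");
    // prints [regionserver.null1.100, regionserver.null1.200]
    System.out.println(latest);
  }
}
```

After the roll, group null11 has no entry at all and group null1 has two, one of them stale.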

      Replication then gets partly stuck because the stale regionserver.null1.ts no longer exists on HDFS, and data may not be replicated to the slave cluster because regionserver.null11.ts is not in the replication queue at startup.

      Because of ZOOKEEPER-706, if there are too many logs under the ZooKeeper znode /hbase/replication/rs/regionserver/peer, remove_peer may fail to delete this znode, and other region servers cannot pick up this queue for replication failover.

            People

              zhangduo Duo Zhang
              jasee Jasee Tao
              Votes: 0
              Watchers: 8
