HBASE-8207

Replication could have data loss when machine name contains hyphen "-"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.6, 0.95.0
    • Fix Version/s: 0.98.0, 0.94.7, 0.95.0
    • Component/s: Replication
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      While looking into the recent TestReplication* failures, I was finally able to find the cause (or one of the causes) of the intermittent failures.

      When a machine name contains "-", it breaks the function ReplicationSource.checkIfQueueRecovered, which causes the following issue:

      The deadRegionServers list is way off, so replication doesn't wait for log splitting to finish for a WAL file and moves on to the next one (data loss).

      You can see that replication uses these weird paths, constructed from deadRegionServers, to check whether a file exists:

      2013-03-26 21:26:51,385 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,386 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,387 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,389 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,391 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,394 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,396 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,398 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
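
      The fragments in these paths are what you get when the recovered queue name is split on every "-". The following is a minimal sketch of that effect, not the actual HBase code: the class and method names are hypothetical, the queue name is copied from the log thread name above, and the comma-aware variant is only one possible way to disambiguate. It assumes the peer id itself contains no hyphen and that every server name has the form host,port,startcode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of HBase.
public class DeadServerNameSketch {

    // Naive parsing: treat every "-" as a separator. Element 0 is taken as the
    // peer id, everything after it as a dead region server name.
    static List<String> naiveDeadServers(String queueId) {
        String[] parts = queueId.split("-");
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    // Comma-aware parsing (assumption: a server name is host,port,startcode).
    // A "-" only ends an entry once two commas have been seen in that entry,
    // so hyphens inside the host name are kept.
    static List<String> commaAwareDeadServers(String queueId) {
        List<String> servers = new ArrayList<>();
        int firstHyphen = queueId.indexOf('-');
        if (firstHyphen < 0) {
            return servers; // plain peer id, no dead servers recorded
        }
        String rest = queueId.substring(firstHyphen + 1);
        StringBuilder current = new StringBuilder();
        int commas = 0;
        for (char c : rest.toCharArray()) {
            if (c == '-' && commas >= 2) {
                servers.add(current.toString());
                current.setLength(0);
                commas = 0;
            } else {
                if (c == ',') {
                    commas++;
                }
                current.append(c);
            }
        }
        if (current.length() > 0) {
            servers.add(current.toString());
        }
        return servers;
    }

    public static void main(String[] args) {
        String queueId =
            "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        for (String s : naiveDeadServers(queueId)) {
            System.out.println("naive:       " + s);
        }
        for (String s : commaAwareDeadServers(queueId)) {
            System.out.println("comma-aware: " + s);
        }
    }
}
```

      The naive variant yields fragments such as "0", "156.us", "west" and "1.compute.internal,52170,1364333181125", which are the directory names probed under .logs in the log lines above. The comma-aware variant returns the single full server name.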
      

      This happened in the recent test failure in http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false

      Search for

      File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
      

      After 10 retries, the replication source gave up and moved on to the next file, so data loss happens.

      Since lots of EC2 machine names contain "-", including our Jenkins servers, this is a high-impact issue.

      1. failed.txt
        5.06 MB
        Jeffrey Zhong
      2. HBASE-8212-94.patch
        5 kB
        Jieshan Bean
      3. hbase-8207.patch
        5 kB
        Jeffrey Zhong
      4. hbase-8207_v1.patch
        9 kB
        Jeffrey Zhong
      5. hbase-8207_v2.patch
        9 kB
        Jeffrey Zhong
      6. hbase-8207-0.94-v1.patch
        9 kB
        Jeffrey Zhong
      7. hbase-8207_v2.patch
        9 kB
        Jeffrey Zhong
      8. 8207_v3.patch
        9 kB
        Ted Yu
      9. 8207-trunk-addendum.txt
        1 kB
        Ted Yu

          Activity

          Hudson added a comment -

          FAILURE: Integrated in HBase-0.92-security #150 (See https://builds.apache.org/job/HBase-0.92-security/150/)
          HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424)

          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
          Hudson added a comment -

          FAILURE: Integrated in HBase-0.92 #614 (See https://builds.apache.org/job/HBase-0.92/614/)
          HBASE-9154. [0.92] Backport HBASE-8207 Replication could have data loss when machine name contains hyphen - (Jeffrey) (apurtell: rev 1511424)

          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMasterReplication.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestMultiSlaveReplication.java
          • /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #13 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/13/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462518)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #476 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/476/)
          HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957)

          Result = FAILURE
          nkeywal :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #53 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/53/)
          HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958)

          Result = FAILURE
          nkeywal :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #4009 (See https://builds.apache.org/job/HBase-TRUNK/4009/)
          HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463957)

          Result = FAILURE
          nkeywal :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in hbase-0.95 #121 (See https://builds.apache.org/job/hbase-0.95/121/)
          HBASE-8207 Don't use hdfs append during lease recovery (Revision 1463958)

          Result = SUCCESS
          nkeywal :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSHDFSUtils.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestHLog.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #4000 (See https://builds.apache.org/job/HBase-TRUNK/4000/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462552)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462515)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in HBase-0.94 #927 (See https://builds.apache.org/job/HBase-0.94/927/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462518)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in hbase-0.95 #113 (See https://builds.apache.org/job/hbase-0.95/113/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462553)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462516)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in hbase-0.95-on-hadoop2 #47 (See https://builds.apache.org/job/hbase-0.95-on-hadoop2/47/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462553)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462516)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/branches/0.95/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.95/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.95/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in HBase-0.94-security #129 (See https://builds.apache.org/job/HBase-0.94-security/129/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462554)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462518)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java
          • /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #468 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/468/)
          HBASE-8207 Addendum fixes the typo in extractDeadServersFromZNodeString() (Revision 1462552)
          HBASE-8207 Replication could have data loss when machine name contains hyphen "-" (Jeffrey) (Revision 1462515)

          Result = FAILURE
          tedyu :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          tedyu :
          Files :

          • /hbase/trunk/hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          • /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
          Ted Yu added a comment -

          Method name corrected in 0.94, 0.95 and trunk.

          Thanks for the finding, Enis.

          Ted Yu added a comment -

          Addendum fixes the typo.

          Enis Soztutar added a comment -

          Noticed a typo: extracDeadServersFromZNodeString -> extractDeadServersFromZNodeString.

          Ted Yu added a comment -

          Integrated to 0.94, 0.95 and trunk.

          Thanks for the patch, Jeffrey.

          Thanks for the reviews, Lars, J-D and Jieshan.

          Ted Yu added a comment -

          Patch v3 wraps the long line.

          Jean-Daniel Cryans added a comment -

          +1, love the added test.

          Ted Yu added a comment -

          There is a long line:

          +    SortedMap<String, SortedSet<String>> testMap = rz1.copyQueuesFromRSUsingMulti(server.getServerName().getServerName());
          

          Jean-Daniel Cryans:
          Do you have more comments?

          Thanks

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12576001/hbase-8207_v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 lineLengths. The patch introduces lines longer than 100

          -1 site. The patch appears to cause mvn site goal to fail.

          +1 core tests. The patch passed unit tests in .

          Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5044//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          I'd rather not do that (at least in 0.94) lest we get that wrong.

          Jieshan Bean added a comment -

          New patch also looks good to me. Is it necessary to add restrictions on peer-id when calling add_peer?

          Lars Hofhansl added a comment -

          Looks good to me. replication.source.size.capacity is set small to cause lots of activity; 10k is OK.

          +1 on this one.

          Jeffrey Zhong added a comment -

          I incorporated Ted's two pieces of feedback and submitted two patches, one for trunk and one for 0.94.

          In the 0.94 patch, I also increased the "replication.source.size.capacity" setting to make the tests run faster, because once in a while the test cases failed with a real timeout.

          Thanks,
          -Jeffrey

          Ted Yu added a comment -
          +    ReplicationSource.extracDeadServersFromZNodeString(parts[1], this.deadRegionServers);
          

          Class name, ReplicationSource, is not needed, right ?

          +  public void testNodeFailoveDeadServerParsing() throws Exception {
          

          Typo: the 'ove' in Failover

          Jeffrey Zhong added a comment -

          Lars Hofhansl Jean-Daniel Cryans You're right. The sleep is not needed.

          I found another bug inside function copyQueuesFromRSUsingMulti where

                  if (hlogs == null || hlogs.size() == 0) {
                    continue; // empty log queue.
                  }
          

          The continue skips the empty folder deletion

          listOfOps.add(ZKUtilOp.deleteNodeFailSilent(oldClusterZnode));

          The skip leaves the parent folder non-empty, which then makes the root folder deletion fail

          ZKUtilOp.deleteNodeFailSilent(deadRSZnodePath)

          In the end, the whole function fails and it is a scary thing.

          I came up with a patch for trunk. The new patch includes all fixes, the review feedback, and a new test case so we can uncover this kind of issue much sooner. I'll submit a similar patch to 0.94 soon.

          Thanks,
          -Jeffrey
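
A minimal model of the cleanup ordering described in the comment above. This is not the HBase code: the operations are plain strings rather than ZKUtilOp instances, and the class and method names (QueueCopySketch, buildOps) are hypothetical. It only illustrates that an empty queue must still contribute its own delete operation, otherwise the final delete of the dead region server's znode cannot succeed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model: each "op" is a string standing in for a ZooKeeper multi operation.
final class QueueCopySketch {
    static List<String> buildOps(String deadRsZnode, Map<String, List<String>> queues) {
        List<String> ops = new ArrayList<>();
        for (Map.Entry<String, List<String>> queue : queues.entrySet()) {
            String oldClusterZnode = deadRsZnode + "/" + queue.getKey();
            List<String> hlogs = queue.getValue();
            if (hlogs != null) {
                for (String hlog : hlogs) {
                    ops.add("copy " + oldClusterZnode + "/" + hlog);
                    ops.add("delete " + oldClusterZnode + "/" + hlog);
                }
            }
            // Always delete the queue node, even when it held no logs.
            // Skipping this for empty queues is the bug described above.
            ops.add("delete " + oldClusterZnode);
        }
        // The parent can only be removed once every child has been removed.
        ops.add("delete " + deadRsZnode);
        return ops;
    }
}
```

With a single empty queue named "1" under a hypothetical "/rs/dead" znode, the resulting list contains the delete for "/rs/dead/1" before the delete for "/rs/dead".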

          Jean-Daniel Cryans added a comment -

          Mmm sorry about this one, guys. I guess that's what happens when you test only in one environment.

          1) When replication is waiting for log splitting to complete, there is no sleep in between, so we keep hitting the HDFS NameNode

          Since the HLog is always "somewhere", AFAIK we don't wait for log splitting to finish.

          About the patch, it seems that the "extract dead servers" logic should itself be extracted into its own method.

          Ted Yu added a comment - - edited

          @Jeff:
          Patch is great: we avoid incompatibilities for the next 0.94 release.
          Actually we can use your patch for 0.94 and switch to new delimiter in 0.95 / 0.98 so that code maintenance work is less.

          +  // Since servername can contains "-" like "ip-10-46-221-101.ec2.internal", so we need skip some
          
          can contains "-" like

          =>

          can contain "-", such as

          'we need skip' -> 'we need to skip'

          +  // 2-ip-10-46-221-101.ec2.internal,52170,1364333181125-ip-10-46-221-101.ec2.internal,52171,
          +  // 1364333181127-...
          

          Normally we don't want long lines. But for the above sample, we'd better keep it on the same line.

          +    // extract dead servers
          +    if (parts.length > 1) {
          

          Since there is no else block, you can invert the condition and bail out early.

          +      // valid server name delimiter "-" has to be after ","
          

          'after ","' -> 'after "," in ServerName'

          +                LOG.error("Found invalid server name:" + serverName);
          

          The above log sentence is the same for first and second server name candidates. It would be nice to use slightly different terms.

          +      if(startIndex < len - 1){
          

          I think we should add a log statement for the else block of the above check - basically there is no second server name in that case.

          One general minor comment is that you should insert space between if and the following left paren.

          Ted Yu added a comment -

          @Jeffrey:
          Can you upload patch for 0.94 ?

          I wonder if we should integrate Jeff's patch to 0.94 first where we can verify on EC2 cluster that the problem is fixed.

          Thanks

          Lars Hofhansl added a comment -

          +1 on Jeffrey's patch.

          Is this needed?

          +                // Sleep fixed interval to wait for log splitting work get done
          +                this.sleepForRetries("waiting for log splitting is done", 1);
          
          Jeffrey Zhong added a comment -

          Jieshan Bean Thanks for the patch! Changing the delimiter increases the burden on existing users who apply the patch. I'm fine with changing it for 0.96 and onwards, although diverging code bases add more work for support and developers in the future.

          I came up with a patch which keeps the existing "-". We can do that because only one place consumes (parses) the znode string, so we can handle the situation there.

          Please let me know what you think.

          In the patch, we also include two small fixes:

          1) When replication is waiting for log splitting to complete, there is no sleep in between, so we keep hitting the HDFS NameNode
          2) Replication checks possible file locations in reverse order, while in reality the possible log splitting location comes from the first failed RS. After the fix, we can cut trips to the NN too.

          I'm still thinking about how to include this in our tests so we can catch this error earlier.

          Thanks,
          -Jeffrey
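
A small sketch of the lookup order described in point 2 above, under stated assumptions: the dead server list is ordered oldest failure first, and each server contributes a plain directory and a "-splitting" directory, as in the "Possible location" log lines quoted elsewhere in this issue. The class and method names (CandidateLocations, candidates) are hypothetical and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the paths to probe, first failed region server first.
final class CandidateLocations {
    static List<String> candidates(String logsRoot, List<String> deadServers, String logName) {
        List<String> paths = new ArrayList<>();
        for (String server : deadServers) {
            // Directory of the dead server before log splitting starts.
            paths.add(logsRoot + "/" + server + "/" + logName);
            // Directory name used while log splitting is in progress.
            paths.add(logsRoot + "/" + server + "-splitting/" + logName);
        }
        return paths;
    }
}
```

Probing in this order means the most likely location is tried first, so a hit there saves the remaining NameNode round trips.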

          Andrew Purtell added a comment -

          Nice find. This has been vexing.

          Jieshan Bean added a comment -

          We found the same problem in our test environment, attaching the logs for your reference:

          2013-03-25 04:51:20,929 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] NB dead servers : 4 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:517)
          2013-03-25 04:51:20,929 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,930 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/130,60020,1364199883591-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,932 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,934 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/0-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,935 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,937 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/172-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,938 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,939 INFO  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] Possible location hdfs://hacluster/hbase/.logs/160-splitting/160-172-0-130%252C60020%252C1364199883591.1364200564291 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:528)
          2013-03-25 04:51:20,941 WARN  [ReplicationExecutor-0.replicationSource,1-160-172-0-130,60020,1364199883591] 1-160-172-0-130,60020,1364199883591 Got:  org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
          java.io.IOException: File from recovered queue is nowhere to be found
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:545)
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311)
          Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hacluster/hbase/.oldlogs/160-172-0-130%2C60020%2C1364199883591.1364200564291
          	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
          	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1692)
          	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1716)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:511)
          	... 1 more
          
          Jieshan Bean added a comment -

          I have finished a patch which uses "#" instead of "-". Sorry, I raised the same issue without noticing that this one already existed.

          Jeffrey Zhong added a comment -

          Ted Yu It depends on whether log splitting can finish in time before replication gives up. In our test case setup, log splitting normally completes within 1-2 secs while replication takes about 5 secs to give up. So in most cases, the test runs fine. In a local dev env, we don't have machine names containing "-", so we can't even reproduce it.

          Since we just changed Jenkins, I can't find more build history. From the results of my "flaky test detector" tool (HBASE-8018), which I ran on trunk on March 5, we can see that replication in trunk was flaky at that time:

          HBase-TRUNK (from last 10 runs)

          Failed Test Cases              3908 3909 3910 3912 3913 3914 3915 3916
          ========================================================
          org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover    1    1    1   -1    0   -1    0    1
          org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover    1    1    1   -1    0   -1    0    1

          HBase-0.95 (from last 10 runs configurable)

          Failed Test Cases                21   22   23   24   25   27
          ========================================================
          org.apache.hadoop.hbase.replication.testreplicationqueuefailover.queuefailover    1   -1    0    1   -1    0
          org.apache.hadoop.hbase.replication.testreplicationqueuefailovercompressed.queuefailover    0    1   -1    0   -1    0

          Ted Yu added a comment -

          I wonder why trunk builds didn't encounter this problem.
          From http://54.241.6.143/job/HBase-TRUNK/45/console:

          Running org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 66.645 sec
          ...
          Running org.apache.hadoop.hbase.replication.TestReplicationQueueFailover
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 77.707 sec
          
          Enis Soztutar added a comment -

          BTW, for future reviews, we should make sure to avoid carrying data around in znode names and file names, string parsing, etc., unless there is a valid reason.

          Lars Hofhansl added a comment -

          I like ':' as separator. We can do this in a backwards compatible way: since we know that ':' will not be in a valid hostname, it won't be in the concatenated strings. So when we read the data from ZK in ReplicationSource, we check whether we find a ':'; if we do, we split on that, otherwise we split on '-' as we do now. We will always write in the new format.
          Rolling restarts will still be a problem, though.
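
A rough sketch of the read-both, write-new scheme proposed in the comment above. The class and method names (QueueNameReader, splitQueueName, joinQueueName) are hypothetical; this is an illustration of the idea, not code from any attached patch.

```java
// Hypothetical sketch: accept either delimiter when reading, always write ':'.
final class QueueNameReader {
    static String[] splitQueueName(String queueZnode) {
        if (queueZnode.indexOf(':') >= 0) {
            // New format: ':' cannot appear in a hostname, so the split is unambiguous.
            return queueZnode.split(":");
        }
        // Legacy format: falls back to the existing "-" split, which is still
        // ambiguous for hostnames that contain hyphens.
        return queueZnode.split("-");
    }

    static String joinQueueName(String peerId, String deadServer) {
        return peerId + ":" + deadServer;
    }
}
```

Note that the legacy branch keeps the original weakness for queue names written before the upgrade, so names already stored in ZooKeeper with hyphenated hostnames would still be parsed incorrectly.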

          Lars Hofhansl added a comment -

          The same is in ReplicationZookeeper.copyQueuesFromRS, so at least it is nothing new.

          Jeffrey Zhong added a comment -

          Enis Soztutar This could be a production issue if a user sets up HBase with replication enabled in an EC2-like environment.

          Jimmy Xiang I thought about changing to another delimiter, ":", which is allowed in neither machine names nor HDFS file names, but that would cause a bit of a migration issue. The znode value can be either a peerId or peerId-<servername>-<servername>..., and a server name has to contain ",".

          Therefore, I'm planning to keep the current format and just patch the function checkIfQueueRecovered, so that we treat "-" as a valid server name delimiter only after we have seen a ",". In 0.96 or onwards we could use a different character like ":", which isn't allowed in either machine names or HDFS.

          Thanks,
          -Jeffrey
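
A minimal sketch of the comma-aware parsing idea described in the comment above, not the attached patch. It assumes the peer ID itself contains no "-", and that a server name has the form host,port,startcode, so the first "-" seen after a "," ends a server name. The names DeadServerParser and extractDeadServers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for "peerId-server1-server2..." queue znode names.
final class DeadServerParser {
    static List<String> extractDeadServers(String queueZnode) {
        List<String> servers = new ArrayList<>();
        int firstDash = queueZnode.indexOf('-');
        if (firstDash < 0) {
            return servers; // a plain peer ID: not a recovered queue
        }
        String rest = queueZnode.substring(firstDash + 1);
        boolean seenComma = false;
        int start = 0;
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == ',') {
                seenComma = true;
            } else if (c == '-' && seenComma) {
                // A "-" after a "," separates two server names; hyphens inside
                // the hostname come before any "," and are skipped.
                servers.add(rest.substring(start, i));
                start = i + 1;
                seenComma = false;
            }
        }
        if (start < rest.length()) {
            servers.add(rest.substring(start)); // last server name
        }
        return servers;
    }
}
```

For the sample name from the review, "2-ip-10-46-221-101.ec2.internal,52170,1364333181125-ip-10-46-221-101.ec2.internal,52171,1364333181127", this returns the two full server names instead of hostname fragments.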

          Jean-Marc Spaggiari added a comment -

          Anything which is not allowed in a host name might be acceptable, like pipe, %, or, as Jimmy Xiang proposed, a non-printable delimiter. But if we need to print that in the logs, then the pipe option might be better? Also, if this is stored in ZK, then indeed we might need to accept both the existing "-" and the new delimiter for reads, and use only the new delimiter for writes...

          Jimmy Xiang added a comment -

          A non-printable/configurable delimiter?

          Jimmy Xiang added a comment -

          We need to make sure there is a proper migration plan, at least mentioning the fix in release notes.

          Ted Yu added a comment -

          How about using '=' as the separator ?

          Enis Soztutar added a comment -

          I think the problem comes from:

          ReplicationZookeeper.copyQueuesFromRSUsingMulti

          src/main/java/org/apache/hadoop/hbase/replication//ReplicationZookeeper.java:        String newPeerId = peerId + "-" + znode;
          src/main/java/org/apache/hadoop/hbase/replication//regionserver/ReplicationSource.java:    String[] parts = peerClusterZnode.split("-");
          

          Fixing it should not be that hard. Jeffrey, this is a production bug as well as a test bug, right?
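
A small reproduction of the two quoted lines, using the hostname from the issue description. The class name NaiveSplitDemo and its helper are hypothetical, and the assumption is that the appended znode name is the dead region server's full server name.

```java
import java.util.Arrays;

// Hypothetical demo: the queue name is built with "-" and later split on "-".
final class NaiveSplitDemo {
    static String[] naiveSplit(String peerClusterZnode) {
        return peerClusterZnode.split("-");
    }

    public static void main(String[] args) {
        String peerId = "2";
        String znode = "ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        String newPeerId = peerId + "-" + znode;
        // Every hyphen in the hostname becomes a split point, so the single
        // server name is broken into seven bogus "dead server" fragments.
        System.out.println(Arrays.toString(naiveSplit(newPeerId)));
    }
}
```

The split yields eight parts, including "west" and "1.compute.internal,52170,1364333181125", which are the fragments that show up as directory names in the "Possible location" log lines in the description.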

          Lars Hofhansl added a comment -

          Jean-Daniel Cryans FYI.

          Lars Hofhansl added a comment -

          Making critical. This is bad.

          Jeffrey Zhong added a comment -

          Attached the log of the failed test case in case the Jenkins server cleans up the log later.


            People

            • Assignee:
              Jeffrey Zhong
              Reporter:
              Jeffrey Zhong
            • Votes:
              0
              Watchers:
              17

              Dates

              • Created:
                Updated:
                Resolved:

                Development