HBASE-6649: [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1]

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.94.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Have seen it twice in the recent past: http://bit.ly/MPCykB & http://bit.ly/O79Dq7 ..

      Looking briefly at the logs hints at a pattern - in both the failed test instances, there was an RS crash while the test was running.

      Attachments

      1. HBase-0.92 #502 test - queueFailover [Jenkins].html
        2.37 MB
        Devaraj Das
      2. HBase-0.92 #495 test - queueFailover [Jenkins].html
        2.48 MB
        Devaraj Das
      3. 6649-trunk.patch
        0.8 kB
        Devaraj Das
      4. 6649-trunk.patch
        1 kB
        Devaraj Das
      5. 6649-fix-io-exception-handling-1-trunk.patch
        3 kB
        Devaraj Das
      6. 6649-fix-io-exception-handling-1.patch
        3 kB
        Devaraj Das
      7. 6649-fix-io-exception-handling.patch
        2 kB
        Devaraj Das
      8. 6649-2.txt
        0.7 kB
        Ted Yu
      9. 6649-1.patch
        0.8 kB
        Devaraj Das
      10. 6649-0.92.patch
        1 kB
        Devaraj Das
      11. 6649.txt
        0.9 kB
        stack


          Activity

          Ted Yu added a comment -

          I found the following for a hanging TestReplication run in trunk:

          2012-08-25 20:43:46,903 WARN  [Master:0;sea-lab-0,43158,1345952626654] master.AssignmentManager(1606): Failed assignment of -ROOT-,,0.70236052 to sea-lab-0,60237,1345952626692, trying to assign elsewhere instead; retry=0
          org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
            at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
            at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
            at org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:187)
            at $Proxy17.openRegion(Unknown Source)
            at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:500)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1587)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1256)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1226)
            at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1221)
            at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:2103)
            at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:785)
            at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:665)
            at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:439)
            at java.lang.Thread.run(Thread.java:662)
          Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server is not running yet
            at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1766)
          
            at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1288)
            at org.apache.hadoop.hbase.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:178)
            ... 11 more
          
          2012-08-25 20:43:46,903 INFO  [Master:0;sea-lab-0,43158,1345952626654] master.RegionStates(250): Region {NAME => '-ROOT-,,0', STARTKEY => '', ENDKEY => '', ENCODED => 70236052,} transitioned from {-ROOT-,,0.70236052 state=PENDING_OPEN, ts=1345952626860, server=sea-lab-0,60237,1345952626692} to {-ROOT-,,0.70236052 state=OFFLINE, ts=1345952626903, server=null}
          
          2012-08-25 20:43:46,903 WARN  [Master:0;sea-lab-0,43158,1345952626654] master.AssignmentManager(1772): Can't move the region 70236052, there is no destination server available.
          
          2012-08-25 20:43:46,903 WARN  [Master:0;sea-lab-0,43158,1345952626654] master.AssignmentManager(1618): Unable to find a viable location to assign region -ROOT-,,0.70236052
          

          I suggest the following change:

          Index: hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
          ===================================================================
          --- hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java	(revision 1377368)
          +++ hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java	(working copy)
          @@ -520,7 +520,7 @@
           
               // disable and start the peer
               admin.disablePeer("2");
          -    utility2.startMiniHBaseCluster(1, 1);
          +    utility2.startMiniHBaseCluster(1, 2);
               Get get = new Get(rowkey);
               for (int i = 0; i < NB_RETRIES; i++) {
                 Result res = htable2.get(get);
          
          Ted Yu added a comment - edited

          From https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92-security/118/testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/:

          2012-08-28 17:29:54,404 DEBUG [main-EventThread] master.AssignmentManager(2911): based on AM, current region=.META.,,1.1028785192 is on server=juno.apache.org,43891,1346174923071 server being checked: juno.apache.org,55977,1346174923023
          2012-08-28 17:29:54,405 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZooKeeperWatcher(266): regionserver:43891-0x1396e4723930005 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
          2012-08-28 17:29:54,406 DEBUG [main-EventThread] master.ServerManager(394): Added=juno.apache.org,55977,1346174923023 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
          2012-08-28 17:29:54,406 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(266): master:55418-0x1396e4723930003 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/1/rs
          2012-08-28 17:29:54,406 DEBUG [RegionServer:1;juno.apache.org,43891,1346174923071-EventThread] zookeeper.ZKUtil(229): regionserver:43891-0x1396e4723930005 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
          2012-08-28 17:29:54,407 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] handler.ServerShutdownHandler(175): Splitting logs for juno.apache.org,55977,1346174923023
          ...
          2012-08-28 17:29:54,407 DEBUG [main-EventThread] zookeeper.ZKUtil(229): master:55418-0x1396e4723930003 Set watcher on existing znode /1/rs/juno.apache.org,43891,1346174923071
          2012-08-28 17:29:54,410 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.MasterFileSystem(267): Renamed region directory: hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting
          2012-08-28 17:29:54,410 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(894): dead splitlog worker juno.apache.org,55977,1346174923023
          2012-08-28 17:29:54,413 DEBUG [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(246): Scheduling batch of logs to split
          2012-08-28 17:29:54,414 INFO  [MASTER_SERVER_OPERATIONS-juno.apache.org,55418,1346174922926-0] master.SplitLogManager(248): started splitting logs in [hdfs://localhost:59869/user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023-splitting]
          ...
          2012-08-28 17:29:55,000 ERROR [IPC Server handler 7 on 59869] security.UserGroupInformation(1124): PriviledgedActionException as:jenkins.hfs.0 cause:java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
          2012-08-28 17:29:55,004 FATAL [RegionServer:0;juno.apache.org,55977,1346174923023.logRoller] regionserver.HRegionServer(1537): ABORTING region server juno.apache.org,55977,1346174923023: IOE in log roller
          java.io.IOException: cannot get log writer
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:715)
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:662)
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:594)
          	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
          	at java.lang.Thread.run(Thread.java:662)
          Caused by: java.io.IOException: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
          	at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:712)
          	... 4 more
          Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
          	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
          	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3251)
          	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
          	at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:198)
          	at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:601)
          	at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:442)
          	at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
          	... 5 more
          Caused by: org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/jenkins/hbase/.logs/juno.apache.org,55977,1346174923023
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1167)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1241)
          	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1188)
          	at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
          	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:396)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
          
          	at org.apache.hadoop.ipc.Client.call(Client.java:1070)
          	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
          	at $Proxy8.create(Unknown Source)
          	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
          	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
          	at $Proxy8.create(Unknown Source)
          	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3248)
          	... 13 more
          

          It is clear that log splitting (splitLog() call on master) raced with log roller (on region server).
          In run() of log roller:

                } catch (IOException ex) {
          

          One option is to distinguish FileNotFoundException from other IOEs and exit.
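
          For illustration, a minimal sketch of that option; LogRollerSketch and WalLike are hypothetical stand-ins, not the actual HBase LogRoller/HLog classes:

          import java.io.FileNotFoundException;
          import java.io.IOException;

          public class LogRollerSketch implements Runnable {
            /** Minimal stand-in for the WAL whose writer gets rolled. */
            public interface WalLike {
              void rollWriter() throws IOException;
            }

            private final WalLike wal;
            private volatile boolean running = true;

            public LogRollerSketch(WalLike wal) { this.wal = wal; }

            @Override
            public void run() {
              while (running) {
                try {
                  wal.rollWriter();
                } catch (FileNotFoundException fnfe) {
                  // The log directory is gone, e.g. already renamed by log splitting:
                  // retrying is pointless, so exit the roller instead of aborting.
                  running = false;
                } catch (IOException ioe) {
                  // Any other IOE keeps the existing behavior (abort).
                  throw new RuntimeException("IOE in log roller", ioe);
                }
              }
            }
          }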

          Ted Yu added a comment -

          Reducing replication.sleep.before.failover

          Ted Yu added a comment -

          Without the patch, TestReplication#queueFailover failed on the 4th iteration.

          With patch v2, 6 iterations passed.

          Running 100 more iterations.

          Ted Yu added a comment -

          Test failed on the 4th of the 100 iterations.

          Devaraj Das added a comment -

          Uploading the two outputs that I had saved (the links in the jira description aren't valid any more). The worrisome part for me is that in both cases, the replication seems to be incomplete (although the test waited a fair bit of time). The fact that one RS from each cluster crashes is expected in this test, and the test checks that replication succeeds even in this situation.

          Devaraj Das added a comment -

          > Test failed on the 4th of the 100 iterations.

          What failure did you see?

          Ted Yu added a comment -

          Failed tests: queueFailover(org.apache.hadoop.hbase.replication.TestReplication): Waited too much time for queueFailover replication. Waited 40364ms.

          stack added a comment -

          Should we disable this flapping test till it's figured out?

          Ted Yu added a comment -

          I think we should disable this test.
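
          For illustration, a minimal sketch of one conventional way to do that with JUnit 4 (an illustrative stand-alone example, not a patch attached to this issue):

          import org.junit.Ignore;
          import org.junit.Test;

          public class QueueFailoverDisableSketch {
            @Ignore("Flaky; see HBASE-6649")
            @Test
            public void queueFailover() throws Exception {
              // test body elided for the sketch
            }
          }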

          Lars Hofhansl added a comment -

          A flapping test is almost worse than a failing test: it adds to the runtime, but does not add confidence to the test run.

          There are some scary comments in there as well:

              // Takes about 20 secs to run the full loading, kill around the middle
              Thread killer1 = killARegionServer(utility1, 7500, rsToKill1);
              Thread killer2 = killARegionServer(utility2, 10000, rsToKill2);
          

          On what machine does it take 20s?
          I'd say we disable it for now.

          stack added a comment -

          I think we disable it in 0.92 and perhaps in 0.94 for 0.94.2 (unless someone fixes it in the meantime). We leave this issue as critical on trunk.

          Ted Yu added a comment -

          I saw that comment too.
          On my laptop the loading took about 20 seconds.

          Devaraj Das added a comment -

          So far it seems like an HDFS issue (somehow there are a couple of missing rows in the replicated data). In one or two days I will post some concrete comments.

          Devaraj Das added a comment -

          After spending some time debugging what was going on (taking the failure in http://bit.ly/RDdmPg as the test failure to debug), it seems to me that the problem is due to the way exceptions are handled in ReplicationSource.java. Basically, replication would fail with exceptions for all entries involved in a particular call to ReplicationSource.readAllEntriesToReplicateOrNextFile, even if the exception was thrown for the trailing entry(ies). This is because of the multiple calls to reader.next within readAllEntriesToReplicateOrNextFile. If the second call (within the while loop) throws an exception (like EOFException), it basically destroys the work done up until then. Therefore, some rows would never be replicated.

          The patch attached here changes the exception handling so that if an exception occurs on the second call, the method just returns (thereby allowing the present call to readAllEntriesToReplicateOrNextFile to proceed normally). The following call to readAllEntriesToReplicateOrNextFile would then actually throw the exception.

          With this patch, I stopped noticing failures similar to http://bit.ly/RDdmPg.

          However, I do see some other failures that I am still debugging (and that's why I renamed this issue to Part-1!)
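
          A minimal, self-contained sketch of that exception-handling pattern; WALReader and Entry are simplified stand-ins, not the actual ReplicationSource code or the attached patch:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          public class ReadBatchSketch {
            /** Stand-in for the WAL reader; returns null on a clean end of file. */
            public interface WALReader {
              Entry next() throws IOException;
            }
            public static class Entry { /* key + edit elided */ }

            /** Reads as many entries as possible; only the first read may throw. */
            public static List<Entry> readAllEntriesToReplicate(WALReader reader) throws IOException {
              List<Entry> batch = new ArrayList<Entry>();
              Entry entry = reader.next();   // an IOE here still propagates, as before
              while (entry != null) {
                batch.add(entry);
                try {
                  entry = reader.next();
                } catch (IOException ioe) {
                  // e.g. an EOFException on a half-written trailing record of a
                  // recovered log: keep what was already read instead of losing it;
                  // the next call will hit the same IOE and surface it.
                  break;
                }
              }
              return batch;                  // shipped by the caller even after an early break
            }
          }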

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12543752/6649-1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2779//console

          This message is automatically generated.

          stack added a comment -

          This patch makes sense to me. We replicate all up to the exception and then next time in, we should pick up the IOE again. Want me to commit this DD?

          Devaraj Das added a comment -

          Yeah .. although I should submit a patch for trunk as well..

          Ted Yu added a comment -

          When I ran DD's patch in trunk, TestReplication still hung.

          Devaraj Das added a comment -

          Ted Yu: This patch fixes a specific problem to do with replication missing rows, and in my observations, that leads to somewhat frequent TestReplication.queueFailover failures. On trunk, do you know which test hangs? There are probably more issues to fix in the replication area, and we should have follow-up jiras (and this jira is part-1).

          Devaraj Das added a comment -

          Patch for trunk

          Ted Yu added a comment -

          target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplication.txt was 0 length.
          There was no JVM left over from TestReplication by the time I got back to the computer.

          Lars Hofhansl added a comment -

          Patch looks good to me.
          (As Ted points out, there might be other issues as well.)

          Lars Hofhansl added a comment -

          I'd also like this in 0.94. The 0.92 patch will probably just apply cleanly; if not, I'll make one.

          Ted Yu added a comment -

          @J-D:
          What do you think?

          nit:

          +      } catch (IOException ie) {
          +        break;
          

          A log statement is desirable before break.

          Himanshu Vashishtha added a comment -

          lgtm.
          The exception will be re-thrown in the next try, so +0 on adding a log statement before break.

          stack added a comment -

          J-D is on vacation. Let me commit this. Will add the log message Ted suggests, though my sense is it's overkill; let's see. Would suggest a new issue for the other 'parts', DD.

          Devaraj Das added a comment -

          Don't mind adding a few comments around the exception handling..

          stack added a comment -

          Committed to trunk, 0.92, and 0.94. Thanks for the reviews lads and DD for the patch.

          stack added a comment -

          Here is what I applied. Includes Ted's suggested logging. I applied this same patch to 0.94 and 0.92 w/ -p1

          Hudson added a comment -

          Integrated in HBase-0.94 #450 (See https://builds.apache.org/job/HBase-0.94/450/)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

          Result = FAILURE
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hudson added a comment -

          Integrated in HBase-TRUNK #3307 (See https://builds.apache.org/job/HBase-TRUNK/3307/)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381287)

          Result = FAILURE
          stack :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hudson added a comment -

          Integrated in HBase-0.92 #557 (See https://builds.apache.org/job/HBase-0.92/557/)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381291)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Jean-Daniel Cryans added a comment -

          > This is because of multiple calls to reader.next within readAllEntriesToReplicateOrNextFile. If the second call (within the while loop) throws an exception (like EOFException), it basically destroys the work done up until then. Therefore, some rows would never be replicated.

          The position in the log is updated in ZK only once the edits are replicated; hence, even if you fail on the second or hundredth edit, the next region server that will be in charge of that log will pick up where the previous RS was (even if that means re-reading some edits).
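
          A tiny sketch of the at-least-once bookkeeping described there; PositionStore and Sink are hypothetical interfaces, not the actual replication classes:

          import java.util.List;

          public class ReplicationBatchSketch {
            /** Hypothetical stand-in for the ZK-backed position tracker. */
            public interface PositionStore {
              void setPosition(String log, long position);
            }
            /** Hypothetical stand-in for the sink on the peer cluster. */
            public interface Sink {
              void ship(List<String> edits) throws Exception;
            }

            public static void replicateBatch(String log, List<String> edits, long endPosition,
                                              Sink sink, PositionStore store) throws Exception {
              sink.ship(edits);                    // may fail; nothing has been recorded yet
              store.setPosition(log, endPosition); // only reached after a successful ship,
                                                   // so a retry re-reads (never skips) edits
            }
          }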

          Devaraj Das added a comment -

          The problem happened with a recovered log file.. (another RS was trying to replicate files of a previously crashed RS).

          The problem here is that the method reads some rows but loses them due to an exception eventually. Look for the lines with the string

          vesta.apache.org%2C57779%2C1345217521341.1345217601487

          in the file http://bit.ly/RDdmPg. You will see a bunch of lines like:

          java.io.EOFException: hdfs://localhost:60044/user/hudson/hbase/.oldlogs/vesta.apache.org%2C57779%2C1345217521341.1345217601487, entryStart=40929, pos=40960, end=40960, edit=3
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
          	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.addFileInfoToException(SequenceFileLogReader.java:252)
          	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:208)
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:427)
          	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
          

          Unless I have missed something, here the problem seems to have been caused by the fact that the second call to reader.next in the method readAllEntriesToReplicateOrNextFile fails (please let me know if you need more details).

          Jean-Daniel Cryans added a comment -

          Oh I see what you mean. Very good find! I wonder what's that gibberish at the end of the file.

          Devaraj Das added a comment -

          > Oh I see what you mean. Very good find! I wonder what's that gibberish at the end of the file.

          Thanks! Are you referring to the log file? I see the following at the end (no gibberish):

          2012-08-17 15:35:01,161 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(474): Opening log for replication vesta.apache.org%2C40480%2C1345217521368.1345217648386 at 258
          2012-08-17 15:35:01,164 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(429): currentNbOperations:13022 and seenEntries:0 and size: 0
          2012-08-17 15:35:01,164 DEBUG [RegionServer:1;vesta.apache.org,40480,1345217521368-EventThread.replicationSource,2] regionserver.ReplicationSource(549): Nothing to replicate, sleeping 100 times 10
          
          Jean-Daniel Cryans added a comment -

          What I meant is that the reader gets this 10 times:

          java.io.EOFException: hdfs://localhost:60044/user/hudson/hbase/.oldlogs/vesta.apache.org%2C57779%2C1345217521341.1345217601487, entryStart=40929, pos=40960, end=40960, edit=3
          

          So if I'm reading this correctly it's able to read the file and got 3 edits but gets an EOF. Is something half written? Then it gives up on the file:

          2012-08-17 15:33:50,099 INFO  [ReplicationExecutor-0.replicationSource,2-vesta.apache.org,57779,1345217521341] regionserver.ReplicationSourceManager(352): Done with the recovered queue 2-vesta.apache.org,57779,1345217521341
          

          And there's data loss.

          Devaraj Das added a comment -

          This log file belongs to a crashed RS, and yes, it seems like the last record wasn't completely written to the file before the RS crashed. That should be fine, i.e., no data loss should happen - in the queueFailover test, the client would have gotten exceptions from the flushCommit call, it would have retried the batch of 'put's, and the corresponding records would have ended up on another RS.

          Hudson added a comment -

          Integrated in HBase-0.94-security #52 (See https://builds.apache.org/job/HBase-0.94-security/52/)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

          Result = SUCCESS
          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Lars Hofhansl added a comment -

          Just failed again: https://builds.apache.org/job/PreCommit-HBASE-Build/2852//testReport/

          Jean-Daniel Cryans added a comment -

          We applied this patch on a cluster that replicates, and just about all the nodes stopped replicating after some time. This is what I see in the logs:

          2012-09-17 20:04:08,111 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78617132
          2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318, entryStart=78641557, pos=78771200, end=78771200, edit=84
          2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:164529 and seenEntries:84 and size: 154068
          2012-09-17 20:04:08,120 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 84
          2012-09-17 20:04:08,146 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #va1r3s24%2C10304%2C1347911704238.1347911706318 for position 78771200 in hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
          2012-09-17 20:04:08,158 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
          2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 93234
          2012-09-17 20:04:08,158 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r3s24%2C10304%2C1347911704238.1347911706318 at 78771200
          2012-09-17 20:04:08,163 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unexpected exception in ReplicationSource, currentPath=hdfs://va1r5s41:10101/va1-backup/.logs/va1r3s24,10304,1347911704238/va1r3s24%2C10304%2C1347911704238.1347911706318
          java.lang.IndexOutOfBoundsException
                  at java.io.DataInputStream.readFully(DataInputStream.java:175)
                  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
                  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2001)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1901)
                  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1947)
                  at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235)
                  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:394)
                  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:307)
          

          The file is still in HDFS and it's about double the size we see up there, so it wasn't the end of the file. Looking at other nodes, we always get "Break on IOE" before getting the exception that kills replication. This is why I think that this patch is the issue. Somehow reading up to the end is reading too far.

          We need to fix or backport.

          Lars Hofhansl added a comment - - edited

          You mean fix or rollback (the change)?

          Hide
          Devaraj Das added a comment -

          Looking at the logs/patch more closely.. Will get back soon.

          Hide
          Jean-Daniel Cryans added a comment -

          Lars Hofhansl Trying to figure out what the problem is first, although if we're in a hurry we can just roll back. (Not backport, doh!)

          Hide
          Jean-Daniel Cryans added a comment -

          Devaraj Das I'm still trying to figure out exactly how we get the IndexOutOfBoundsException (I'd say the file didn't get new data and we started reading exactly at the end and the DFSClient doesn't like that? Or it's missing something at the end?), but if it's a case of reading the tail of a recovered log then we could add a check like this:

                try {
                  entry = this.reader.next(entriesArray[currentNbEntries]);
                } catch (IOException ie) {
                  if (queueRecovered) {
                    LOG.debug("Break on IOE: " + ie.getMessage());
                    break;
                  } else {
                    throw ie;
                  }
                }
          
          Hide
          Jean-Daniel Cryans added a comment -

          But now that I think about it, it may crap out when coming back to read even on a recovered file. The data will all make it to the other cluster but that source will never be fully cleaned up.

          Which leads me to think that this is a bug in DFSClient. It's expecting something it's not getting.

          Hide
          Devaraj Das added a comment -

          Yeah, Jean-Daniel Cryans, not sure how one could get an IndexOutOfBoundsException. I can't see how the patch would make it surface either... The patch only catches and ignores IOE (as opposed to all exceptions)... But yeah, give me another hour please. Let me dig some more.

          Hide
          Devaraj Das added a comment -

          Has there been any change in your cluster environment (Hadoop version, etc.; perhaps a different version of the DFS client is causing the issue to surface)? [Not sure which Hadoop version you are on, but there is no chance you are hitting HDFS-1108, right?]

          Hide
          Devaraj Das added a comment -

          Okay, a plausible explanation:
          1. An IOException is thrown inside ReplicationSource.readAllEntriesToReplicateOrNextFile (which causes the "Break on IOE:" log to print), but the method ignores the exception.
          2. When readAllEntriesToReplicateOrNextFile returns, the reader's file-pointer position is queried and 'this.position' is set to that (the reader's file pointer is possibly pointing at gibberish).
          3. Eventually, readAllEntriesToReplicateOrNextFile gets called again, and this time this.reader.next inside it throws an IndexOutOfBoundsException because it reads gibberish (looking at the code of DataInputStream.java, one of the cases where IndexOutOfBoundsException is thrown is when the length passed to readFully is less than 0).

          The fix I can think of is to reset the reader's 'position' to the last valid position (upon return from the method readAllEntriesToReplicateOrNextFile).

          Thoughts on the above? Does the analysis make sense?
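
          For illustration only, here is a minimal, self-contained sketch of the "advance the position only past complete entries, otherwise rewind to the last valid one" idea. The WalReader interface and Tailer class are hypothetical stand-ins for ReplicationSource and its WAL reader, not the actual patch:

                import java.io.IOException;

                // Hypothetical stand-ins for illustration; not HBase APIs.
                interface WalReader {
                  Object next() throws IOException;      // null on a clean end of available data
                  long getPosition() throws IOException; // current byte offset in the file
                  void seek(long pos) throws IOException;
                }

                class Tailer {
                  private final WalReader reader;
                  private long position; // last offset known to sit on an entry boundary

                  Tailer(WalReader reader, long startPosition) {
                    this.reader = reader;
                    this.position = startPosition;
                  }

                  void readBatch() throws IOException {
                    while (true) {
                      Object entry;
                      try {
                        entry = reader.next();
                      } catch (IOException partialTail) {
                        // A torn entry at the tail: do not record reader.getPosition(),
                        // which may now point into the middle of an entry and make the
                        // next pass blow up (e.g. with an IndexOutOfBoundsException).
                        // Rewind to the last known-good boundary instead.
                        reader.seek(position);
                        return;
                      }
                      if (entry == null) {
                        return; // clean end of the data currently available
                      }
                      position = reader.getPosition(); // advance only past complete entries
                    }
                  }
                }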

          Hide
          Devaraj Das added a comment -

          This patch demonstrates what I commented on earlier. Please have a look. I could make a method that combines the getPosition() and next() calls... but I wanted to check whether folks agree with the fix first.

          Hide
          Lars Hofhansl added a comment -

          Should we pull HBASE-6719 into this?

          Hide
          Jean-Daniel Cryans added a comment -

          The patch only catches and ignores IOE (as opposed to all exceptions)

          What it does do is permit reading up to the end of the file.

          [Not sure which hadoop version you are on, but there is no chance you are hitting HDFS-1108, right?]

          We are on CDH3u3, didn't change when we applied the patch.

          Okay a plausible explanation -

          It's plausible, but unless we really understand what that "gibberish" at the end of the file is, we can't truly make a fix. I don't know why that IOE is thrown; normally we just silently finish reading from the file. There is some special case here.

          Should we pull HBASE-6719 into this?

          I think those are separate issues.

          Hide
          Lars Hofhansl added a comment -

          I say we revert from 0.94.2 and retry in 0.94.3.

          Although from DD's comment:

          If the second call (within the while loop) throws an exception (like EOFException), it basically destroys the work done up until then. Therefore, some rows would never be replicated.

          This would be a dataloss issue without the fix.

          I find that a bit confusing. Since J-D eventually saw the ignored exception on all machines in the test cluster, it would mean data was lost in all versions before 0.94.2? That seems very unlikely.

          Hide
          Devaraj Das added a comment -

          Attaching a more complete fix (for 0.94)

          Jean-Daniel Cryans, could you please try this patch out in your cluster.

          The more I think about it, the more I am beginning to believe that setting the position so that it always points to a valid location is the fix here...

          Lars Hofhansl I have seen dataloss issues (via the unit test) without this patch..

          Hide
          Jean-Daniel Cryans added a comment -

          This would be a dataloss issue without the fix.

          I have seen dataloss issues (via the unit test) without this patch..

          FWIW if there was indeed dataloss caused by this, it would have been when recovering logs. During normal operation that exception was retried until we were able to read the file.

          could you please try this patch out in your cluster.

          It's not exactly a test cluster, more like prod-ish, so I'll put it on only one machine. I assume it might take the whole day to hit the condition.

          Hide
          Devaraj Das added a comment -

          Thanks, JD

          Hide
          Jean-Daniel Cryans added a comment -

          The server that has the patch did a "Break on IOE" twice, and it seems to work:

          2012-09-19 21:26:50,104 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r6s44%2C10304%2C1348088378534.1348089931722 at 21992487
          2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Break on IOE: hdfs://va1r5s41:10101/va1-backup/.logs/va1r6s44,10304,1348088378534/va1r6s44%2C10304%2C1348088378534.1348089931722, entryStart=21993911, pos=22058496, end=22058496, edit=5
          2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:783007 and seenEntries:5 and size: 64585
          2012-09-19 21:26:50,110 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 5
          2012-09-19 21:26:50,119 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #va1r6s44%2C10304%2C1348088378534.1348089931722 for position 21993911 in hdfs://va1r5s41:10101/va1-backup/.logs/va1r6s44,10304,1348088378534/va1r6s44%2C10304%2C1348088378534.1348089931722
          2012-09-19 21:26:50,129 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
          2012-09-19 21:26:50,129 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 145502
          2012-09-19 21:26:50,129 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication va1r6s44%2C10304%2C1348088378534.1348089931722 at 21993911
          

          One thing I saw that this patch breaks is the size in "currentNbOperations:783007 and seenEntries:5 and size: 64585", because it relies on this.position being the position at the beginning. I often see that number at 0 while there are edits to replicate. It's minor, since in HBASE-6804 I'm removing that log message altogether, but within the context of this jira we may want to either remove the size or keep track of what the position is at the beginning of the loop (a sketch of the latter follows below).
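
          As a tiny sketch of the "track it at the beginning of the loop" option, reusing the hypothetical WalReader stand-in from the sketch above (again, assumed names rather than the actual ReplicationSource code):

                import java.io.IOException;

                class BatchSizeTracking {
                  // Capture the offset once at the top of the pass so the logged size
                  // reflects only the bytes consumed by this pass, instead of relying
                  // on a 'position' field that the fix may have reset.
                  static long readBatchAndMeasure(WalReader reader) throws IOException {
                    long startOffset = reader.getPosition();
                    while (reader.next() != null) {
                      // ship the entry to the sink cluster ...
                    }
                    return reader.getPosition() - startOffset;
                  }
                }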

          Hide
          Devaraj Das added a comment -

          Good to know, JD. I'll submit a patch with the logging addressed in a bit.

          Hide
          Devaraj Das added a comment -

          Attaching a patch with the 'position' fix.

          Hide
          Devaraj Das added a comment -

          Same patch, for trunk.

          Hide
          Lars Hofhansl added a comment -

          +1 on last patch.

          Hide
          Lars Hofhansl added a comment -

          J-D, any objections to committing this?

          Hide
          Jean-Daniel Cryans added a comment -

          I'm going to create a new jira first (should have done that when I found that problem) and post the patches there with a small nit fixed.

          Hide
          Jean-Daniel Cryans added a comment -

          Re-closing, I opened HBASE-6847.

          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK #3360 (See https://builds.apache.org/job/HBase-TRUNK/3360/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388161)

          Result = FAILURE
          jdcryans :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94 #476 (See https://builds.apache.org/job/HBase-0.94/476/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)

          Result = FAILURE
          jdcryans :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92 #583 (See https://builds.apache.org/job/HBase-0.92/583/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388159)
          Fixing the CHANGES.txt after 0.92.2's release and adding HBASE-6649 (Revision 1388157)

          Result = SUCCESS
          jdcryans :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          jdcryans :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          Hide
          Hudson added a comment -

          Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #184 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/184/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388161)

          Result = FAILURE
          jdcryans :
          Files :

          • /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security #53 (See https://builds.apache.org/job/HBase-0.94-security/53/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)

          Result = SUCCESS
          jdcryans :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.94-security-on-Hadoop-23 #8 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/8/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388160)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381289)

          Result = FAILURE
          jdcryans :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          stack :
          Files :

          • /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          Hudson added a comment -

          Integrated in HBase-0.92-security #143 (See https://builds.apache.org/job/HBase-0.92-security/143/)
          HBASE-6847 HBASE-6649 broke replication (Devaraj Das via JD) (Revision 1388159)
          Fixing the CHANGES.txt after 0.92.2's release and adding HBASE-6649 (Revision 1388157)
          HBASE-6649 [0.92 UNIT TESTS] TestReplication.queueFailover occasionally fails [Part-1] (Revision 1381291)

          Result = FAILURE
          jdcryans :
          Files :

          • /hbase/branches/0.92/CHANGES.txt
          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

          jdcryans :
          Files :

          • /hbase/branches/0.92/CHANGES.txt

          stack :
          Files :

          • /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
          Hide
          stack added a comment -

          Fix up after bulk move overwrote some 0.94.2 fix versions w/ 0.95.0 (Noticed by Lars Hofhansl)


            People

            • Assignee:
              Devaraj Das
              Reporter:
              Devaraj Das
            • Votes:
              0
              Watchers:
              10
