HBASE-9338: Test Big Linked List fails on Hadoop 2.1.0

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.96.0
    • Fix Version/s: 0.98.0, 0.96.0
    • Component/s: test
    • Labels:
      None
    1. HBASE-9338-TESTING-3.patch
      41 kB
      Elliott Clark
    2. HBASE-9338-TESTING-2.patch
      39 kB
      Elliott Clark
    3. HBASE-9338-1.patch
      4 kB
      Elliott Clark
    4. HBASE-9338-0.patch
      4 kB
      Elliott Clark

      Issue Links

        Activity

        Devaraj Das added a comment -

        Elliott Clark says the following in a comment on this jira.

        So I ran a job where inside of the generation job I run a get right after the puts. This was SLOW. Like really really really slow. 10x slower. What has happened to our get speed ?

        Wondering if this was investigated further.
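
        For reference, a minimal sketch of what "a get right after the puts" looks like with the 0.96-era HBase client API; the table and column names are taken from this issue, but the code is illustrative and not the actual generator job:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class PutThenGetSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "IntegrationTestBigLinkedList");
            byte[] row = Bytes.toBytes("example-row");
            Put put = new Put(row);
            put.add(Bytes.toBytes("meta"), Bytes.toBytes("prev"), Bytes.toBytes("example-prev-row"));
            table.put(put);
            // Read the row back immediately; this is the pattern reported as ~10x slower.
            Result result = table.get(new Get(row));
            System.out.println("row present: " + !result.isEmpty());
            table.close();
          }
        }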

        stack added a comment -

        Reresolving. The committed patch makes this test pass again by avoiding the issue. Some fixes have gone in to fix the issue, and Jimmy is on it over in HBASE-9514. So reresolving this one. HBASE-9514 reenables the test to its old state. Can make that a blocker if we find stuff is still broken.

        Elliott Clark added a comment -

        This patch only reduces the probability of running into HBASE-9514, I am afraid.

        Yep, really this patch was just to get us farther into testing 0.96 rc's.

        Jimmy Xiang added a comment -

        I will revert some of the changes in this patch as a test to make sure the patch for HBASE-9514 is good.

        Enis Soztutar added a comment -

        Hmm, it seems the old jars were in the classpath. But regardless, the merge and balancer actions are still doing region moves. This patch only reduces the probability of running into HBASE-9514, I am afraid. Plus, once HBASE-9514 is in, I propose we revert this patch to test properly.

        Elliott Clark added a comment -

        Enis Soztutar did that run have this patch in it?

        13/09/13 03:39:54 INFO actions.Action: Killing region server:hor8n10,60020,1379043329928
        13/09/13 03:39:56 INFO actions.Action: Killed region server:hor8n10,60020,1379043329928. Reported num of rs:8
        13/09/13 03:39:56 INFO actions.Action: Sleeping for:60000
        13/09/13 03:40:05 INFO actions.Action: Performing action: Move random region of table IntegrationTestBigLinkedList
        

        So there are only 11 seconds between the kill and the move, even though the chaos monkey thread should be sleeping for 60 seconds. After this issue the move should always sleep 20 seconds before moving a region, and it shouldn't happen in parallel with a kill.

        Enis Soztutar added a comment -

        We have confirmed that the data loss is due to the log splitting + moving region problem:
        CM kills an RS (hor8n10) and then moves a region (d353d3dc87562957ef5c74559ccb48ca)

        13/09/13 03:39:54 INFO actions.Action: Performing action: Restart random region server
        13/09/13 03:39:54 INFO actions.Action: Killing region server:hor8n10,60020,1379043329928
        13/09/13 03:39:56 INFO actions.Action: Killed region server:hor8n10,60020,1379043329928. Reported num of rs:8
        13/09/13 03:39:56 INFO actions.Action: Sleeping for:60000
        13/09/13 03:40:05 INFO actions.Action: Performing action: Move random region of table IntegrationTestBigLinkedList
        

        The master schedules log splitting and carries it out:

        2013-09-13 03:39:57,718 INFO  [MASTER_SERVER_OPERATIONS-hor8n02:60000-1] master.SplitLogManager: started splitting 6 logs in [hdfs://hor8n01:8020/apps/hbase/data/WALs/hor8n10,60020,1379043329928-splitting]
        

        In the meantime, the CM move-region action kicks in, unassigns the region, and assigns it to hor8n01:

        2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] master.HMaster: Client=hrt_qa//68.142.245.212 unassign IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca. in current location if it is online and reassign.force=false
        2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] master.AssignmentManager: Starting unassign of IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca. (offlining)
        2013-09-13 03:40:06,030 DEBUG [RpcServer.handler=44,port=60000] zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating unassigned node d353d3dc87562957ef5c74559ccb48ca in a CLOSING state
        2013-09-13 03:40:06,044 INFO  [RpcServer.handler=44,port=60000] master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=OPEN, ts=1379043386591, server=hor8n10,60020,1379043329928} to {d353d3dc87562957ef5c74559ccb48ca state=PENDING_CLOSE, ts=1379043606044, server=hor8n10,60020,1379043329928}
        2013-09-13 03:40:06,044 DEBUG [RpcServer.handler=44,port=60000] zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleting existing unassigned node d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
        2013-09-13 03:40:06,063 DEBUG [RpcServer.handler=44,port=60000] zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Deleted unassigned node d353d3dc87562957ef5c74559ccb48ca in expected state M_ZK_REGION_CLOSING
        2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] master.RegionStates: Offline d353d3dc87562957ef5c74559ccb48ca with current state=PENDING_CLOSE, expected state=OFFLINE/SPLITTING/MERGING
        2013-09-13 03:40:06,064 INFO  [RpcServer.handler=44,port=60000] master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=PENDING_CLOSE, ts=1379043606044, server=hor8n10,60020,1379043329928} to {d353d3dc87562957ef5c74559ccb48ca state=OFFLINE, ts=1379043606064, server=null}
        2013-09-13 03:40:06,064 INFO  [RpcServer.handler=44,port=60000] master.RegionStates: Offlined d353d3dc87562957ef5c74559ccb48ca from hor8n10,60020,1379043329928
        2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] master.HMaster: Region IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca. is not online on any region server, reassigning it.
        2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca. so generated a random one; hri=IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca., src=, dest=hor8n01,60020,1379043340943; 7 (online=7, available=7) available servers, forceNewPlan=false
        2013-09-13 03:40:06,064 DEBUG [RpcServer.handler=44,port=60000] zookeeper.ZKAssign: master:60000-0x140e5bf86760799 Creating (or updating) unassigned node d353d3dc87562957ef5c74559ccb48ca with OFFLINE state
        2013-09-13 03:40:06,089 INFO  [RpcServer.handler=44,port=60000] master.AssignmentManager: Assigning IntegrationTestBigLinkedList,&\x07\x86\xC5;\xE6\xD9\x18&\xCB\x0B\x89\xF9\xD3\xE6\xC4,1379043123821.d353d3dc87562957ef5c74559ccb48ca. to hor8n01,60020,1379043340943
        2013-09-13 03:40:06,089 INFO  [RpcServer.handler=44,port=60000] master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=OFFLINE, ts=1379043606064, server=null} to {d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, ts=1379043606089, server=hor8n01,60020,1379043340943}
        2013-09-13 03:40:06,121 DEBUG [AM.ZK.Worker-pool2-t71] master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=hor8n01,60020,1379043340943, region=d353d3dc87562957ef5c74559ccb48ca, current state from region state map ={d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, ts=1379043606089, server=hor8n01,60020,1379043340943}
        2013-09-13 03:40:06,121 INFO  [AM.ZK.Worker-pool2-t71] master.RegionStates: Transitioned from {d353d3dc87562957ef5c74559ccb48ca state=PENDING_OPEN, ts=1379043606089, server=hor8n01,60020,1379043340943} to {d353d3dc87562957ef5c74559ccb48ca state=OPENING, ts=1379043606121, server=hor8n01,60020,1379043340943}
        

        However, you can see that at this time the log splitting is not finished yet:

        2013-09-13 03:40:08,022 INFO  [hor8n02,60000,1379042853257.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 2 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043472554=last_update = 1379043597820 last_version = 2 cur_worker_name = hor8n08,60020,1379043383930 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done = 4 error = 0, /hbase/splitWAL/WALs%2Fhor8n10%2C60020%2C1379043329928-splitting%2Fhor8n10%252C60020%252C1379043329928.1379043582933=last_update = 1379043597824 last_version = 2 cur_worker_name = hor8n02,60020,1379043313923 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 6 done = 4 error = 0}
        

        And indeed, you can see the log splits finishing up after the region is opened:

        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:06,786 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Applied 47550, skipped 0, firstSequenceidInLog=30260132, maxSequenceidInLog=30335308, path=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
        ./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,899 DEBUG [split-log-closeStream-1] wal.HLogSplitter: Closing hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp
        ./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:07,971 INFO  [split-log-closeStream-1] wal.HLogSplitter: Closed wap hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp (wrote 2788 edits in 136ms)
        ./hbase-hbase-regionserver-hor8n08:2013-09-13 03:40:08,189 DEBUG [split-log-closeStream-1] wal.HLogSplitter: Rename hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030002080.temp to hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030104001
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,571 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029347901
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,732 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000029358999
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:08,965 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030001566
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,140 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030197753
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,348 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030260110
        ./hbase-hbase-regionserver-hor8n01:2013-09-13 03:40:09,514 DEBUG [RS_OPEN_REGION-hor8n01:60020-1] regionserver.HRegion: Deleted recovered.edits file=hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335308
        ./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,962 DEBUG [split-log-closeStream-2] wal.HLogSplitter: Closing hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp
        ./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:09,992 INFO  [split-log-closeStream-2] wal.HLogSplitter: Closed wap hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp (wrote 808 edits in 195ms)
        ./hbase-hbase-regionserver-hor8n02:2013-09-13 03:40:10,057 DEBUG [split-log-closeStream-2] wal.HLogSplitter: Rename hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030335361.temp to hdfs://hor8n01:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/d353d3dc87562957ef5c74559ccb48ca/recovered.edits/0000000000030364775
        

        Notice that the region is already opened (it is deleting the recovered.edits files), but two log-splitting tasks finish after this.

        Enis Soztutar added a comment -

        Reopening this. We still have a problem.

        Enis Soztutar added a comment -

        Unfortunately, I ran into the test failure even without doing the region moves. My inspection of an earlier run yesterday did not reveal a root cause. I'll continue looking today.

                org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
                        REFERENCED=799996706
                        UNDEFINED=1647
                        UNREFERENCED=1647
        
        Hudson added a comment -

        SUCCESS: Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #724 (See https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/724/)
        HBASE-9338 Test Big Linked List fails on Hadoop 2.1.0 (eclark: rev 1522713)

        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRandomRegionOfTableAction.java
        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRegionsOfTableAction.java
        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/factories/SlowDeterministicMonkeyFactory.java
        Devaraj Das added a comment -

        Another one - https://issues.apache.org/jira/browse/HBASE-9525. I think a "move" of a region should be done with extreme care. I hope we don't run into these move-related issues when the balancer runs (eventually leading to region moves).

        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.96-hadoop2 #21 (See https://builds.apache.org/job/hbase-0.96-hadoop2/21/)
        HBASE-9338 Test Big Linked List fails on Hadoop 2.1.0 (eclark: rev 1522712)

        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRandomRegionOfTableAction.java
        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRegionsOfTableAction.java
        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/factories/SlowDeterministicMonkeyFactory.java
        Hudson added a comment -

        FAILURE: Integrated in HBase-TRUNK #4499 (See https://builds.apache.org/job/HBase-TRUNK/4499/)
        HBASE-9338 Test Big Linked List fails on Hadoop 2.1.0 (eclark: rev 1522713)

        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRandomRegionOfTableAction.java
        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRegionsOfTableAction.java
        • /hbase/trunk/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/factories/SlowDeterministicMonkeyFactory.java
        Hudson added a comment -

        SUCCESS: Integrated in hbase-0.96 #40 (See https://builds.apache.org/job/hbase-0.96/40/)
        HBASE-9338 Test Big Linked List fails on Hadoop 2.1.0 (eclark: rev 1522712)

        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRandomRegionOfTableAction.java
        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/actions/MoveRegionsOfTableAction.java
        • /hbase/branches/0.96/hbase-it/src/test/java/org/apache/hadoop/hbase/chaos/factories/SlowDeterministicMonkeyFactory.java
        Elliott Clark added a comment -

        Thanks for all of the help, everyone. I just committed the test fix. Jimmy's working on the AM fix in HBASE-9514.

        stack added a comment -

        +1

        Meantime Jimmy is figuring out how to prevent assignment while a region's logs are being replayed.

        Thanks E.

        Elliott Clark added a comment -

        So the patch changes the slow deterministic monkey. It changes the lists of actions so that MoveRandomRegionOfTableAction and MoveRegionsOfTableAction can't run in parallel with region servers being killed.

        Then it adds extra sleep in those actions so that there's even more buffer.
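
        A rough sketch of the shape of that change; this is not the committed patch, and the class below is illustrative, only mirroring the description above (an extra sleep before a region move):

        public class SleepThenMoveRegionAction {
          private final long sleepBeforeMoveMs;

          public SleepThenMoveRegionAction(long sleepBeforeMoveMs) {
            this.sleepBeforeMoveMs = sleepBeforeMoveMs;
          }

          public void perform() throws Exception {
            // Give log splitting from any recent region server kill time to finish
            // before asking the master to move a region.
            Thread.sleep(sleepBeforeMoveMs);
            moveRandomRegion();
          }

          private void moveRandomRegion() {
            // ... pick a random region of the table and request a move via the master ...
          }
        }

        Per the description above, the committed patch also rearranged the action lists so the move actions are never scheduled alongside the kill actions.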

        Elliott Clark added a comment - edited

        The last IT test run passed with the version 1 patch.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12602803/HBASE-9338-1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 hadoop1.0. The patch compiles against the hadoop 1.0 profile.

        +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 lineLengths. The patch does not introduce lines longer than 100

        +1 site. The mvn site goal succeeds with this patch.

        +1 core tests. The patch passed unit tests in .

        Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
        Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/7192//console

        This message is automatically generated.

        Elliott Clark added a comment -

        Something like this is what I was thinking for the test change, to make sure we still have move coverage but it's not happening while log splitting is going on.

        I'm running another big linked list run right now to make sure that it works.

        Devaraj Das added a comment -

        After inspecting the logs that led to the creation of HBASE-9480 and HBASE-9481, Jeffrey Zhong was convinced (here http://bit.ly/1e3hsfD, and in offline discussions with me) that the region move was the killer for this test. Was he right! (Caveat: the issue Enis brought up today on the mail thread on the RC.)

        Elliott Clark added a comment -

        Ideally both. I think the test fix can go in before a new rc. And then we can work on adding protection for assign later.

        stack added a comment -

        Elliott Clark so we need to fix the test? Or add protection so no assign can work during WAL split?

        Elliott Clark added a comment -

        So it looks like what Jimmy Xiang suggested is it.

        I just ran slowNoMove for 10 iterations in a row (5, followed by another 5 without clearing the table) with zero data loss. This really seems to suggest that it's related to sending a move for a region whose logs are currently being split.

        If that is indeed the case then I feel much better.

        Devaraj Das added a comment -

        Thinking about it a bit more, the lost (offlined) regions should be recovered when the master is restarted. So whenever the CM restarts the master we should get those lost regions back, since they would have entries in meta. The data loss is not permanent in that sense, but there are windows of time when the data is not accessible. This may or may not lead to the data loss issue that Elliott Clark is seeing. My test with the slowDeterministic monkey is still running. Let's see how that goes.

        Devaraj Das added a comment -

        With the patches from HBASE-9481, HBASE-9480, HBASE-9456, and the patch on this jira, I was finally able to get a successful run. I used the following command:
        hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -Dhbase.client.retries.number=3000 -monkey verySlow Loop 1 12 25000000 IntegrationTestBigLinkedList 12 &

        Note that I increased the retries from a default value of 30 to 3000. This is to make sure that the MR jobs ride over the CM chaos.
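
        The same setting can also be applied on a client Configuration programmatically instead of via -D on the command line; a minimal sketch, using the property name and value from the command above:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        public class RetriesConfigSketch {
          public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Default is 30; raise it so MR tasks retry through chaos-monkey restarts.
            conf.setInt("hbase.client.retries.number", 3000);
            System.out.println(conf.get("hbase.client.retries.number"));
          }
        }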

        Now I am running the same test with the slowDeterministic CM.

        Yeah, maybe the data loss was caused by the region losses, which are at least partially addressed by the HBASE-9481 and HBASE-9480 patches.

        Jimmy Xiang added a comment -

        I think I know the root cause now: assigning a region outside of SSH. A region server is dead; before SSH handles it and completes the log splitting, a CM action assigns a region from the dead server. That's why it doesn't happen all the time. Fortunately, that's not a scenario that normally happens on a real cluster.

        Let me check if we can prevent the master from assigning a dead server's region outside of SSH.
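
        A rough illustration of the kind of guard being described; the interface and method names below are made up for the sketch and are not the actual master API:

        public class AssignGuardSketch {
          interface ClusterView {
            boolean isServerDead(String serverName);
            boolean isLogSplittingDone(String serverName);
          }

          static boolean safeToAssign(ClusterView cluster, String lastKnownServer) {
            // If the region's last server is dead and its WALs are not fully split,
            // let SSH finish splitting and do the assignment itself; assigning the
            // region out of band here risks opening it before all edits are replayed.
            return !(cluster.isServerDead(lastKnownServer)
                && !cluster.isLogSplittingDone(lastKnownServer));
          }
        }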

        Jeffrey Zhong added a comment -

        Elliott Clark In the SlowDeterministic mode, we have several places doing region moves, such as MoveRandomRegionOfTableAction and MoveRegionsOfTableAction. In the region assignment code, if we try to move a region from a dead server (or a region is being closed and HMaster retries the unassign), the region will be offlined and we'll lose it (because it won't be re-assigned to any region server anymore). I think it's better to disable region moves (and the balancer as well) in the tests to isolate whether the data loss is caused by region movement combined with other factors.

        Elliott Clark added a comment - edited

        Here's the patch that I used to search for missed row keys.

        The row keys in the RowLogSearchJob are the ones that were referenced but not found.

        I found the keys in 5 different wal logs (so it doesn't seem like it's an issue with reading the wal log).

        This was then checked like this:

        hbase@a1805:/home/eclark$ hbase org.apache.hadoop.hbase.regionserver.wal.HLogPrettyPrinter -w "\\x8E<\\xF3\\xE1f\\x1Al\\x89d\\xD3\\xC1w5\\x9B\\x8FN" hdfs://a1805.halxg.cloudera.com:8020/hbase/oldWALs/a1809.halxg.cloudera.com%2C60020%2C1378540278223.1378540795093
        Row: \x8E<\xF3\xE1f\x1Al\x89d\xD3\xC1w5\x9B\x8FN
        SLF4J: Class path contains multiple SLF4J bindings.
        SLF4J: Found binding in [jar:file:/opt/hbase/jenkins-hbase-custom-branch-34/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: Found binding in [jar:file:/opt/hadoop/hadoop-2.1.0-beta/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
        SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
        2013-09-09 11:19:43,435 WARN  [main] conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
        Sequence 2846225 from region 8b3a9c2a567f1eca40d25859e9e56fa2 in table IntegrationTestBigLinkedList at write timestamp: Sat Sep 07 00:59:58 PDT 2013
          Action:
            row: \x8E<\xF3\xE1f\x1Al\x89d\xD3\xC1w5\x9B\x8FN
            column: meta:prev
            timestamp: Sat Sep 07 00:59:58 PDT 2013
          Action:
            row: \x8E<\xF3\xE1f\x1Al\x89d\xD3\xC1w5\x9B\x8FN
            column: meta:count
            timestamp: Sat Sep 07 00:59:58 PDT 2013
          Action:
            row: \x8E<\xF3\xE1f\x1Al\x89d\xD3\xC1w5\x9B\x8FN
            column: meta:client
            timestamp: Sat Sep 07 00:59:58 PDT 2013
        
        Elliott Clark added a comment -

        Sergey Shelukhin Here's an illustration of what I'm seeing:

        hbase(main):002:0> get 'IntegrationTestBigLinkedList', "\x87\x0Bedk\xD4Y_OJ\xC6;\x92\xEFJ!"
        COLUMN                                              CELL                                                                                                                                                  
        0 row(s) in 0.0920 seconds
        
        hbase(main):003:0> get 'IntegrationTestBigLinkedList', "\xF9\xE6~\xE7Bkj\xF9\xC2 \xAE\xA2<\x16\x98\x7F"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378524691149, value=Job: job_1378337631389_0046 Task: attempt_1378337631389_0046_m_000007_0                                                
         meta:count                                         timestamp=1378524691149, value=\x00\x00\x00\x00\x01|\xBA\xDB                                                                                          
         meta:prev                                          timestamp=1378524691149, value=\x87\x0Bedk\xD4Y_OJ\xC6;\x92\xEFJ!                                                                                     
        3 row(s) in 0.1540 seconds
        
        

        The first row is the missing (undefined) row; the second one is the dangling pointer.

        I'm searching through wal logs now for the missing rows.

        Enis Soztutar added a comment -

        Thanks, I am running the tests now with CM as well. Let's see whether I can repro.

        Elliott Clark added a comment -

        This shows num nodes per mapper 2.5M, not 25M.

        Yeah that was a typo. It's corrected farther down the page. https://issues.apache.org/jira/browse/HBASE-9338?focusedCommentId=13756918&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13756918

        Here are the exact commands run for every test:

        export SSH_OPTS="-o IdentityFile=id_rsa -o StrictHostKeyChecking=no"
        ssh $SSH_OPTS hbase@$TEST_HOST -x "echo 'disable \"IntegrationTestBigLinkedList\"' | hbase shell"
        ssh $SSH_OPTS hbase@$TEST_HOST -x "echo 'drop \"IntegrationTestBigLinkedList\"' | hbase shell"
        ssh $SSH_OPTS hbase@$TEST_HOST -x "hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey ${monkey} Loop 5 12 25000000 IntegrationTestBigLinkedList 12"
        

        TEST_HOST is the box that all IT tests are run from for this cluster. It's not part of the hdfs, mr, or hbase cluster, but it does have all the software installed.

        monkey is parameterized. The slowDeterministic fails.

        hbck is run if the above fails.

        Enis Soztutar added a comment -

        hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList Loop 5 12 2500000 IntegrationTestBigLinkedList 10

        This shows num nodes per mapper as 2.5M, not 25M. Since the default wrap size is 1M, it might cause unreferenced rows. Though it should not cause undefineds, and only a low number of unreferenceds.
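
        A quick worked check of that arithmetic, assuming the stated 1M default wrap size:

        public class WrapSizeCheck {
          public static void main(String[] args) {
            long nodesPerMapper = 2500000L; // from the command quoted above
            long wrapSize = 1000000L;       // stated default wrap size
            long leftover = nodesPerMapper % wrapSize; // 500000 nodes left in an unclosed chain
            System.out.println("nodes not wrapped into a closed loop: " + leftover);
            // An unclosed chain can leave rows unreferenced, but every written row still
            // exists, so by itself this should not produce UNDEFINED rows.
          }
        }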

        Elliott Clark added a comment -

        Yes, they are all real data loss. I've run the verify step multiple times and I never get the data. hbck comes back clean, and a get of the missing rows returns nothing.

        I've started a run that will keep all of the data in the wal logs, and I'll see if the data made it there.

        Sergey Shelukhin added a comment -

        There might be several types of failures. Are all of these real data loss? For the last one it looks like something I saw on 0.94 (HBASE-8935), but there it was scanner issues during the original verify, which went away on a verify rerun.

        Hide
        Elliott Clark added a comment - - edited

        What's happened

        When the new Hadoop release was in RC and we were getting close to a 0.96 release of our own I changed all of the IT tests to run for longer. When I did this IntegrationTestBigLinkedList and IntegrationTestLoadAndVerify started failing. I at first thought this was due to 2.1.0 but now I'm pretty sure it was just a red herring as the tests pass with a shorter run time. So I started creating new chaos monkeys with different actions turned on and off. Doing that I found that Online Schema Change was what was causing Test Load and Verify to fail. So that was disabled in 0.96.

        Then I started working on Big Linked List. That one has been much harder.

        Observations

        The calm monkey always passes. The unbalance one almost always passes. When it doesn't, the failure is usually due to the assignment manager getting stuck. There appears to be no data loss then.

        Lots of the failures are like this one:

        	Map-Reduce Framework
        		Map input records=599999660
        		Map output records=1199999320
        		Map output bytes=38999977900
        		Map output materialized bytes=41399980716
        		Input split bytes=7360
        		Combine input records=0
        		Combine output records=0
        		Reduce input groups=600000000
        		Reduce shuffle bytes=41399980716
        		Reduce input records=1199999320
        		Reduce output records=680
        		Spilled Records=3576554634
        		Shuffled Maps =696
        		Failed Shuffles=0
        		Merged Map outputs=696
        		GC time elapsed (ms)=96902
        		CPU time spent (ms)=14269000
        		Physical memory (bytes) snapshot=86934548480
        		Virtual memory (bytes) snapshot=157839667200
        		Total committed heap usage (bytes)=86516498432
        	HBase Counters
        		BYTES_IN_REMOTE_RESULTS=29267422740
        		BYTES_IN_RESULTS=35999982060
        		MILLIS_BETWEEN_NEXTS=15450135
        		NOT_SERVING_REGION_EXCEPTION=27
        		NUM_SCANNER_RESTARTS=0
        		REGIONS_SCANNED=100
        		REMOTE_RPC_CALLS=4878206
        		REMOTE_RPC_RETRIES=90
        		RPC_CALLS=6000370
        		RPC_RETRIES=103
        	Shuffle Errors
        		BAD_ID=0
        		CONNECTION=0
        		IO_ERROR=0
        		WRONG_LENGTH=0
        		WRONG_MAP=0
        		WRONG_REDUCE=0
        	org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
        		REFERENCED=599999320
        		UNDEFINED=340
        		UNREFERENCED=340
        

        The number of map input records is too low; it should always be a multiple of 30M. The number of unreferenced rows is equal to the number of undefined rows, and if you add that number to the number of map input records then all rows are accounted for. That suggests that for this job there are lots of places where we are missing exactly one link in the chain.
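
        A quick worked check of that accounting, using the counters from the run above (plain shell arithmetic, nothing HBase-specific):

        # map input records + UNDEFINED should equal the expected node count (a multiple of 30M)
        echo $(( 599999660 + 340 ))         # 600000000
        # equivalently: REFERENCED + UNREFERENCED + UNDEFINED should also equal the expected count
        echo $(( 599999320 + 340 + 340 ))   # 600000000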

        However, that pattern isn't 100%. There are other tests where the number of unreferenced nodes is not equal to the number of undefined rows, so those test runs have some places in the chain where more than one link is missing. To me this suggests that it's not an off-by-one error and that the problem is on the client side.

        Things I've tried

        • Tried to see if it's just a badly written test
          • The test is hard to follow and could be cleaned up
          • But its logic seems solid.
        • I thought maybe YARN was the one messing up our data, so I changed the output keys to match what Load Test and Verify was doing
          • This didn't seem to have any real impact once Online Schema Change was turned off.
        • Tried to see if one map task was failing. So I changed the job to log more and to use a better client column.
          • It's not just one map task that has failures
          • The missing data was always written by the last generation job.
        • Then I tried to see if it's just the beginning or the end of the linked list that is missing data
          • It's not. There is missing data everywhere in the chain.
        • Running Verify step with no monkey
          • This failed suggesting that we are really losing data
        • Then I thought it might be the same bug that was causing replication to lose edits. So I re-ran the tests before and after JD's HLog reader change.
          • Tests fail with JD's patch
        • Then I tried running the test with a Monkey that kills RS and Master
          • This passed
          • I even ran this with more loop iterations and it still passed
        • I thought maybe it was region merges. So I turned off merging regions
          • The tests still failed.
        • So I ran a job where inside of the generation job I run a get right after the puts.
          • This was SLOW.
          • Like really really really slow. 10x slower. What has happened to our get speed?
          • It failed. To me this suggested that this was related to the client. My thought was that the client was silently failing a put.
        • Then I thought that it was possible that the gets were just arriving at the region servers after the data loss had happened.
        • So I moved the get directly after the puts.
          • This required flushing a lot more. So it was slow.
          • The first two iterations passed, but it would have taken 3 days to finish all five iterations, so I stopped the run.
          • Everything was passing though
        • Next I thought it could be a JVM issue. 1.7.15 had been a little flakey on me.
        • So I upgraded to 1.7.25.
          • This had no effect.
        • Next run I started looking at where the data was lost.
          • All the data I lost in my last run was in the same region.
          • All of it was around the same time.
          • Different mappers.

        Sample Dangling Pointers

        => #<#<Class:0x2123dde6>:0x5f1840c3>
        hbase(main):002:0> get 'IntegrationTestBigLinkedList', "Y7\xEC\xF6w\xDB\x92\x88L\x96\x90\xA0\x17\xCE\xB9\x92"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465754975, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000007_0                                                
         meta:count                                         timestamp=1378465754975, value=\x00\x00\x00\x00\x01:{\xE9                                                                                             
         meta:prev                                          timestamp=1378465754975, value=\xAE\xE1\x9C\xDAY~",\x08\xE2L\xB8\x06\x13\x81H                                                                         
        3 row(s) in 0.0920 seconds
        
        hbase(main):003:0> get 'IntegrationTestBigLinkedList', "\xB0\x85\x1D\xF4\x15%\xA2\xB2D>\x94\xE5\x9D\xE3\x9B\xA4"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465679500, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000001_0                                                
         meta:count                                         timestamp=1378465679500, value=\x00\x00\x00\x00\x01_\x97z                                                                                             
         meta:prev                                          timestamp=1378465679500, value=\xAE\xDE\x05\xFDVT\xA2p\x08L\x15\x93+\xEB\xBD\x19                                                                      
        3 row(s) in 0.0940 seconds
        
        hbase(main):004:0> get 'IntegrationTestBigLinkedList', "s\xC3Z\x8B\xAD*\xEF\xF3\x1C\xDAa\x5C#\xE29\x94"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465746735, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0                                                
         meta:count                                         timestamp=1378465746735, value=\x00\x00\x00\x00\x01Z:\xEB                                                                                             
         meta:prev                                          timestamp=1378465746735, value=\xAE\xD5GcLOJ\xBD\x1B\xC2\xEE\xDD\x02\xBCi\x0A                                                                         
        3 row(s) in 0.1850 seconds
        
        hbase(main):005:0> get 'IntegrationTestBigLinkedList', "T\x91$\xE0\xB3~\xB6\x88\x03\x9Df\x99\xE1\xA7qX"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465679216, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0                                                
         meta:count                                         timestamp=1378465679216, value=\x00\x00\x00\x00\x01P\xC1!                                                                                             
         meta:prev                                          timestamp=1378465679216, value=\xAE\xCE\xA8\xCDX\x8A\xC7H(t\xBAx\xEF|\x0E\xCD                                                                         
        3 row(s) in 0.1400 seconds
        
        hbase(main):006:0> get 'IntegrationTestBigLinkedList', "\xB1\xC3+h\xEE\xE0{\x03p\x0C\x83\x87L\x96\xA1\xBB"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465684324, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000003_0                                                
         meta:count                                         timestamp=1378465684324, value=\x00\x00\x00\x00\x01b\x12\xCA                                                                                          
         meta:prev                                          timestamp=1378465684324, value=\xAE\xC5\x17H[\xD2o\x9C\xF5\xE0\x8F\xB87\x81\xA7C                                                                      
        3 row(s) in 0.0730 seconds
        
        hbase(main):007:0> get 'IntegrationTestBigLinkedList', "|@\x1AL\x17\x9E\xF2s\xD8\xD3\x07%\x11l\xDA\x83"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465730726, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000000_0                                                
         meta:count                                         timestamp=1378465730726, value=\x00\x00\x00\x00\x01R\x99\x0D                                                                                          
         meta:prev                                          timestamp=1378465730726, value=\xAD\xE10\x10t\xA9\x0D+]\xFC\xA4\xD3\x9A`\x06\xD6                                                                      
        3 row(s) in 0.1530 seconds
        
        hbase(main):008:0> get 'IntegrationTestBigLinkedList', "|0\xFF8/\x86p ^V\xA4\xD0@\x0A\x0E\xC1"
        COLUMN                                              CELL                                                                                                                                                  
         meta:client                                        timestamp=1378465752879, value=Job: job_1378337631389_0032 Task: attempt_1378337631389_0032_m_000002_0                                                
         meta:count                                         timestamp=1378465752879, value=\x00\x00\x00\x00\x01Z\xC8%                                                                                             
         meta:prev                                          timestamp=1378465752879, value=\xAE\x94)\x93+\xDEa{_\x81\x9Ft,\x9B^\xB6                                                                               
        3 row(s) in 0.0780 seconds
        

        All of the missing rows start with "\xAE" or "\xAD"
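
        For what it's worth, a hedged sketch of how one might spot-check that key range from the shell (row boundaries chosen to bracket the \xAD/\xAE prefixes; this is an illustration, not the exact command used here):

        echo 'scan "IntegrationTestBigLinkedList", {STARTROW => "\xAD", STOPROW => "\xAF", LIMIT => 10}' | hbase shell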

        hbase(main):002:0> java.util.Date.new(1378465752879).toString
        => "Fri Sep 06 04:09:12 PDT 2013"
        hbase(main):003:0> java.util.Date.new(1378465730726).toString
        => "Fri Sep 06 04:08:50 PDT 2013"
        hbase(main):004:0> java.util.Date.new(1378465684324).toString
        => "Fri Sep 06 04:08:04 PDT 2013"
        hbase(main):005:0> java.util.Date.new(1378465679216).toString
        => "Fri Sep 06 04:07:59 PDT 2013"
        hbase(main):006:0> java.util.Date.new(1378465746735).toString
        => "Fri Sep 06 04:09:06 PDT 2013"
        hbase(main):007:0> java.util.Date.new(1378465679500).toString
        => "Fri Sep 06 04:07:59 PDT 2013"
        hbase(main):008:0> java.util.Date.new(1378465754975).toString
        => "Fri Sep 06 04:09:14 PDT 2013"
        

        The dangling pointers are all written within 3 mins of each other.

        Side Note

        JDK7 seems to core dump a lot when starting up region servers. Once things are all set up it's stable. I need to turn on core dumps into a known folder so that Jenkins can archive them.
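
        A minimal sketch of what that might look like, assuming a Linux box and that the daemons are started from hbase-env.sh (the directory and pattern below are placeholders, not what is configured on this cluster):

        # Hedged sketch: collect cores in a known folder so Jenkins can archive them
        sudo mkdir -p /var/log/hbase/cores
        echo '/var/log/hbase/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
        # and in hbase-env.sh, allow the daemons to dump core at all:
        # ulimit -c unlimited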

        Hide
        stack added a comment -

        Rah-rah Elliott Clark!

        Hide
        Elliott Clark added a comment -

        Here's a patch with some test changes I have been using to try and debug this issue.

        Last night I ran a run with the restart-only chaos monkey. That passed.

        Then after that I tried to run slow deterministic with no merge region. That failed because of an assignment manager issue: there were 100 regions that were RIT for 4 hours. (I've seen this before and it's something we should look at, but probably not a release blocker.)

        Then this morning I started another slow deterministic with no merge and that failed. So this leads me to believe that the data loss has something to do with two different actions occurring at the same time. It also seems to suggest that region merge isn't the issue.

        Hide
        Elliott Clark added a comment -

        I'm running a test currently with just regionserver/masters kills and no merge. We'll see if that does anything.

        Hide
        stack added a comment -

        Thanks Elliott Clark. Agree this is a candidate to sink the RC.

        Hide
        Elliott Clark added a comment -

        Still seeing this 100% of the time on our test cluster.

        Running:

        hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey slowDeterministic Loop 5 12 25000000 IntegrationTestBigLinkedList 12
        

        Fails every time. Usually it will get through the first few iterations but it will fail before the 5th. Resulting in:

        	org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
        		REFERENCED=599327221
        		UNDEFINED=336384
        		UNREFERENCED=336384
        
        	org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
        		REFERENCED=899938580
        		UNDEFINED=29048
        		UNREFERENCED=32373
        

        Disabling online schema change made Load and Verify pass. However, there is still something wrong. I went through the MR output and the keys listed as undefined are really missing (getting them from the shell, they are never returned). So it looks like we are missing single links in the chain.

        I've modified the job to produce a more usable client column. I've used that to look at the logs for each attempt. Nothing seems too amiss in the logs. Can someone run the same tests with a high number of iterations (it seems this always passes on the first loop)?

        Hide
        Elliott Clark added a comment -

        Ok, so I'm pretty sure that I have it. Hadoop 2.1.0 was a red herring. I just noticed the failures around then because I had upped the test iterations in preparation for a 0.96 rc.

        Pretty sure the cause is Online Schema Change. I've been running lots of different tests, and whenever I leave those chaos monkey actions out, IntegrationTestLoadAndVerify passes. I'm running IntegrationTestBigLinkedList right now to make sure that is the issue for both. But I think there is some bug where data can be skipped or returned malformed while the change is in progress. So if the IntegrationTestBigLinkedList run passes (it takes about 8 hours with the number of iterations we run on the cluster) then I think we should disable Online Schema Change for 0.96.

        I know I was the one that enabled it, but I think this will help with stability for the upcoming release.

        Hide
        Elliott Clark added a comment -

        I'm deploying a more stock config (without any of the changed rpc/ipc timeouts). Let's see if that helps.

        Hide
        Elliott Clark added a comment -

        When you have a moment, please explain this a little more. It's not apparent from the jira...

        Done on that issue, so that it should be searchable later on.

        Hide
        Devaraj Das added a comment -

        Word of caution: If you don't have HBASE-9253 then you are testing 1/100th of what is intended.

        When you have a moment, please explain this a little more. It's not apparent from the jira...

        Hide
        Elliott Clark added a comment -

        These failures are consistent when the ChaosMonkey is set to slowDeterministic. If it's turned off then the tests succeed.

        Here's an example:

        	Shuffle Errors
        		BAD_ID=0
        		CONNECTION=0
        		IO_ERROR=0
        		WRONG_LENGTH=0
        		WRONG_MAP=0
        		WRONG_REDUCE=0
        	org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
        		REFERENCED=1199999908
        		UNDEFINED=46
        		UNREFERENCED=46
        	File Input Format Counters 
        		Bytes Read=0
        	File Output Format Counters 
        		Bytes Written=6866
        2013-08-27 04:56:16,792 ERROR [main] test.IntegrationTestBigLinkedList$Verify: Expected referenced count does not match with actual referenced count. expected referenced=1200000000 ,actual=1199999908
        2013-08-27 04:56:16,792 ERROR [main] test.IntegrationTestBigLinkedList$Verify: Unreferenced nodes were not expected. Unreferenced count=46
        2013-08-27 04:56:16,792 ERROR [main] test.IntegrationTestBigLinkedList$Verify: Found an undefined node. Undefined count=46
        2013-08-27 04:57:00,129 INFO  [main] hbase.ClusterManager: Executing remote command: ps aux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 , hostname:a1806.halxg.cloudera.com
        2013-08-27 04:57:00,130 INFO  [main] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1806.halxg.cloudera.com "ps aux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2"]
        2013-08-27 04:57:00,945 INFO  [main] hbase.ClusterManager: Executed remote command, exit code:0 , output:
        2013-08-27 04:57:00,945 INFO  [main] hbase.HBaseCluster: Starting RS on: a1806.halxg.cloudera.com
        2013-08-27 04:57:00,945 INFO  [main] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh  start regionserver , hostname:a1806.halxg.cloudera.com
        2013-08-27 04:57:00,946 INFO  [main] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1806.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh  start regionserver"]
        2013-08-27 04:57:02,269 INFO  [main] hbase.ClusterManager: Executed remote command, exit code:0 , output:starting regionserver, logging to /opt/hbase/current/bin/../logs/hbase-hbase-regionserver-a1806.halxg.cloudera.com.out
        
        2013-08-27 04:57:02,270 INFO  [main] hbase.HBaseCluster: Added new HBaseAdmin
        2013-08-27 04:57:02,271 ERROR [main] util.AbstractHBaseTool: Error running command-line tool
        java.lang.RuntimeException: Verify.verify failed
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Loop.runVerify(IntegrationTestBigLinkedList.java:727)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Loop.run(IntegrationTestBigLinkedList.java:760)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList.runTestFromCommandLine(IntegrationTestBigLinkedList.java:1059)
        	at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:77)
        	at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList.main(IntegrationTestBigLinkedList.java:1094)
        Build step 'Execute shell' marked build as failure
        

        Word of caution: If you don't have HBASE-9253 then you are testing 1/100th of what is intended.

        Hide
        Devaraj Das added a comment -

        Just FYI, in our internal QA infra we run these tests as well and they run successfully on top of 2.1 (not sure on the exact rev of the 2.1 branch). In talking about this problem with one of our QA engineers, we were wondering if HBASE-9253 could have caused any regression (we don't have that patch in our builds).

        Hide
        Devaraj Das added a comment -

        Elliott, are these failures consistent (every run fails)?

        Hide
        Elliott Clark added a comment -

        I'm also seeing Test Load and Verify fail (only with a slowDeterministic chaos monkey). The hard part is that these tests take 3 hours or more per run.

        Hide
        Elliott Clark added a comment -

        Unreferenced. I'm running again on top of the tip of 0.95.

        Hide
        Sergey Shelukhin added a comment -

        Do the longer runs have similar numbers (a lot of unreferenced, no undefined)?
        These could be two separate issues; we saw an issue related to scanner timeout when reading on 0.94, but that one had undefined items in the results.

        Hide
        Elliott Clark added a comment -

        Hmm that might not be the issue. Longer running tests are seeing the same thing again.... Blah.

        Hide
        stack added a comment -

        +1

        Hide
        Elliott Clark added a comment -

        Looks like Hadoop has changed the way it serializes a null byte array. Changed the test to use the empty byte array constant instead.

        Hide
        Elliott Clark added a comment -

        Here's the command run:

        hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList Loop 5 12 2500000 IntegrationTestBigLinkedList 10
        

        And the result:

        	org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Verify$Counts
        		REFERENCED=11520689
        		UNREFERENCED=11520770
        	File Input Format Counters 
        		Bytes Read=0
        	File Output Format Counters 
        		Bytes Written=972064362
        2013-08-26 11:48:03,060 INFO  [main] hbase.HBaseCluster: Added new HBaseAdmin
        2013-08-26 11:48:03,061 ERROR [main] util.AbstractHBaseTool: Error running command-line tool
        java.lang.RuntimeException: Verify.run failed with return code: 1
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Loop.runVerify(IntegrationTestBigLinkedList.java:726)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Loop.run(IntegrationTestBigLinkedList.java:764)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList.runTestFromCommandLine(IntegrationTestBigLinkedList.java:1063)
        	at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:77)
        	at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        	at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList.main(IntegrationTestBigLinkedList.java:1098)
        Hide
        Elliott Clark added a comment -

        Configs on the cluster:

        core-site.xml

        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        
        <!-- Put site-specific property overrides in this file. -->
        
        <configuration>
        
          <property>
            <name>fs.default.name</name>
            <value>hdfs://${MASTER_HOSTNAME}:8020/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir0</name>
            <value>/data0/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir1</name>
            <value>/data1/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir2</name>
            <value>/data2/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir3</name>
            <value>/data3/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir4</name>
            <value>/data4/</value>
          </property>
        
          <property>
            <name>hadoop.data.dir5</name>
            <value>/data5/</value>
          </property>
        
          <property>
            <name>mapred.temp.dir</name>
            <value>${hadoop.data.dir1}/mapred/temp</value>
            <description>A shared directory for temporary files.
            </description>
          </property>
        
          <property>
            <name>ipc.client.connect.timeout</name>
            <value>1000</value>
          </property>
        
          <property>
            <name>ipc.client.connect.max.retries.on.timeouts</name>
            <value>2</value>
          </property>
        
          <property>
            <name>dfs.socket.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>dfs.socket.write.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>ipc.ping.interval</name>
            <value>20000</value>
          </property>
        
          <property>
            <name>io.file.buffer.size</name>
            <value>65536</value>
          </property>
        
          <property>
            <name>ipc.client.connect.max.retries</name>
            <value>10</value>
          </property>
        
          <property>
            <name>ipc.client.tcpnodelay</name>
            <value>true</value>
          </property>
        
          <property>
            <name>ipc.server.tcpnodelay</name>
            <value>true</value>
          </property>
        
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/var/lib/hadoop/dn_socket._PORT</value>
          </property>
          
          <property>
            <name>dfs.client.read.shortcircuit</name>
            <value>true</value>
          </property>
        
          <property>
            <name>dfs.client.file-block-storage-locations.timeout</name>
            <value>3000</value>
          </property>
        
          <property>
             <name>dfs.client.read.shortcircuit.skip.checksum</name>
             <value>true</value>
           </property>
        
        </configuration>
        

        hdfs-site.xml

        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
        
          <property>
            <name>dfs.datanode.handler.count</name>
            <!-- default 10 -->
            <value>32</value>
            <description>The number of server threads for the
            datanode.</description>
          </property>
        
          <property>
            <name>dfs.namenode.handler.count</name>
            <!-- default 10 -->
            <value>32</value>
            <description>The number of server threads for the
            namenode.</description>
          </property>
        
          <property>
            <name>dfs.block.size</name>
            <value>134217728</value>
            <description>The default block size for new files.</description>
          </property>
        
          <property>
            <name>dfs.datanode.max.xcievers</name>
            <value>4098</value>
          </property>
        
          <property>
            <name>dfs.namenode.replication.interval</name>
            <value>15</value>
          </property>
        
          <property>
            <name>dfs.balance.bandwidthPerSec</name>
            <value>10485760</value>
          </property>
        
          <property>
            <name>fs.checkpoint.dir</name>
            <value>${hadoop.data.dir1}/dfs/namesecondary</value>
          </property>
        
          <property>
            <name>dfs.name.dir</name>
            <value>${hadoop.data.dir0}/dfs/name</value>
          </property>
        
          <property>
            <name>dfs.data.dir</name>
            <value>${hadoop.data.dir0}/dfs/data,${hadoop.data.dir1}/dfs/data,${hadoop.data.dir2}/dfs/data,${hadoop.data.dir3}/dfs/data,${hadoop.data.dir4}/dfs/data,${hadoop.data.dir5}/dfs/data</value>
          </property>
        
          <property>
            <name>dfs.datanode.socket.write.timeout</name>
            <value>10000</value>
          </property>
        
          <property>
            <name>ipc.client.connect.timeout</name>
            <value>1000</value>
          </property>
        
          <property>
            <name>ipc.client.connect.max.retries.on.timeouts</name>
            <value>2</value>
          </property>
        
          <property>
            <name>dfs.socket.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>dfs.socket.write.timeout</name>
            <value>5000</value>
          </property>
        
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/var/lib/hadoop/dn_socket._PORT</value>
          </property>
        
          <property>
            <name>dfs.block.local-path-access.user</name>
            <value>hbase</value>
          </property>
        
          <property>
             <name>dfs.client.read.shortcircuit.skip.checksum</name>
             <value>true</value>
           </property>
        
          <property>
            <name>dfs.client.file-block-storage-locations.timeout</name>
            <value>3000</value>
          </property>
        
        </configuration>
        

        yarn-site.xml:

        <?xml version="1.0"?>
        <configuration>
          <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce.shuffle</value>
          </property>
          <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
          </property>
          <property>
            <name>yarn.resourcemanager.address</name>
            <value>${MASTER_HOSTNAME}:10040</value>
            <description>In Server specified the port that Resource Manager will runn on. In client is used for connecting to Resource Manager</description>
          </property>
          <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>${MASTER_HOSTNAME}:8025</value>
            <description>Utilized by Node Manager for communication with Resource Manager</description>
          </property>
          <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>${MASTER_HOSTNAME}:8030</value>
             <description>Utilized by Application Masters to communicate with Resource Manager; in our case for MRAppMaster (MapReduce Application Master) to communicate with Resource Manager</description>
          </property>
          <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>${MASTER_HOSTNAME}:8141</value>
             <description>Utilized by administrative clients ($yarn rmadmin) to communicate with Resource Manager</description>
          </property>
        
          <property>
             <name>yarn.nodemanager.local-dirs</name>
             <value>${hadoop.data.dir0}/mapred/nodemanager,${hadoop.data.dir1}/mapred/nodemanager,${hadoop.data.dir2}/mapred/nodemanager,${hadoop.data.dir3}/mapred/nodemanager,${hadoop.data.dir4}/mapred/nodemanager,${hadoop.data.dir5}/mapred/nodemanager</value>
             <final>true</final>
             <description>Comma separated list of directories, where local data is persisted by Node Manager</description>
          </property>
        
          <property>
             <name>yarn.nodemanager.aux-services</name>
             <value>mapreduce.shuffle</value>
             <description>Long running service which executes on Node Manager(s) and provides MapReduce Sort and Shuffle functionality.</description>
          </property>
        
          <property>
             <name>yarn.log-aggregation-enable</name>
             <value>true</value>
            <description>Enable log aggregation so application logs are moved onto hdfs and are viewable via web ui after the application completed. The default location on hdfs is '/log' and can be changed via yarn.nodemanager.remote-app-log-dir property</description>
          </property>
        
         <property>
             <name>hadoop.security.authorization</name>
             <value>false</value>
             <description>Disable authorization fo development and clusters that do not require security</description>
           </property>
        
          <property>
            <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>9000</value>
          </property>
        </configuration>
        

        hbase-site.xml:

        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
          <property>
            <name>hbase.zookeeper.quorum</name>
            <value>${MASTER_HOSTNAME}</value>
            <description>The directory shared by RegionServers.
            </description>
          </property>
        
          <property>
            <name>hbase.rootdir</name>
            <value>hdfs://${MASTER_HOSTNAME}:8020/hbase</value>
          </property>
        
          <property>
            <name>hbase.cluster.distributed</name>
            <value>true</value>
          </property>
        
          <property>
            <name>dfs.client.read.shortcircuit</name>
            <value>true</value>
          </property>
        
          <property>
            <name>dfs.block.local-path-access.user</name>
            <value>hbase</value>
          </property>
        
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/var/lib/hadoop/dn_socket._PORT</value>
          </property>
        
          <property>
             <name>dfs.client.read.shortcircuit.skip.checksum</name>
             <value>true</value>
           </property>
        
           <property>
             <name>hbase.IntegrationTestDataIngestSlowDeterministic.runtime</name>
             <value>3600000</value>
           </property>
        </configuration>
        
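        The hdfs-site.xml and hbase-site.xml above turn on HDFS short-circuit reads (dfs.client.read.shortcircuit, dfs.domain.socket.path, and the skip-checksum flag). As a quick sanity check that a client actually picks these values up, a minimal sketch along the following lines can be used; the class name is hypothetical and it assumes the site files above are on the client classpath:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;

        // Hypothetical helper, not part of the patches on this issue.
        public class ShortCircuitCheck {
          public static void main(String[] args) {
            // Layers hbase-site.xml (plus hbase-default.xml and whatever Hadoop
            // site files are on the classpath) into a single Configuration.
            Configuration conf = HBaseConfiguration.create();

            // Keys mirror the site files above; the second argument is only a fallback.
            boolean scr = conf.getBoolean("dfs.client.read.shortcircuit", false);
            String socket = conf.get("dfs.domain.socket.path", "<unset>");
            boolean skipChecksum =
                conf.getBoolean("dfs.client.read.shortcircuit.skip.checksum", false);

            System.out.println("short-circuit reads enabled : " + scr);
            System.out.println("domain socket path          : " + socket);
            System.out.println("skip checksum on local reads: " + skipChecksum);
          }
        }

        If the domain socket path is missing on the datanodes, the client silently falls back to normal remote reads, so a check like this mainly rules out one common source of unexpectedly slow reads.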
        stack added a comment -

        This is critical if not a blocker I'd say.


          People

          • Assignee: Elliott Clark
          • Reporter: Elliott Clark
          • Votes: 0
          • Watchers: 16

          Dates

          • Created:
          • Updated:
          • Resolved:
