Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.5.1
    • Fix Version/s: 1.5.2, 1.6.1, 1.7.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      accumulo-1.5.0 with sqrrl extensions.

      Description

      Working with John Vines on this one...

      Setup: Created a couple tables, added some data, then dropped them. The drop hangs and !METADATA (which has ~400 entries) is scanned in what looks like an infinite loop.

      The table being dropped loks like this in !METADATA:

      root@sqrrl> scan -b 3 -e 5 -t !METADATA
      4;\x00\x00\x06f srv:dir []    /t-00000b6
      4;\x00\x00\x06f srv:lock []    tservers/10.10.1.209:9997/zlock-0000000000$144274cd317000b
      4;\x00\x00\x06f srv:time []    M0
      4;\x00\x00\x06f ~tab:~pr []    \x00
      4;\x00\x00\x0C\xCC loc:144274cd3170008 []    10.10.1.107:9997
      4;\x00\x00\x0C\xCC srv:dir []    /t-00000bj
      4;\x00\x00\x0C\xCC srv:lock []    tservers/10.10.1.209:9997/zlock-0000000000$144274cd317000b
      4;\x00\x00\x0C\xCC srv:time []    M0
      4;\x00\x00\x0C\xCC ~tab:~pr []    \x01\x00\x00\x06f
      4;\x00\x00\x133 srv:dir []    /t-000002h
      4;\x00\x00\x133 srv:lock []    tservers/10.10.1.209:9997/zlock-0000000000$144274cd317000b
      4;\x00\x00\x133 srv:time []    M0
      4;\x00\x00\x133 ~tab:~pr []    \x01\x00\x00\x0C\xCC
      

      We think this may be the relevant message in the master debug logs:

      2014-02-12 19:13:31,397 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,459 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,524 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,588 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,662 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,725 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,788 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,854 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      2014-02-12 19:13:31,917 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4 saw inconsistencynull 4;^@^@^L?;^@^@^Ff
      ...etc
      

      Graceful accumulo reboot hangs.

      Hard reboot of everything (control-c'd) clears the problem.

      1. jstack3.txt
        188 kB
        Chris McCubbin
      2. jstack1.txt
        153 kB
        Chris McCubbin
      3. jstack2.txt
        165 kB
        Chris McCubbin
      4. masterJstack.txt
        42 kB
        Chris McCubbin
      5. masterJstack2.txt
        39 kB
        Chris McCubbin
      6. masterJstack3.txt
        43 kB
        Chris McCubbin
      7. Screen Shot 2014-02-12 at 2.28.09 PM.png
        90 kB
        Chris McCubbin
      8. Screen Shot 2014-02-12 at 2.27.11 PM.png
        86 kB
        Chris McCubbin
      9. short.debug.log
        21 kB
        Chris McCubbin
      10. AccumuloHammerTableCreation.java
        6 kB
        John Vines
      11. AccumuloHammerTableCreation.java
        5 kB
        John Vines
      12. TableCreationHammerIT.java
        7 kB
        Eric Newton

        Issue Links

          Activity

          Hide
          Eric Newton added a comment -

          Translated John Vines description into an Integration Test, which fails consistently for me under 1.6.0-SNAPSHOT.

          Show
          Eric Newton added a comment - Translated John Vines description into an Integration Test, which fails consistently for me under 1.6.0-SNAPSHOT.
          Hide
          John Vines added a comment -

          Easier method, with this code. While the tables are being created, I ran
          kill -9 TSERVERPID; bin/start-here.sh; bin/accumulo shell -u root -p secret -e "droptable -p TABLE.* -f"

          After a few iterations (5), it broke the same way.

          Show
          John Vines added a comment - Easier method, with this code. While the tables are being created, I ran kill -9 TSERVERPID; bin/start-here.sh; bin/accumulo shell -u root -p secret -e "droptable -p TABLE.* -f" After a few iterations (5), it broke the same way.
          Hide
          John Vines added a comment -

          After about 3 hours against 1.5.1, the following code reproduced the issue. This was after a few hundred messages about overlapping ranges. This finally happened when the tserver died, so I killed this process, restarted the tserver, and attempted to drop all of the TABLE tables from the shell.

          Show
          John Vines added a comment - After about 3 hours against 1.5.1, the following code reproduced the issue. This was after a few hundred messages about overlapping ranges. This finally happened when the tserver died, so I killed this process, restarted the tserver, and attempted to drop all of the TABLE tables from the shell.
          Hide
          Eric Newton added a comment -

          Your extensions may not be causing the problem. Please consider writing a test that demonstrates that it happens without 3rd party code.

          Show
          Eric Newton added a comment - Your extensions may not be causing the problem. Please consider writing a test that demonstrates that it happens without 3rd party code.
          Hide
          John Vines added a comment -

          https://issues.apache.org/jira/browse/ACCUMULO-2489 was spotted while grabbing debug information for this. This could be related to the mystery tablet that the master was trying to unassign?

          Show
          John Vines added a comment - https://issues.apache.org/jira/browse/ACCUMULO-2489 was spotted while grabbing debug information for this. This could be related to the mystery tablet that the master was trying to unassign?
          Hide
          John Vines added a comment -

          So I'm seeing this again, I think. It is still with sqrrl extensions, but I do believe it's irrelevant. Critical items to note-

          Repeatedly seeing

          2014-03-17 11:13:50,784 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@�" 

          Those are actually the same character-

          root@ip-10-10-1-20:/data0/logs/accumulo# tail tserver_ip-10-10-1-20.debug.log | grep -a "not being served" | tail -n 1 | hexdump -C
          00000000  32 30 31 34 2d 30 33 2d  31 37 20 31 36 3a 34 36  |2014-03-17 16:46|
          00000010  3a 35 35 2c 34 31 31 20  5b 74 61 62 6c 65 74 73  |:55,411 [tablets|
          00000020  65 72 76 65 72 2e 54 61  62 6c 65 74 53 65 72 76  |erver.TabletServ|
          00000030  65 72 5d 20 49 4e 46 4f  20 3a 20 74 6f 6c 64 20  |er] INFO : told |
          00000040  74 6f 20 75 6e 6c 6f 61  64 20 74 61 62 6c 65 74  |to unload tablet|
          00000050  20 74 68 61 74 20 77 61  73 20 6e 6f 74 20 62 65  | that was not be|
          00000060  69 6e 67 20 73 65 72 76  65 64 20 33 74 78 3b 00  |ing served 3tx;.|
          00000070  00 ef bf bd 3b 00 00 ef  bf bd 0a                 |....;......|
          0000007b
          

          So it's trying to use a key extent with \x00\x00\xef\\xbf\bd as both the end row and prev end row, which is weird enough and MAY be relevent (side question - how is this happening?) and that there's no sign of this \x00\x00\xEF\xBF\xBD tablet in the !METADATA table. - metadata table view-

          root@accumulo !METADATA> scan -b 3tx -e 3txa -c ~tab -t !METADATA -np
          3tx;\x00\x00\x06f ~tab:~pr []    \x00
          3tx;\x00\x00\x0C\xCC ~tab:~pr []    \x01\x00\x00\x06f
          3tx;\x00\x00\x133 ~tab:~pr []    \x01\x00\x00\x0C\xCC
          3tx;\x00\x00\x19\x99 ~tab:~pr []    \x01\x00\x00\x133
          3tx;\x00\x00  ~tab:~pr []    \x01\x00\x00\x19\x99
          3tx;\x00\x00&f ~tab:~pr []    \x01\x00\x00 
          3tx;\x00\x00,\xCC ~tab:~pr []    \x01\x00\x00&f
          3tx;\x00\x0033 ~tab:~pr []    \x01\x00\x00,\xCC
          3tx;\x00\x009\x99 ~tab:~pr []    \x01\x00\x0033
          3tx;\x00\x00@ ~tab:~pr []    \x01\x00\x009\x99
          3tx;\x00\x00Ff ~tab:~pr []    \x01\x00\x00@
          3tx;\x00\x00L\xCC ~tab:~pr []    \x01\x00\x00Ff
          3tx;\x00\x00S3 ~tab:~pr []    \x01\x00\x00L\xCC
          3tx;\x00\x00Y\x99 ~tab:~pr []    \x01\x00\x00S3
          3tx;\x00\x00` ~tab:~pr []    \x01\x00\x00Y\x99
          3tx;\x00\x00ff ~tab:~pr []    \x01\x00\x00`
          3tx;\x00\x00l\xCC ~tab:~pr []    \x01\x00\x00ff
          3tx;\x00\x00s3 ~tab:~pr []    \x01\x00\x00l\xCC
          3tx;\x00\x00y\x99 ~tab:~pr []    \x01\x00\x00s3
          3tx;\x00\x00\x80 ~tab:~pr []    \x01\x00\x00y\x99
          3tx;\x00\x00\x86f ~tab:~pr []    \x01\x00\x00\x80
          3tx;\x00\x00\x8C\xCC ~tab:~pr []    \x01\x00\x00\x86f
          3tx;\x00\x00\x933 ~tab:~pr []    \x01\x00\x00\x8C\xCC
          3tx;\x00\x00\x99\x99 ~tab:~pr []    \x01\x00\x00\x933
          3tx;\x00\x00\xA0 ~tab:~pr []    \x01\x00\x00\x99\x99
          3tx;\x00\x00\xA6f ~tab:~pr []    \x01\x00\x00\xA0
          3tx;\x00\x00\xAC\xCC ~tab:~pr []    \x01\x00\x00\xA6f
          3tx;\x00\x00\xB33 ~tab:~pr []    \x01\x00\x00\xAC\xCC
          3tx;\x00\x00\xB9\x99 ~tab:~pr []    \x01\x00\x00\xB33
          3tx;\x00\x00\xC0 ~tab:~pr []    \x01\x00\x00\xB9\x99
          3tx;\x00\x00\xC6f ~tab:~pr []    \x01\x00\x00\xC0
          3tx;\x00\x00\xCC\xCC ~tab:~pr []    \x01\x00\x00\xC6f
          3tx;\x00\x00\xD33 ~tab:~pr []    \x01\x00\x00\xCC\xCC
          3tx;\x00\x00\xD9\x99 ~tab:~pr []    \x01\x00\x00\xD33
          3tx;\x00\x00\xE0 ~tab:~pr []    \x01\x00\x00\xD9\x99
          3tx;\x00\x00\xE6f ~tab:~pr []    \x01\x00\x00\xE0
          3tx;\x00\x00\xEC\xCC ~tab:~pr []    \x01\x00\x00\xE6f
          3tx;\x00\x00\xF9\x99 ~tab:~pr []    \x01\x00\x00\xEC\xCC
          3tx;\x00\x01 ~tab:~pr []    \x01\x00\x00\xF9\x99
          3tx< ~tab:~pr []    \x01\x00\x01
          

          Tablet server logs-

          2014-03-15 03:25:40,038 [tabletserver.TabletServer] INFO : unloaded 3tx;^@^@�;^@^@ٙ
          2014-03-15 03:25:40,038 [tabletserver.Tablet] DEBUG: initiateClose(saveState=false queueMinC=false disableWrites=false) 3tx;^@^@�;^@^@�
          2014-03-15 03:25:40,038 [tabletserver.Tablet] DEBUG: completeClose(saveState=false completeClose=true) 3tx;^@^@�;^@^@�
          2014-03-15 03:25:40,038 [tabletserver.Tablet] TABLET_HIST: 3tx;^@^@�;^@^@� closed
          2014-03-15 03:25:40,038 [tabletserver.TabletServer] DEBUG: Unassigning 3tx;^@^@�;^@^@�@(null,10.10.1.20:9997[1449daf9ff50ddc],null)
          2014-03-15 03:25:40,042 [tabletserver.TabletServer] DEBUG: MultiScanSess 10.10.1.20:40511 2 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 
          2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : unloaded 3tx;^@^@�;^@^@�
          2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@�
          2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@�
          

          Looks like a race between unassigning and reattempting to unassign, which is something we've fought with in the past.

          Bouncing the master does not resolve this.

          table loc scan shows

          root@accumulo> scan -b 3tx -e 3txa -c loc -t !METADATA
          3tx;\x00\x00\xEC\xCC loc:1449daf9ff50ddc []    10.10.1.20:9997
          3tx;\x00\x00\xF9\x99 loc:1449daf9ff50ddc []    10.10.1.20:9997
          

          Which is strange, since we only see one erroring unassignment but two tablets with locations.

          I then bounced the tserver and it resolved itself.

          Show
          John Vines added a comment - So I'm seeing this again, I think. It is still with sqrrl extensions, but I do believe it's irrelevant. Critical items to note- Repeatedly seeing 2014-03-17 11:13:50,784 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@�" Those are actually the same character- root@ip-10-10-1-20:/data0/logs/accumulo# tail tserver_ip-10-10-1-20.debug.log | grep -a "not being served" | tail -n 1 | hexdump -C 00000000 32 30 31 34 2d 30 33 2d 31 37 20 31 36 3a 34 36 |2014-03-17 16:46| 00000010 3a 35 35 2c 34 31 31 20 5b 74 61 62 6c 65 74 73 |:55,411 [tablets| 00000020 65 72 76 65 72 2e 54 61 62 6c 65 74 53 65 72 76 |erver.TabletServ| 00000030 65 72 5d 20 49 4e 46 4f 20 3a 20 74 6f 6c 64 20 |er] INFO : told | 00000040 74 6f 20 75 6e 6c 6f 61 64 20 74 61 62 6c 65 74 |to unload tablet| 00000050 20 74 68 61 74 20 77 61 73 20 6e 6f 74 20 62 65 | that was not be| 00000060 69 6e 67 20 73 65 72 76 65 64 20 33 74 78 3b 00 |ing served 3tx;.| 00000070 00 ef bf bd 3b 00 00 ef bf bd 0a |....;......| 0000007b So it's trying to use a key extent with \x00\x00\xef\\xbf\bd as both the end row and prev end row, which is weird enough and MAY be relevent (side question - how is this happening?) and that there's no sign of this \x00\x00\xEF\xBF\xBD tablet in the !METADATA table. - metadata table view- root@accumulo !METADATA> scan -b 3tx -e 3txa -c ~tab -t !METADATA -np 3tx;\x00\x00\x06f ~tab:~pr [] \x00 3tx;\x00\x00\x0C\xCC ~tab:~pr [] \x01\x00\x00\x06f 3tx;\x00\x00\x133 ~tab:~pr [] \x01\x00\x00\x0C\xCC 3tx;\x00\x00\x19\x99 ~tab:~pr [] \x01\x00\x00\x133 3tx;\x00\x00 ~tab:~pr [] \x01\x00\x00\x19\x99 3tx;\x00\x00&f ~tab:~pr [] \x01\x00\x00 3tx;\x00\x00,\xCC ~tab:~pr [] \x01\x00\x00&f 3tx;\x00\x0033 ~tab:~pr [] \x01\x00\x00,\xCC 3tx;\x00\x009\x99 ~tab:~pr [] \x01\x00\x0033 3tx;\x00\x00@ ~tab:~pr [] \x01\x00\x009\x99 3tx;\x00\x00Ff ~tab:~pr [] \x01\x00\x00@ 3tx;\x00\x00L\xCC ~tab:~pr [] \x01\x00\x00Ff 3tx;\x00\x00S3 ~tab:~pr [] \x01\x00\x00L\xCC 3tx;\x00\x00Y\x99 ~tab:~pr [] \x01\x00\x00S3 3tx;\x00\x00` ~tab:~pr [] \x01\x00\x00Y\x99 3tx;\x00\x00ff ~tab:~pr [] \x01\x00\x00` 3tx;\x00\x00l\xCC ~tab:~pr [] \x01\x00\x00ff 3tx;\x00\x00s3 ~tab:~pr [] \x01\x00\x00l\xCC 3tx;\x00\x00y\x99 ~tab:~pr [] \x01\x00\x00s3 3tx;\x00\x00\x80 ~tab:~pr [] \x01\x00\x00y\x99 3tx;\x00\x00\x86f ~tab:~pr [] \x01\x00\x00\x80 3tx;\x00\x00\x8C\xCC ~tab:~pr [] \x01\x00\x00\x86f 3tx;\x00\x00\x933 ~tab:~pr [] \x01\x00\x00\x8C\xCC 3tx;\x00\x00\x99\x99 ~tab:~pr [] \x01\x00\x00\x933 3tx;\x00\x00\xA0 ~tab:~pr [] \x01\x00\x00\x99\x99 3tx;\x00\x00\xA6f ~tab:~pr [] \x01\x00\x00\xA0 3tx;\x00\x00\xAC\xCC ~tab:~pr [] \x01\x00\x00\xA6f 3tx;\x00\x00\xB33 ~tab:~pr [] \x01\x00\x00\xAC\xCC 3tx;\x00\x00\xB9\x99 ~tab:~pr [] \x01\x00\x00\xB33 3tx;\x00\x00\xC0 ~tab:~pr [] \x01\x00\x00\xB9\x99 3tx;\x00\x00\xC6f ~tab:~pr [] \x01\x00\x00\xC0 3tx;\x00\x00\xCC\xCC ~tab:~pr [] \x01\x00\x00\xC6f 3tx;\x00\x00\xD33 ~tab:~pr [] \x01\x00\x00\xCC\xCC 3tx;\x00\x00\xD9\x99 ~tab:~pr [] \x01\x00\x00\xD33 3tx;\x00\x00\xE0 ~tab:~pr [] \x01\x00\x00\xD9\x99 3tx;\x00\x00\xE6f ~tab:~pr [] \x01\x00\x00\xE0 3tx;\x00\x00\xEC\xCC ~tab:~pr [] \x01\x00\x00\xE6f 3tx;\x00\x00\xF9\x99 ~tab:~pr [] \x01\x00\x00\xEC\xCC 3tx;\x00\x01 ~tab:~pr [] \x01\x00\x00\xF9\x99 3tx< ~tab:~pr [] \x01\x00\x01 Tablet server logs- 2014-03-15 03:25:40,038 [tabletserver.TabletServer] INFO : unloaded 3tx;^@^@�;^@^@ٙ 2014-03-15 03:25:40,038 [tabletserver.Tablet] DEBUG: initiateClose(saveState= false queueMinC= false disableWrites= false ) 3tx;^@^@�;^@^@� 2014-03-15 03:25:40,038 [tabletserver.Tablet] DEBUG: completeClose(saveState= false completeClose= true ) 3tx;^@^@�;^@^@� 2014-03-15 03:25:40,038 [tabletserver.Tablet] TABLET_HIST: 3tx;^@^@�;^@^@� closed 2014-03-15 03:25:40,038 [tabletserver.TabletServer] DEBUG: Unassigning 3tx;^@^@�;^@^@�@( null ,10.10.1.20:9997[1449daf9ff50ddc], null ) 2014-03-15 03:25:40,042 [tabletserver.TabletServer] DEBUG: MultiScanSess 10.10.1.20:40511 2 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1) 2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : unloaded 3tx;^@^@�;^@^@� 2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@� 2014-03-15 03:25:40,043 [tabletserver.TabletServer] INFO : told to unload tablet that was not being served 3tx;^@^@�;^@^@� Looks like a race between unassigning and reattempting to unassign, which is something we've fought with in the past. Bouncing the master does not resolve this. table loc scan shows root@accumulo> scan -b 3tx -e 3txa -c loc -t !METADATA 3tx;\x00\x00\xEC\xCC loc:1449daf9ff50ddc [] 10.10.1.20:9997 3tx;\x00\x00\xF9\x99 loc:1449daf9ff50ddc [] 10.10.1.20:9997 Which is strange, since we only see one erroring unassignment but two tablets with locations. I then bounced the tserver and it resolved itself.
          Hide
          ASF subversion and git services added a comment -

          Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ]

          ACCUMULO-2361 clean-up at the end of the test

          Show
          ASF subversion and git services added a comment - Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ] ACCUMULO-2361 clean-up at the end of the test
          Hide
          ASF subversion and git services added a comment -

          Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ]

          ACCUMULO-2361 clean-up: remove commented code

          Show
          ASF subversion and git services added a comment - Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ] ACCUMULO-2361 clean-up: remove commented code
          Hide
          ASF subversion and git services added a comment -

          Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ]

          ACCUMULO-2361 clean-up at the end of the test

          Show
          ASF subversion and git services added a comment - Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ] ACCUMULO-2361 clean-up at the end of the test
          Hide
          ASF subversion and git services added a comment -

          Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ]

          ACCUMULO-2361 clean-up: remove commented code

          Show
          ASF subversion and git services added a comment - Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ] ACCUMULO-2361 clean-up: remove commented code
          Hide
          ASF subversion and git services added a comment -

          Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ]

          ACCUMULO-2361 clean-up at the end of the test

          Show
          ASF subversion and git services added a comment - Commit 8ca088e6a37aa57181aaff63b749a7e7c5deb203 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8ca088e ] ACCUMULO-2361 clean-up at the end of the test
          Hide
          ASF subversion and git services added a comment -

          Commit 491346ac1bc3795f3610d5c16fc209d1421b5a48 in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=491346a ]

          ACCUMULO-2361 integration test should be int src/test/java

          Show
          ASF subversion and git services added a comment - Commit 491346ac1bc3795f3610d5c16fc209d1421b5a48 in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=491346a ] ACCUMULO-2361 integration test should be int src/test/java
          Hide
          Eric Newton added a comment -

          Please reopen if you can reproduce on vanilla Apache Accumulo.

          Show
          Eric Newton added a comment - Please reopen if you can reproduce on vanilla Apache Accumulo.
          Hide
          Chris McCubbin added a comment -

          And after that, the log is flooded with those "2014-02-10 12:50:58,338 [tabletserver.TabletServer] DEBUG: Got unloadTablet message from user: !SYSTEM" messages, one every couple of ms.

          Show
          Chris McCubbin added a comment - And after that, the log is flooded with those "2014-02-10 12:50:58,338 [tabletserver.TabletServer] DEBUG: Got unloadTablet message from user: !SYSTEM" messages, one every couple of ms.
          Hide
          Eric Newton added a comment -

          I haven't worked through all the implications, but doing scan during prepareMutationsForCommit might not work. Since this is a sqrrl-specific extension, I'm not going to put any more time into it. Perhaps you can adopt the integration test I added to reproduce the problem with your full code base.

          Show
          Eric Newton added a comment - I haven't worked through all the implications, but doing scan during prepareMutationsForCommit might not work. Since this is a sqrrl-specific extension, I'm not going to put any more time into it. Perhaps you can adopt the integration test I added to reproduce the problem with your full code base.
          Hide
          Chris McCubbin added a comment -

          Here's a snip of the debug log...I'm actually not sure if this is from the master or tserver. But this is from right before we started seeing strange behavior.

          Show
          Chris McCubbin added a comment - Here's a snip of the debug log...I'm actually not sure if this is from the master or tserver. But this is from right before we started seeing strange behavior.
          Hide
          Chris McCubbin added a comment -

          I don't think I have a full trace from the master, but I may have a snippet that I made from when it looked like the issue started.

          I had requested a shutdown well after the issue started, that's when I took the jstacks.

          Show
          Chris McCubbin added a comment - I don't think I have a full trace from the master, but I may have a snippet that I made from when it looked like the issue started. I had requested a shutdown well after the issue started, that's when I took the jstacks.
          Hide
          Eric Newton added a comment -

          Any chance you have the master debug logs from this?

          I see a shutdown is being requested in the master jstacks.

          Did you notice the problem and attempt a shutdown? Or did you request a shutdown after noticing the problem?

          Show
          Eric Newton added a comment - Any chance you have the master debug logs from this? I see a shutdown is being requested in the master jstacks. Did you notice the problem and attempt a shutdown? Or did you request a shutdown after noticing the problem?
          Hide
          Chris McCubbin added a comment -

          Oops, I meant we were able to scan !METADATA.

          Show
          Chris McCubbin added a comment - Oops, I meant we were able to scan !METADATA.
          Hide
          Chris McCubbin added a comment -

          Also to answer your question Eric Newton : We were able to scan the master during this.

          Show
          Chris McCubbin added a comment - Also to answer your question Eric Newton : We were able to scan the master during this.
          Hide
          ASF subversion and git services added a comment -

          Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ]

          ACCUMULO-2361 clean-up: remove commented code

          Show
          ASF subversion and git services added a comment - Commit e0b8fb0e98c1b38c4fa2e20ae0611e33e07ecba6 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=e0b8fb0 ] ACCUMULO-2361 clean-up: remove commented code
          Hide
          Chris McCubbin added a comment -

          Here's the output of some listscans on the monitor:

          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
               10.10.1.107:9997 |    10.10.1.206:41386 |      3ms |        - |   IDLE | BATCH | !SYSTEM |!METADATA |[loc::, chopped::, log::, ~tab:~pr:, future::] |                     |       N/A |[wholeRows=1000,org.apache.accumulo.core.iterators.user.WholeRowIterator, tabletChange=1001,org.apache.accumulo.server.master.state.TabletStateChangeIterator] | {wholeRows={}, tabletChange={tables=2,1,!0,5, merges=, servers=10.10.1.60:9997[144274cd3170007],10.10.1.209:9997[144274cd317000b],10.10.1.129:9997[144274cd317000a],10.10.1.114:9997[144274cd3170009],10.10.1.107:9997[144274cd3170008]}}
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
          root@sqrrl> listscans
           TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
               10.10.1.209:9997 |    10.10.1.206:55270 |      1ms |        - |   IDLE | BATCH | !SYSTEM |!METADATA |[loc::, chopped::, log::, ~tab:~pr:, future::] |                     |       N/A |[wholeRows=1000,org.apache.accumulo.core.iterators.user.WholeRowIterator, tabletChange=1001,org.apache.accumulo.server.master.state.TabletStateChangeIterator] | {wholeRows={}, tabletChange={tables=2,1,!0,5, merges=, servers=10.10.1.60:9997[144274cd3170007],10.10.1.209:9997[144274cd317000b],10.10.1.129:9997[144274cd317000a],10.10.1.114:9997[144274cd3170009],10.10.1.107:9997[144274cd3170008]}}
          root@sqrrl> 
          
          Show
          Chris McCubbin added a comment - Here's the output of some listscans on the monitor: root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS 10.10.1.107:9997 | 10.10.1.206:41386 | 3ms | - | IDLE | BATCH | !SYSTEM |!METADATA |[loc::, chopped::, log::, ~tab:~pr:, future ::] | | N/A |[wholeRows=1000,org.apache.accumulo.core.iterators.user.WholeRowIterator, tabletChange=1001,org.apache.accumulo.server.master.state.TabletStateChangeIterator] | {wholeRows={}, tabletChange={tables=2,1,!0,5, merges=, servers=10.10.1.60:9997[144274cd3170007],10.10.1.209:9997[144274cd317000b],10.10.1.129:9997[144274cd317000a],10.10.1.114:9997[144274cd3170009],10.10.1.107:9997[144274cd3170008]}} root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS root@sqrrl> listscans TABLET SERVER | CLIENT | AGE | LAST | STATE | TYPE | USER | TABLE | COLUMNS | AUTHORIZATIONS | TABLET | ITERATORS | ITERATOR OPTIONS 10.10.1.209:9997 | 10.10.1.206:55270 | 1ms | - | IDLE | BATCH | !SYSTEM |!METADATA |[loc::, chopped::, log::, ~tab:~pr:, future ::] | | N/A |[wholeRows=1000,org.apache.accumulo.core.iterators.user.WholeRowIterator, tabletChange=1001,org.apache.accumulo.server.master.state.TabletStateChangeIterator] | {wholeRows={}, tabletChange={tables=2,1,!0,5, merges=, servers=10.10.1.60:9997[144274cd3170007],10.10.1.209:9997[144274cd317000b],10.10.1.129:9997[144274cd317000a],10.10.1.114:9997[144274cd3170009],10.10.1.107:9997[144274cd3170008]}} root@sqrrl>
          Hide
          Chris McCubbin added a comment -

          screencaps of server monitor page while this was happening

          Show
          Chris McCubbin added a comment - screencaps of server monitor page while this was happening
          Hide
          Chris McCubbin added a comment -

          Jstacks of master and tserver taken a few seconds apart while the condition was occurring.

          Show
          Chris McCubbin added a comment - Jstacks of master and tserver taken a few seconds apart while the condition was occurring.
          Hide
          Eric Newton added a comment -

          My test can reliably produce the "saw inconsistency" message, but the master does not get stuck: it keeps executing the state machine and all my tables get deleted.

          If you don't have log messages from this incident, I'm going to close this with "Cannot reproduce."

          Show
          Eric Newton added a comment - My test can reliably produce the "saw inconsistency" message, but the master does not get stuck: it keeps executing the state machine and all my tables get deleted. If you don't have log messages from this incident, I'm going to close this with "Cannot reproduce."
          Hide
          ASF subversion and git services added a comment -

          Commit ad9f9c13dce4c9bc635c4645a5bafd04131a33cd in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=ad9f9c1 ]

          ACCUMULO-2361 made test a functional SimpleIT for 1.6 and later

          Show
          ASF subversion and git services added a comment - Commit ad9f9c13dce4c9bc635c4645a5bafd04131a33cd in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=ad9f9c1 ] ACCUMULO-2361 made test a functional SimpleIT for 1.6 and later
          Hide
          ASF subversion and git services added a comment -

          Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ]

          ACCUMULO-2361 IT to reproduce

          Show
          ASF subversion and git services added a comment - Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ] ACCUMULO-2361 IT to reproduce
          Hide
          ASF subversion and git services added a comment -

          Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ]

          ACCUMULO-2361 IT to reproduce

          Show
          ASF subversion and git services added a comment - Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ] ACCUMULO-2361 IT to reproduce
          Hide
          ASF subversion and git services added a comment -

          Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ]

          ACCUMULO-2361 IT to reproduce

          Show
          ASF subversion and git services added a comment - Commit 00ff44819bf97063ad04a96c924a2fde0a10bdd6 in accumulo's branch refs/heads/1.5.1-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=00ff448 ] ACCUMULO-2361 IT to reproduce
          Hide
          ASF subversion and git services added a comment -

          Commit 1a66f3bd851cc171b26c6b9e62c515338207afac in accumulo's branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=1a66f3b ]

          ACCUMULO-2361 fix missing space in error message

          Show
          ASF subversion and git services added a comment - Commit 1a66f3bd851cc171b26c6b9e62c515338207afac in accumulo's branch refs/heads/master from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=1a66f3b ] ACCUMULO-2361 fix missing space in error message
          Hide
          ASF subversion and git services added a comment -

          Commit 1a66f3bd851cc171b26c6b9e62c515338207afac in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=1a66f3b ]

          ACCUMULO-2361 fix missing space in error message

          Show
          ASF subversion and git services added a comment - Commit 1a66f3bd851cc171b26c6b9e62c515338207afac in accumulo's branch refs/heads/1.6.0-SNAPSHOT from Eric Newton [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=1a66f3b ] ACCUMULO-2361 fix missing space in error message
          Hide
          Eric Newton added a comment -

          I'll take anything you've got.

          Could you scan the !METADATA table when the master was locked up on this?

          Show
          Eric Newton added a comment - I'll take anything you've got. Could you scan the !METADATA table when the master was locked up on this?
          Hide
          Chris McCubbin added a comment -

          Yes, the master was stuck in this lop until I hard-rebooted accumulo. Until I did that accumulo was totally non-responsive to anything. The monitor kept updating but i couldn't insert, add/drop tables, etc.

          I'm thinking it must be some kind of race condition, because I've seen this happen a couple of times now but I can't reliably reproduce the effect.

          I have some other information from this instance if you need it, like listscans from the shell and master jstacks.

          Show
          Chris McCubbin added a comment - Yes, the master was stuck in this lop until I hard-rebooted accumulo. Until I did that accumulo was totally non-responsive to anything. The monitor kept updating but i couldn't insert, add/drop tables, etc. I'm thinking it must be some kind of race condition, because I've seen this happen a couple of times now but I can't reliably reproduce the effect. I have some other information from this instance if you need it, like listscans from the shell and master jstacks.
          Hide
          Eric Newton added a comment -

          Chris McCubbin I'm trying to reproduce the problem in an integration test ... without much luck.

          Do you happen to know how long the master was stuck and repeated this message about inconsistency?

          I was able to reproduce the message, but it only shows up once.

          Show
          Eric Newton added a comment - Chris McCubbin I'm trying to reproduce the problem in an integration test ... without much luck. Do you happen to know how long the master was stuck and repeated this message about inconsistency? I was able to reproduce the message, but it only shows up once.
          Hide
          Eric Newton added a comment -

          Strange. The master is seeing the first tablet as: 4;\x00\x00\x0c?;\x00\x00\x06f and the first tablet is 4;\x00\x00\x0c?<. It could be confused by a recent split, but eventually it should see the METADATA table scan you included above (which looks correct). Maybe there's some caching that isn't getting reset because we're not trying to find the tablets?

          Show
          Eric Newton added a comment - Strange. The master is seeing the first tablet as: 4;\x00\x00\x0c?;\x00\x00\x06f and the first tablet is 4;\x00\x00\x0c?< . It could be confused by a recent split, but eventually it should see the METADATA table scan you included above (which looks correct). Maybe there's some caching that isn't getting reset because we're not trying to find the tablets?

            People

            • Assignee:
              Unassigned
              Reporter:
              Chris McCubbin
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development