Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: master, tserver
    • Labels: None
    • Environment: Hadoop 2.2.0 and ZooKeeper 3.4.5

Description

      Anthony F reports the following:

      I have observed a loss of data when tservers fail during bulk ingest. The keys that are missing are right around the table's splits indicating that data was lost when a tserver died during a split. I am using Accumulo 1.5.0. At around the same time, I observe the master logging a message about "Found two locations for the same extent".

      And:

      I'm currently digging through the logs and will report back. Keep in mind, I'm using Accumulo 1.5.0 on a Hadoop 2.2.0 stack. To determine data loss, I have a 'ConsistencyCheckingIterator' that verifies each row has the expected data (it takes a long time to scan the whole table). Below is a quick summary of what happened. The tablet in question is "d;72~gcm~201304". Notice that it is assigned to 192.168.2.233:9997[343bc1fa155242c] at 2014-01-25 09:49:36,233. At 2014-01-25 09:49:54,141, the tserver goes away. Then, the tablet gets assigned to 192.168.2.223:9997[143bc1f14412432] and shortly after that, I see the BadLocationStateException. The master never recovers from the BLSE - I have to manually delete one of the offending locations.

      2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
      2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet p;18~thm~2012101;18=192.168.2.233:9997[343bc1fa155242c]
      2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers [192.168.2.233:9997[343bc1fa155242c]]
      2014-01-25 09:49:56,866 [master.Master] DEBUG: 42 assigned to dead servers: [d;03~u36~201302;03~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null), d;06~u36~2013;06~thm~2012083@(null,192.168.2.233:9997[343bc1fa155242c],null), d;25;24~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null), d;25~u36~201303;25~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null), d;27~gcm~2013041;27@(null,192.168.2.233:9997[343bc1fa155242c],null), d;30~u36~2013031;30~thm~2012082@(null,192.168.2.233:9997[343bc1fa155242c],null), d;34~thm;34~gcm~2013022@(null,192.168.2.233:9997[343bc1fa155242c],null), d;39~thm~20121;39~gcm~20130418@(null,192.168.2.233:9997[343bc1fa155242c],null), d;41~thm;41~gcm~2013041@(null,192.168.2.233:9997[343bc1fa155242c],null), d;42~u36~201304;42~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), d;45~thm~201208;45~gcm~201303@(null,192.168.2.233:9997[343bc1fa155242c],null), d;48~gcm~2013052;48@(null,192.168.2.233:9997[343bc1fa155242c],null), d;60~u36~2013021;60~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), d;68~gcm~2013041;68@(null,192.168.2.233:9997[343bc1fa155242c],null), d;72;71~u36~2013@(null,192.168.2.233:9997[343bc1fa155242c],null), d;72~gcm~201304;72@(192.168.2.233:9997[343bc1fa155242c],null,null), d;75~thm~2012101;75~gcm~2013032@(null,192.168.2.233:9997[343bc1fa155242c],null), d;78;77~u36~201305@(null,192.168.2.233:9997[343bc1fa155242c],null), d;90~u36~2013032;90~thm~2012092@(null,192.168.2.233:9997[343bc1fa155242c],null), d;91~thm;91~gcm~201304@(null,192.168.2.233:9997[343bc1fa155242c],null), d;93~u36~2013012;93~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), m;20;19@(null,192.168.2.233:9997[343bc1fa155242c],null), m;38;37@(null,192.168.2.233:9997[343bc1fa155242c],null), m;51;50@(null,192.168.2.233:9997[343bc1fa155242c],null), m;60;59@(null,192.168.2.233:9997[343bc1fa155242c],null), m;92;91@(null,192.168.2.233:9997[343bc1fa155242c],null), o;01<@(null,192.168.2.233:9997[343bc1fa155242c],null), o;04;03@(null,192.168.2.233:9997[343bc1fa155242c],null), o;50;49@(null,192.168.2.233:9997[343bc1fa155242c],null), o;63;62@(null,192.168.2.233:9997[343bc1fa155242c],null), o;74;73@(null,192.168.2.233:9997[343bc1fa155242c],null), o;97;96@(null,192.168.2.233:9997[343bc1fa155242c],null), p;08~thm~20121;08@(null,192.168.2.233:9997[343bc1fa155242c],null), p;09~thm~20121;09@(null,192.168.2.233:9997[343bc1fa155242c],null), p;10;09~thm~20121@(null,192.168.2.233:9997[343bc1fa155242c],null), p;18~thm~2012101;18@(192.168.2.233:9997[343bc1fa155242c],null,null), p;21;20~thm~201209@(null,192.168.2.233:9997[343bc1fa155242c],null), p;22~thm~2012091;22@(null,192.168.2.233:9997[343bc1fa155242c],null), p;23;22~thm~2012091@(null,192.168.2.233:9997[343bc1fa155242c],null), p;41~thm~2012111;41@(null,192.168.2.233:9997[343bc1fa155242c],null), p;42;41~thm~2012111@(null,192.168.2.233:9997[343bc1fa155242c],null), p;58~thm~201208;58@(null,192.168.2.233:9997[343bc1fa155242c],null)]...
      2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
      2014-01-25 09:50:13,515 [master.EventCoordinator] INFO : tablet d;72~gcm~201304;72 was loaded on 192.168.2.223:9997
      2014-01-25 09:51:20,058 [state.MetaDataTableScanner] ERROR: java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304: 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
      java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent d;72~gcm~201304: 192.168.2.223:9997[143bc1f14412432] and 192.168.2.233:9997[343bc1fa155242c]
      

Activity

          ecn Eric Newton added a comment -

          Actually, I don't recommend deleting both loc entries... only the last entries. If you have two locs, we'll want to know why.

          Since 10.157.33.251:9997 had just recompacted the tablet !0;~, I would kill 10.157.42.152:9997 and remove the entry:

          !0;~ loc:244060a193e001d

          But that's just playing the odds... if you can reproduce this and provide logs, please let me know.
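
          As a minimal sketch only (not from this thread): the same single-entry removal could be done with the Java client API instead of the shell, assuming a Connector with write permission on the !METADATA table. The row and session id below are the ones from this report; everything else is illustrative.

          import org.apache.accumulo.core.client.BatchWriter;
          import org.apache.accumulo.core.client.BatchWriterConfig;
          import org.apache.accumulo.core.client.Connector;
          import org.apache.accumulo.core.data.Mutation;

          public class RemoveStaleLoc {
            static void removeStaleLoc(Connector conn) throws Exception {
              BatchWriter bw = conn.createBatchWriter("!METADATA", new BatchWriterConfig());
              try {
                Mutation m = new Mutation("!0;~");
                // delete only the loc entry for the server believed to be stale,
                // never both loc entries
                m.putDelete("loc", "244060a193e001d");
                bw.addMutation(m);
              } finally {
                bw.close();
              }
            }
          }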

          bills William Slacum added a comment -

          I've seen this twice in the past two days on a 1.5.0 instance on AWS. The first time I could repair it because it was on a user table, but now this happened on the !METADATA table.

          I'm unable to scan the !METADATA table, so I had to offline it and dig through some of its RFiles. In the root_tablet directory, I found:

          !!~del/!0/default_tablet/A00006zz.rf : [] 8189 false ->
          !!~del/!0/table_info/A00006zx.rf : [] 8186 false ->
          !!~del/!0/table_info/F0000704.rf : [] 8187 false ->
          !0;!0< srv:dir [] 0 false -> /root_tablet
          !0;!0< ~tab:~pr [] 0 false -> ^@
          !0;~ file:/table_info/A0000706.rf [] 8188 false -> 4457,270
          !0;~ last:144060a192d001b [] 8188 false -> 10.157.33.251:9997
          !0;~ loc:144060a192d001b [] 7251 false -> 10.157.33.251:9997
          !0;~ loc:244060a193e001d [] 7256 false -> 10.157.42.152:9997
          !0;~ srv:compact [] 8188 false -> 673
          !0;~ srv:dir [] 0 false -> /table_info
          !0;~ srv:flush [] 8185 false -> 674
          !0;~ srv:lock [] 8188 false -> tservers/10.157.33.251:9997/zlock-0000000003$144060a192d001b
          !0;~ srv:time [] 8185 false -> L3060
          !0;~ ~tab:~pr [] 0 false -> ^A!0<
          !0< file:/default_tablet/A0000707.rf [] 8190 false -> 310,3
          !0< last:144060a192d001b [] 8190 false -> 10.157.33.251:9997
          !0< loc:144060a192d001b [] 7244 false -> 10.157.33.251:9997
          !0< srv:compact [] 8190 false -> 674
          !0< srv:dir [] 0 false -> /default_tablet
          !0< srv:flush [] 8184 false -> 674
          !0< srv:lock [] 8190 false -> tservers/10.157.33.251:9997/zlock-0000000003$144060a192d001b
          !0< srv:time [] 7624 false -> L2284
          !0< ~tab:~pr [] 0 false -> ^A~
          

          Unfortunately I don't have any logs due to a configuration issue. I do believe that both of those servers were actually alive at the time the issue was noticed. I talked to Eric Newton yesterday when this was on a user table, and he told me if both servers are alive, I should delete both loc entries. What are the effects of doing that? Are they different when they occur on the root tablet?

          jira-bot ASF subversion and git services added a comment -

          Commit 772ca16bc8cbaef6db4bbd440145dc744947fedb in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=772ca16 ]

          ACCUMULO-2261 cherry pick back 1.5

          jira-bot ASF subversion and git services added a comment -

          Commit 772ca16bc8cbaef6db4bbd440145dc744947fedb in branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=772ca16 ]

          ACCUMULO-2261 cherry pick back 1.5

          jira-bot ASF subversion and git services added a comment -

          Commit 772ca16bc8cbaef6db4bbd440145dc744947fedb in branch refs/heads/1.5.1-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=772ca16 ]

          ACCUMULO-2261 cherry pick back 1.5

          jira-bot ASF subversion and git services added a comment -

          Commit f5c35b9502ba107b8384a0760b1b9ba8e056ef85 in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f5c35b9 ]

          ACCUMULO-2261 remove unexpected future and loc entries for dead servers from the metadata tables

          jira-bot ASF subversion and git services added a comment -

          Commit f5c35b9502ba107b8384a0760b1b9ba8e056ef85 in branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=f5c35b9 ]

          ACCUMULO-2261 remove unexpected future and loc entries for dead servers from the metadata tables

          kturner Keith Turner added a comment -

          Josh Elser wrote:

              Right, I was just making the distinction between making the check in the AssignmentHandler and in the MetadataConstraints.

          Doing the check in the metadata constraint means it will be done later. If the assignment handler does the check and then issues a metadata table update, the actual update could take place much later (even after the requesting tserver has died). It's also possible that the lock could be lost when doing the check in the constraint. The check in the constraint is a sanity check, but it does not prevent all problems. Doing the sanity check as late as possible is better. Note that even if ZooCache is not used, there is still a race condition: if the lock is lost after zookeeper is checked directly, then it's a problem. Using conditional mutations (and CAS in zookeeper) is a better solution. It allows us to only make the mutation if the future and current locations are in the expected state.

          Eric Newton, I don't think we should make 1.7 use conditional mutations for just this case. I think 1.7 should have a ticket to make one big change so that all metadata operations use CAS w/ zookeeper and the metadata table in places where it makes sense. It seems like it would be nice to do this in a branch, work out the kinks, and then merge it in. I will open a ticket. This is something I was thinking of doing for 1.6, but never got around to it.
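
          For reference, here is a rough sketch of what a CAS-style location update like the one described above could look like with the 1.6 ConditionalWriter API. It is illustrative only, not the actual fix: the row, column names, and choice of conditions are assumptions, and "accumulo.metadata" is the 1.6+ metadata table name.

          import org.apache.accumulo.core.client.ConditionalWriter;
          import org.apache.accumulo.core.client.ConditionalWriterConfig;
          import org.apache.accumulo.core.client.Connector;
          import org.apache.accumulo.core.data.Condition;
          import org.apache.accumulo.core.data.ConditionalMutation;
          import org.apache.accumulo.core.data.Value;

          public class CasLocationUpdate {
            // Atomically convert a future (assigned) entry into a current loc entry,
            // but only if the master's future entry is still in the expected state.
            static boolean setLocation(Connector conn, String metadataRow,
                String session, String hostPort) throws Exception {
              ConditionalWriter cw =
                  conn.createConditionalWriter("accumulo.metadata", new ConditionalWriterConfig());
              try {
                ConditionalMutation cm = new ConditionalMutation(metadataRow);
                // require the future entry written by the master to still be present
                cm.addCondition(new Condition("future", session).setValue(hostPort));
                cm.putDelete("future", session);
                cm.put("loc", session, new Value(hostPort.getBytes()));
                ConditionalWriter.Result result = cw.write(cm);
                // ACCEPTED means the condition held and the mutation was applied
                return result.getStatus() == ConditionalWriter.Status.ACCEPTED;
              } finally {
                cw.close();
              }
            }
          }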

          elserj Josh Elser added a comment -

          Eric Newton wrote:

              That's what I meant with this: "have to make a full zookeeper check in the METADATA table constraint"

          Right, I was just making the distinction between making the check in the AssignmentHandler and in the MetadataConstraints.

          Regarding testing, let me know – I'd be more than happy to help write/test/research/etc.

          ecn Eric Newton added a comment -

          Josh Elser wrote:

              Isn't another possibility to circumvent the ZooCache in the TabletServer

          That's what I meant with this:

              have to make a full zookeeper check in the METADATA table constraint

          We need to test it in the context of the only place where updates to metadata table information are atomic: at the tablet server serving the metadata tablet (root table excluded).

          I've been thinking for the last hour about what I would expect the performance to be in a cold start... and I really don't know what to expect. I know there's network latency to make the RPC call, but zookeeper can perform batches of read-only operations on slave nodes. Do we have to make a sync() call to ensure we have the most up-to-date information? What does that require? We need a scale test (perhaps there's a benchmark out there) to tell us what the cost would be.

          elserj Josh Elser added a comment -

          Eric Newton wrote:

              It looks like a race between the dying tserver and the zookeeper cache update. Keith Turner and I discussed fixing this in 1.6 using conditional mutations. In 1.5 we would have to make a full zookeeper check in the METADATA table constraint, which could be a performance issue.

          I was poking around at this. Isn't another possibility to circumvent the ZooCache in the TabletServer before calling TabletStateStore.setLocation(assignment), or to verify that the lock is held before invoking that setLocation method? Instead of trying to push that down to a constraint, we could make sure that we don't do a cached read of ZK. If it weren't in the constraint, it shouldn't cause a performance loss on the METADATA table, but just slow down assignments instead, no?

          I'm not really sure of the relative performance impact of a full zk check over what ZooCache/ZooLock provides. A lot of our ZK wrapper classes tend to make my head spin.
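
          As an aside, a direct (uncached) check of the lock node is just a plain ZooKeeper read. The sketch below is illustrative only, not Accumulo's actual ZooLock/ZooCache code path, and the example lock path is hypothetical.

          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          public class DirectLockCheck {
            // Returns true if the tserver's lock node still exists, bypassing any cache.
            // Example path: /accumulo/<instance-id>/tservers/192.168.2.233:9997/zlock-0000000000
            static boolean lockStillHeld(ZooKeeper zk, String lockNodePath) throws Exception {
              // exists() reads from the ZooKeeper server we are connected to; a sync()
              // call could be issued first if the very latest view from the leader is needed.
              Stat stat = zk.exists(lockNodePath, false);
              return stat != null;
            }
          }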

          ecn Eric Newton added a comment -

          Anthony Fox, are you sure that the other processes on your nodes wouldn't push a tserver into swap? If you are really this unlucky, I want you to help test all future releases.

          I've written a fix using conditional mutations, but I'm going to leave it for 1.7. Conditional mutations are new and the METADATA table is a tricky beast. And we still need a fix for the root table pointer. Instead, I'll have the master repair the metadata information if it finds it is relatively safe to do so. So, if it sees multiple locations, and one is a dead tserver, it will remove the entry. I'll back-port it to 1.5, too.

          I'm decreasing the severity because there is a known work-around (removing the entry in the !METADATA table by hand). Feel free to raise it back up if you disagree.
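
          A rough sketch of the repair behavior described above (hypothetical helper types and names, not the actual patch): when more than one location is present for an extent and one of them belongs to a server the master knows is dead, drop the dead server's loc entry instead of halting on BadLocationStateException.

          import java.util.Map;
          import java.util.Set;
          import org.apache.accumulo.core.client.BatchWriter;
          import org.apache.accumulo.core.data.Mutation;

          public class DeadLocationRepair {
            // locations maps session id -> host:port for a single extent's metadata row
            static void repair(BatchWriter metadataWriter, String extentRow,
                Map<String, String> locations, Set<String> deadServers) throws Exception {
              if (locations.size() < 2)
                return; // a single location is not the error case
              Mutation m = new Mutation(extentRow);
              boolean changed = false;
              for (Map.Entry<String, String> entry : locations.entrySet()) {
                if (deadServers.contains(entry.getValue())) {
                  // remove only the loc entry for the dead server
                  m.putDelete("loc", entry.getKey());
                  changed = true;
                }
              }
              if (changed)
                metadataWriter.addMutation(m);
            }
          }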

          anthonyf Anthony Fox added a comment -

          System swappiness is set to zero

          $ cat /proc/sys/vm/swappiness 
          0
          

          Tserver opts are:

          ACCUMULO_TSERVER_OPTS="${POLICY} -Xmx12g -Xms12g -XX:NewSize=1G -XX:MaxNewSize=1G "
          
          ecn Eric Newton added a comment -

          Master assigns the tablet:

          2014-01-25 09:49:36,233 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.233:9997[343bc1fa155242c]
          

          The tablet server loads the tablet:

          2014-01-25 09:49:21,472 [tabletserver.TabletServer] DEBUG: gc ParNew=904.08(+0.99) secs ConcurrentMarkSweep=6.22(+0.00) secs freemem=3,574,451,336(-158,848,464) totalmem=12,777,553,920
          2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO : Loading tablet d;72~gcm~201304;72
          

          The tablet server loses its lock:

          2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: gc ParNew=926.22(+22.14) secs ConcurrentMarkSweep=6.25(+0.03) secs freemem=3,477,953,072(-96,498,264) totalmem=12,777,553,920
          2014-01-25 09:49:54,141 [master.Master] WARN : Lost servers [192.168.2.233:9997[343bc1fa155242c]]
          

          That last line of GC debugging says that the ParNew GC took over 22 seconds; ParNew is a stop-the-world GC, and this caused the server to lose its lock. Yet it continued to run for 15 more seconds.

          The master reassigns the tablet.

          2014-01-25 09:49:59,706 [master.Master] DEBUG: Normal Tablets assigning tablet d;72~gcm~201304;72=192.168.2.223:9997[143bc1f14412432]
          

          It looks like a race between the dying tserver and the zookeeper cache update. Keith Turner and I discussed fixing this in 1.6 using conditional mutations. In 1.5 we would have to make a full zookeeper check in the METADATA table constraint, which could be a performance issue. We could also have the master fix the !METADATA table instead of leaving the system crippled.

          Anthony Fox, are you having issues with tservers frequently losing their locks? A few things to check: make sure swappiness is set to zero, make sure the size of the ParNew generation is limited (-XX:NewSize=1G -XX:MaxNewSize=1G), and check that system memory isn't oversubscribed.

          anthonyf Anthony Fox added a comment - edited

          We have ntp set up to synchronize system clocks. Here's the log from 192.168.2.233.

          2014-01-25 09:49:11,708 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=483.68814MB (507183784), Free=1564.3119MB (1640299864), Max=2048.0MB (2147483648), Counts: Blocks=835, Access=30151167, Hit=30150280, Miss=887, Evictions=0, Evicted=0, Ratios: Hit Ratio=99.9970555305481%, Miss Ratio=0.0029418428312055767%, Evicted/Run=NaN, Duplicate Reads=18
          2014-01-25 09:49:11,711 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=6742.9854MB (7070532384), Free=1449.0149MB (1519402208), Max=8192.0MB (8589934592), Counts: Blocks=68811, Access=43909183, Hit=43363531, Miss=545652, Evictions=57, Evicted=476838, Ratios: Hit Ratio=98.75731468200684%, Miss Ratio=1.242683082818985%, Evicted/Run=8365.5791015625, Duplicate Reads=0
          2014-01-25 09:49:11,712 [tabletserver.TabletServer] DEBUG: gc ParNew=897.13(+17.22) secs ConcurrentMarkSweep=6.22(+0.00) secs freemem=4,064,834,072(+1,745,404,552) totalmem=12,777,553,920
          2014-01-25 09:49:17,455 [tabletserver.TabletServer] DEBUG: gc ParNew=901.79(+4.67) secs ConcurrentMarkSweep=6.22(+0.00) secs freemem=3,913,871,704(+667,572,648) totalmem=12,777,553,920
          2014-01-25 09:49:17,871 [cache.LruBlockCache] DEBUG: Block cache LRU eviction started.  Attempting to free 859003704 bytes
          2014-01-25 09:49:17,942 [cache.LruBlockCache] DEBUG: Block cache LRU eviction completed. Freed 859019280 bytes.  Priority Sizes: Single=2512.0676MB (2634093856), Multi=4456.495MB (4672973712),Memory=0.0MB (0)
          2014-01-25 09:49:19,569 [tabletserver.TabletServer] DEBUG: gc ParNew=903.09(+1.30) secs ConcurrentMarkSweep=6.22(+0.00) secs freemem=3,733,299,800(-180,571,904) totalmem=12,777,553,920
          2014-01-25 09:49:21,472 [tabletserver.TabletServer] DEBUG: gc ParNew=904.08(+0.99) secs ConcurrentMarkSweep=6.22(+0.00) secs freemem=3,574,451,336(-158,848,464) totalmem=12,777,553,920
          2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
          2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO : Loading tablet d;72~gcm~201304;72
          2014-01-25 09:49:44,408 [tabletserver.TabletServer] INFO : dcloud18/192.168.2.233:9997: got assignment from master: d;72~gcm~201304;72
          2014-01-25 09:49:44,408 [tabletserver.TabletServer] DEBUG: Loading extent: d;72~gcm~201304;72
          2014-01-25 09:49:44,409 [tabletserver.TabletServer] DEBUG: verifying extent d;72~gcm~201304;72
          2014-01-25 09:49:44,418 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
          2014-01-25 09:49:44,418 [tabletserver.TabletServer] INFO : Loading tablet p;18~thm~2012101;18
          2014-01-25 09:49:44,456 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
          2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: Looking at metadata {d;72~gcm~201304 file:/b-00028xn/I0002i57.rf [] 40757550 false=1625685,0,1389849852187, d;72~gcm~201304 file:/b-00028xn/I0002i5f.rf [] 40757699 false=1592695,0,1389849860660, d;72~gcm~201304 file:/t-0001ect/C0001nrv.rf [] 40418333 false=178655073,3168593, d;72~gcm~201304 file:/t-0001rzt/C0001u7r.rf [] 40441613 false=232169298,4085196, d;72~gcm~201304 file:/t-0001rzt/C0002evg.rf [] 40708788 false=281016452,4939001, d;72~gcm~201304 file:/t-0001rzt/C0002j3b.rf [] 40750189 false=36970878,670358, d;72~gcm~201304 file:/t-0001rzt/C0002j5f.rf [] 40756562 false=22086620,394464, d;72~gcm~201304 future:343bc1fa155242c [] 41048300 false=192.168.2.233:9997, d;72~gcm~201304 last:3438da6a936264c [] 40756562 false=192.168.2.225:9997, d;72~gcm~201304 srv:dir [] 40418333 false=/t-0001rzt, d;72~gcm~201304 srv:flush [] 40825738 false=9, d;72~gcm~201304 srv:lock [] 40825738 false=tservers/192.168.2.223:9997/zlock-0000000001$243b1084d9e1273, d;72~gcm~201304 srv:time [] 40757699 false=M1389849860660, d;72~gcm~201304 ~tab:~pr [] 40418333 false=^A72}
          2014-01-25 09:49:44,557 [tabletserver.Tablet] DEBUG: got [] for logs for d;72~gcm~201304;72
          2014-01-25 09:49:44,557 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
          2014-01-25 09:49:44,557 [zookeeper.ZooLock] DEBUG: Unexpected event watching lock node WatchedEvent state:Disconnected type:None path:null /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/tservers/192.168.2.233:9997/zlock-0000000000
          2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: gc ParNew=926.22(+22.14) secs ConcurrentMarkSweep=6.25(+0.03) secs freemem=3,477,953,072(-96,498,264) totalmem=12,777,553,920
          2014-01-25 09:49:44,560 [tabletserver.TabletServer] DEBUG: Got loadTablet message from user: !SYSTEM
          2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=6630.77MB (6952866144), Free=1561.2301MB (1637068448), Max=8192.0MB (8589934592), Counts: Blocks=67662, Access=43916676, Hit=43363810, Miss=552866, Evictions=58, Evicted=485201, Ratios: Hit Ratio=98.74109625816345%, Miss Ratio=1.2588976882398129%, Evicted/Run=8365.5341796875, Duplicate Reads=0
          2014-01-25 09:50:09,984 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=513.6937MB (538646880), Free=1534.3063MB (1608836768), Max=2048.0MB (2147483648), Counts: Blocks=845, Access=30157409, Hit=30156512, Miss=897, Evictions=0, Evicted=0, Ratios: Hit Ratio=99.99703168869019%, Miss Ratio=0.0029743934646830894%, Evicted/Run=NaN, Duplicate Reads=18
          2014-01-25 09:50:28,730 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
          org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f76cacfa-e117-4999-893a-1eba79920f2c/config/tserver.client.timeout
                  at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
                  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
                  at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
                  at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:208)
          
          ecn Eric Newton added a comment -

          Oh, you're right. Do you have the end of the log for Server A? And are the times across your system synchronized? I'm trying to establish how this happened, to see if there's another check we can add to mitigate it.

          anthonyf Anthony Fox added a comment -

          The tservers do not have the same IP. Server A is 192.168.2.233 while server B is 192.168.2.223 - off by ten, which is admittedly a bit confusing.

          ecn Eric Newton added a comment -

          In this case, a tablet was assigned to server A, and that server was considered down before the tablet was loaded. It was then assigned to server B. But both servers A and B updated the location information. And both servers are running on the same computer (since they have the same IP).

          There are several mechanisms used to avoid this problem:

          1. Port number: two processes should not be able to open the same address on a unix-like system.
          2. ZooKeeper lock: server B should not have been able to get a lock while server A still held its lock.
          3. Updates to the !METADATA table (which holds the assigned and current locations) are protected by a constraint that verifies that the writing server still holds its lock (see the sketch after this list).
          4. The tablet server checks that the tablet is assigned to it before it loads the tablet.
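
          To illustrate mechanism 3, here is a sketch of what such a constraint can look like using the public Constraint interface. It is not Accumulo's actual MetadataConstraints; the column family names and the lockStillHeld() helper are placeholders.

          import java.util.Collections;
          import java.util.List;
          import org.apache.accumulo.core.constraints.Constraint;
          import org.apache.accumulo.core.data.ColumnUpdate;
          import org.apache.accumulo.core.data.Mutation;

          public class LocationLockConstraint implements Constraint {
            private static final short LOCK_NOT_HELD = 1;

            @Override
            public String getViolationDescription(short violationCode) {
              return violationCode == LOCK_NOT_HELD ? "tserver lock not held for location update" : null;
            }

            @Override
            public List<Short> check(Environment env, Mutation mutation) {
              for (ColumnUpdate update : mutation.getUpdates()) {
                String family = new String(update.getColumnFamily());
                if (family.equals("loc") || family.equals("future")) {
                  String session = new String(update.getColumnQualifier());
                  if (!lockStillHeld(session)) // hypothetical lock verification
                    return Collections.singletonList(LOCK_NOT_HELD);
                }
              }
              return null; // no violations
            }

            private boolean lockStillHeld(String session) {
              // Placeholder: the real check consults ZooKeeper, which is exactly where the
              // stale ZooCache read discussed in this issue comes into play.
              return true;
            }
          }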

          Decoding the write-ahead logs of the !METADATA tablet would give us a clearer idea of the order in which things happened. If this happens again, please copy the !METADATA walog files for post-analysis. Look for these entries:

          shell> scan -b !0; -e !0< -c log
          

          For this to have happened, the master needed to see the zookeeper lock of A expire and read the old status of "assigned," server B had to start and be noticed by the master, the old server had to write a last-gasp update to the !METADATA table, and the server serving the !METADATA table had to read a cached, old lock from zookeeper.

          Are you automatically restarting your tservers, by any chance?


People

    • Assignee: ecn Eric Newton
    • Reporter: ecn Eric Newton
    • Votes: 0
    • Watchers: 5
