[ACCUMULO-327] master lost all tablet servers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: tserver
Labels: 14_qa_bug
Environment: running the random walk test on a small cluster

Description

The master would occasionally take a long time to collect status information from a tablet server. The connection would time out after the default 120-second RPC timeout. This probably left the connection in a bad state, because I am seeing:

org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got 0
        at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:445)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:893)
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:876)
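
The "bad state" guess can be illustrated with a toy model. This is a sketch under stated assumptions, not Thrift's or Accumulo's code: the one-byte length framing and the names `encode_message`, `read_message`, and `demo_stale_read` are hypothetical. The only detail taken from the trace is the protocol id, where `ffffff82` is presumably the byte 0x82 printed after sign extension. If a client gives up partway through a reply and then reuses the same connection, the next read starts in the middle of the old reply and can land on an arbitrary byte such as 0.

```python
import io

PROTOCOL_ID = 0x82  # leading byte expected at the start of every message


def encode_message(payload: bytes) -> bytes:
    """Toy framing (assumption): protocol-id byte, 1-byte length, payload."""
    return bytes([PROTOCOL_ID, len(payload)]) + payload


def read_message(stream: io.BytesIO) -> bytes:
    """Read one framed message, failing if the stream is out of sync."""
    first = stream.read(1)[0]
    if first != PROTOCOL_ID:
        raise ValueError(
            f"Expected protocol id {PROTOCOL_ID:x} but got {first:x}"
        )
    length = stream.read(1)[0]
    return stream.read(length)


def demo_stale_read() -> str:
    # Bytes the server eventually writes: a late reply to call 1,
    # followed by the reply to call 2.
    wire = io.BytesIO(
        encode_message(b"\x01\x00\x00\x07") + encode_message(b"ok")
    )
    # Call 1: the client consumes the header and one payload byte,
    # then times out and abandons the rest of the reply.
    wire.read(3)
    # Call 2 reuses the connection and starts reading mid-reply.
    try:
        read_message(wire)
    except ValueError as err:
        return str(err)
    return "no error"
```

In this model the usual remedy is to discard a connection after any timeout rather than reuse it; whether that matches the eventual fix for this issue is not stated here.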

If the master is unable to collect statistics from a tablet server, it attempts to halt it (as above) and then removes the tablet server's lock in ZooKeeper.

Eventually, under the pressure of random walk operations, the master killed every tablet server.

Guess: a lock in the tablet server is delaying status reporting.

I wrote a script to process the master logs. It saves each line that refers to the IP address of a tablet server. When it sees the zookeeper lock has been deleted, it prints the last N lines that refer to that tablet server.
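
The original script is not attached, so the following is only a minimal sketch of the approach described above. The IPv4 regular expression, the `"lock deleted"` marker text, and the function name `tail_before_lock_loss` are all assumptions; the real master log wording would need to be substituted.

```python
import re
from collections import defaultdict, deque

# Assumption: tablet servers appear in the log as dotted IPv4 addresses.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical marker; replace with the text the master actually logs
# when a tablet server's ZooKeeper lock is deleted.
LOCK_DELETED_MARKER = "lock deleted"


def tail_before_lock_loss(lines, n=20, marker=LOCK_DELETED_MARKER):
    """Return (ip, last n lines mentioning ip) for each lock deletion."""
    history = defaultdict(lambda: deque(maxlen=n))
    reports = []
    for raw in lines:
        line = raw.rstrip("\n")
        # dict.fromkeys de-duplicates while keeping first-seen order.
        for ip in dict.fromkeys(IP_RE.findall(line)):
            history[ip].append(line)
            if marker in line:
                reports.append((ip, list(history[ip])))
    return reports
```

Passing an open log file as `lines` and printing each report gives the per-server context leading up to every lock deletion; a bounded `deque` keeps memory use fixed no matter how long the log is.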

In 7 of the 10 cases, a split timed out prior to or during the status request failures.

In 5 cases, the tablet server was hosting the root tablet (a necessary condition when the last server died).

In 5 cases, the table_table info tablet was being hosted.

People

Assignee: Keith Turner

Reporter: Eric C. Newton

Dates

Created: 19/Jan/12 15:50

Updated: 09/Feb/12 22:24

Resolved: 01/Feb/12 20:42