[ACCUMULO-4777] Root tablet got spammed with 1.8 million log entries - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.8.1
Fix Version/s: 1.7.4, 1.9.0
Component/s: None
Labels:
- pull-request-available

Description

We had a tserver that was handling accumulo.metadata tablets that somehow got into a loop where it created over 22K empty WAL files. There were around 70 metadata tablets, and this resulted in around 1.8 million log entries being added to the accumulo.root table. The only reason it stopped creating WALs was that it ran out of open file handles. This took us many hours and cups of coffee to clean up.

The log contained the following messages in a tight loop:

log.TabletServerLogger INFO : Using next log hdfs://...
tserver.TabletServer INFO : Writing log marker for hdfs://...
tserver.TabletServer INFO : Marking hdfs://... closed
log.DfsLogger INFO : Slow sync cost ...
...

Unfortunately we did not have DEBUG turned on so we have no debug messages.

Tracking through the code, there are three places where the TabletServerLogger.close method is called:
1) Via resetLoggers in the TabletServerLogger, but nothing calls this method, so this is ruled out.
2) When the log gets too large or too old, but neither of those checks should have been hitting here.
3) In a loop (while (!success)) in the TabletServerLogger.write method. In this case, when we fail to write something to the WAL, that log is closed and a new one is created. This loop runs forever until the entry is written successfully. A DfsLogger.LogClosedException seems the most logical reason. This is most likely because a ClosedChannelException was thrown from the DfsLogger.write methods (around line 609 in DfsLogger).

So the root cause was most likely Hadoop related. However, in Accumulo we probably should not be doing a tight retry loop around a Hadoop failure. I recommend, at a minimum, doing some sort of exponential backoff, and perhaps setting a limit on the number of retries, after which the tserver fails critically.
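
The recommendation above (exponential backoff plus a retry limit) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that went into the fix: BoundedRetry, IoAction, RetriesExhaustedException, backoffMillis and callWithBackoff are all hypothetical names, and none of them are part of Accumulo or Hadoop.

```java
import java.io.IOException;

/** Hypothetical helper for illustration only; not part of Accumulo. */
final class BoundedRetry {

    /** An operation that may fail with an I/O error, such as a WAL write. */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Signals that the retry budget is used up; the caller decides how to fail. */
    static final class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private BoundedRetry() {}

    /**
     * Delay to wait after failed attempt number {@code attempt} (0-based):
     * initialMillis doubled once per earlier attempt, capped at maxMillis.
     */
    static long backoffMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        // Stop doubling once the cap is reached so the value cannot grow unbounded.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the action, sleeping with exponential backoff between failures.
     * Makes at most maxRetries + 1 attempts, then throws instead of looping forever.
     */
    static <T> T callWithBackoff(IoAction<T> action, int maxRetries,
                                 long initialMillis, long maxMillis)
            throws RetriesExhaustedException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e; // remember the most recent failure for the final report
            }
            if (attempt < maxRetries) {
                Thread.sleep(backoffMillis(attempt, initialMillis, maxMillis));
            }
        }
        throw new RetriesExhaustedException(
            "giving up after " + (maxRetries + 1) + " attempts", last);
    }
}
```

In a sketch like this, the code that currently loops on while (!success) would wrap the write in callWithBackoff and treat RetriesExhaustedException as the critical tserver failure, so a persistent HDFS problem produces a bounded number of new WALs rather than tens of thousands.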

Issue Links

causes

ACCUMULO-4832 Seeing warnings when write ahead log changes.

Resolved

ACCUMULO-4847 Fix Retry API

Resolved

is related to

ACCUMULO-4780 Add overflow check to sequence number in CommitSession

Resolved

links to

GitHub Pull Request #355

GitHub Pull Request #356

GitHub Pull Request #368

People

Assignee: Ivan Bella

Reporter: Ivan Bella

Votes: 0

Watchers: 4

Dates

Created: 05/Jan/18 16:47

Updated: 23/Apr/19 16:12

Resolved: 01/Feb/18 23:31
