[ACCUMULO-4157] WAL can be prematurely deleted - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.6.5, 1.7.1
Fix Version/s: 1.6.6, 1.7.2
Component/s: gc
Labels:
None

Description

Ran into a situation where, right after a tserver died, the master started logging "Unable to initiate log sort" because the WAL could not be found. The WAL was for a tablet in the metadata table, with a key extent like !0;t;endrow;prevrow, hosted by a tablet server like tserver1. The master must sort the log before the tablet can be reassigned and brought back online, so when the WAL is missing Accumulo is in a really bad state: that tablet stays unhosted until someone intervenes manually. Luckily, the WAL was in the HDFS trash and could be copied back to the expected location.

Piecing together the logs showed something like the following:

13:30:36 tserver1 minor compacted for that extent MinC finish lock 0.00 secs !0;t;endrow;prevrow, maybe removing the log entry from the metadata table
13:37:01 tserver1 data written, Adding 1 logs for extent !0;t;endrow;prevrow, maybe adding back that log entry to the metadata table
13:38:58 master thinks the server went down, Lost servers tserver1:9997, but it is really still up for another 8 mins
13:38:58 master identifies WAL to be recovered Recovering hdfs://accumulo/wal/tserver1+9997/UUID
13:38:58 master Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
13:38:58 master Started recovery of hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a reference
13:39:13 master Waiting for file to be closed hdfs://accumulo/wal/tserver1+9997/UUID
13:39:16 gc cleaning up WALs doesn't see a reference and thinks the server is down, so it deletes the WAL and logs Removing WAL for offline server hdfs://accumulo/wal/tserver1+9997/UUID
13:39:18 master gets FileNotFoundException for that WAL, Unable to initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID . Accumulo is in a bad state from here until manual intervention.
13:44:16 tserver1 more data written, Adding 1 logs for extent !0;t;endrow;prevrow
13:45:45 tserver1 finally dies after logging Lost table server lock (reason = SESSION_EXPIRED)

This suggests that GarbageCollectWriteAheadLogs is too aggressive in removing WALs for a server it thinks is dead but that may actually still be doing work. The tserver was under heavy load before it went down.

Studying the logs with kturner and brainstorming, here are some things that could be fixed or checked:

When gc doesn't see a reference to a WAL in the metadata table, it asks the tablet server to delete the log. The gc process then logs at DEBUG that the WAL was deleted, regardless of whether it actually was. Maybe change the message to "asking tserver to delete WAL" or something similar. We found these messages in the gc log 45 minutes before this event. They were misleading: further investigation showed that the tserver logs Deleting wal when a WAL is truly deleted, and there was no such message in the tserver log 45 minutes earlier, indicating the WAL was not actually deleted.
GC logs "Removing WAL for offline" at DEBUG. These can roll off pretty quickly, so change that to INFO. This will help keep history around longer to aid troubleshooting.
Verify that the "Adding 1 logs for extent" update uses the srv:lock column to enforce the constraint. It looks like it does, but if zooLock is null in the update in MetadataTableUtil, maybe badness is happening.
In GC, maybe keep a map of first time we see a tablet server is down and don't actually remove the WAL for offline tablet servers until they have been down an hour or something. Would need to make sure that map is cleared when a tserver comes back online.
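
The grace-period idea in the last item could be sketched roughly as follows. This is only an illustration of the proposal, not the fix that was committed: the class DeadServerGracePeriod and its methods are hypothetical names that do not exist in Accumulo, and the one-hour figure is just the value floated above. Time is passed in explicitly so the behavior is easy to check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not an Accumulo class): delays WAL removal for servers that look dead. */
class DeadServerGracePeriod {
  private final long graceMillis;
  // server address -> time (ms) at which the GC first saw that server as down
  private final Map<String, Long> firstSeenDown = new HashMap<>();

  DeadServerGracePeriod(long graceMillis) {
    this.graceMillis = graceMillis;
  }

  /** Forget servers that are live again, so a later outage restarts the clock. */
  void markLive(Set<String> liveServers) {
    firstSeenDown.keySet().removeAll(liveServers);
  }

  /**
   * Returns true only if the server has been continuously seen as down for at
   * least the grace period. The first sighting only records the time and
   * always returns false.
   */
  boolean mayRemoveWals(String server, long nowMillis) {
    Long first = firstSeenDown.putIfAbsent(server, nowMillis);
    return first != null && nowMillis - first >= graceMillis;
  }
}
```

In a GC cycle, one would presumably call markLive with the current set of live tablet servers first, then consult mayRemoveWals before removing any WAL for an offline server. With a one-hour grace period, the WAL in the timeline above would have survived the eight minutes during which tserver1 was still writing.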

Issue Links

breaks

ACCUMULO-4428 GC does not delete WAL files belonging to dead tservers

Resolved

relates to

ACCUMULO-3772 WAL prematurely deleted for root table

Resolved

ACCUMULO-4333 Ensure WAL are not prematurely deleted in 1.8

Resolved

links to

107.patch

GitHub Pull Request #107

People

Assignee: Michael Wall

Reporter: Michael Wall

Dates

Created: 03/Mar/16 13:19

Updated: 23/Feb/17 13:31

Resolved: 08/Jun/16 13:33
