[ACCUMULO-3249] New replication status message created for file that was already replicated - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.7.0
Component/s: replication
Labels: None

Description

Noticed a failure in UnorderedWorkAssignerReplicationIT.dataWasReplicatedToThePeerWithoutDrain: the test timed out because a file that we expected to be replicated never was.

Digging into it:

File was queued for replication before the original tserver died.
New tserver picked up the file to be replicated before recovery fully completed.
Tserver completed replication to the peer before recovery fully completed (recovery for metadata/replication succeeded, but not for all tables).
Master cleaned up replication records because it saw that the tserver recorded that replication was completed.
When recovery finally completed, it wrote an empty closed marker back into the metadata table (a precaution to make sure that we know when a WAL is no longer referenced).

As such, we had an entry for a file that we thought needed replication but that had already been replicated. That's issue #1.
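
To make the ordering problem concrete, here is a minimal sketch in Java that replays the sequence above against a toy in-memory map. It is an illustration only: `ReplicationRaceSketch`, `Status`, `replay`, and `looksPending` are hypothetical names invented for this example and are not Accumulo classes or APIs, and the map merely stands in for the replication records kept in the metadata table.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these names are Accumulo classes or APIs.
final class ReplicationRaceSketch {

    // Hypothetical stand-in for the state of a replication record on a WAL.
    enum Status { QUEUED, REPLICATED, EMPTY_CLOSED }

    // Replays the observed ordering and returns the records left behind.
    static Map<String, Status> replay(String wal) {
        // Stand-in for the replication records, keyed by WAL name.
        Map<String, Status> records = new TreeMap<>();

        // File is queued for replication before the original tserver dies.
        records.put(wal, Status.QUEUED);

        // New tserver replicates the file to the peer before recovery
        // has completed for all tables.
        records.put(wal, Status.REPLICATED);

        // Master sees that replication completed and cleans up the record.
        if (records.get(wal) == Status.REPLICATED) {
            records.remove(wal);
        }

        // Recovery finally completes and writes an empty closed marker,
        // unaware that the record was already replicated and removed.
        records.put(wal, Status.EMPTY_CLOSED);

        return records;
    }

    // In this toy model, any record that is not marked replicated looks
    // like work that still needs to be done.
    static boolean looksPending(Map<String, Status> records, String wal) {
        Status status = records.get(wal);
        return status != null && status != Status.REPLICATED;
    }
}
```

Calling `ReplicationRaceSketch.replay("wal-1")` leaves a single `EMPTY_CLOSED` record for `wal-1`, so `looksPending` returns true for a file that was in fact already replicated, which is the situation described as issue #1.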

For some reason not yet understood, this also put the master into a state where it believed the WAL needed to be replicated but could not assign it for replication (I believe the master thought it was already assigned for replication). The file was therefore stuck in a "pending-replication" phase and never proceeded. Eventually the test timed out and failed.

People

Assignee: Josh Elser

Reporter: Josh Elser

Votes: 0

Watchers: 0

Dates

Created: 21/Oct/14 20:40

Updated: 23/Oct/14 18:59

Resolved: 23/Oct/14 18:59

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 50m