Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-3320

Replication reference prematurely closed and removed.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.7.0
    • replication
    • None

    Description

      Saw the following situation in MultiInstanceReplicationIT across GC, tserver and master. 03b6dad5-56c9-4f35-8daf-a444f3252038 appeared to never get cleaned up:

      2014-11-08 12:05:34,898 [tserver.Tablet] DEBUG: Logs to be destroyed: 2<< tserver:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
      2014-11-08 12:05:34,904 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.47.246:52263 !0 0 entries in 0.03 secs, nbTimes = [25 25 25.00 1] 
      2014-11-08 12:05:34,914 [tserver.Tablet] DEBUG: Recording that data has been ingested into 2<< using [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038, file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2]
      2014-11-08 12:05:34,914 [util.ReplicationTableUtil] DEBUG: Updating replication status for 2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038, file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2] using [begin: 0 end: 0 infiniteEnd: true closed: false]
      
      2014-11-08 12:05:35,008 [replication.CloseWriteAheadLogReferences] INFO : Found 1 WALs referenced in metadata in 100.3 ms
      2014-11-08 12:05:35,127 [replication.CloseWriteAheadLogReferences] DEBUG: Closing unreferenced WAL (~replfile:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038 stat:2 [] 5 false) in metadata table
      2014-11-08 12:05:35,221 [replication.CloseWriteAheadLogReferences] INFO : Closed 1 WAL replication references in replication table in 206.2 ms
      
      2014-11-08 12:05:36,505 [tserver.Tablet] DEBUG: Logs to be destroyed: !0<;~ ip-172-31-47-246:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
      
      2014-11-08 12:05:36,517 [replication.StatusMaker] DEBUG: Creating replication status record for file:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038 on table 2 with [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1415448333352].
      
      2014-11-08 12:06:27,097 [util.ReplicationTableUtil] DEBUG: Updating replication status for 2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038] using [begin: 0 end: 0 infiniteEnd: true closed: false]
      

      This is problematic due to cross-process interaction. The GC happened to run just after the tserver performed a compaction and removed the log reference from the tablet. Thus, at a very small point in time, the WAL was not referenced by any tablets in the metadata table, and, as such, the GC "closed" the WAL replication reference.

      The master saw that it was closed, cleaned up the reference and started replication. The tserver continued to use the WAL (as it does) and placed some more updates into metadata.

      The other problem is that when the master saw that replication of the file was completed, it removed the references from the replication table (as expected). However, when it went to the next round of StatusMaker, it made a new record which had lost the fact that the old file had been fully replicated.

      Attachments

        Activity

          People

            elserj Josh Elser
            elserj Josh Elser
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 50m
                50m