Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-21505

Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.0-alpha-1, 2.2.0
    • 3.0.0-alpha-1, 2.3.0
    • Replication
    • None
    • Hide
      This modifies "status 'replication'" output, fixing inconsistencies on the reporting times and ages of last shipped edits, as well as wrong calculation of replication lags.

      It also introduces additional info for each recovery queue, which was not accounted by this command before.

      The new output for "status 'replication'" command is explained in details below:
      a) Source started, target stopped, no edits arrived on source yet:
      ...
       SOURCE: PeerID=1
               Normal Queue: 1
                 No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0
      ...
      b) Source started, target stopped, add edit on source:
      ...
      Normal Queue: 1
                 No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:21:00 GMT 2018, Replication Lag=2459
      ...
      c) Source started, target stopped, edit added on source, restart source:
      ...
      SOURCE: PeerID=1
               Normal Queue: 1
                 No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0
               Recovered Queue: 1-hbase01.home,16020,1542784524057
                 No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:23:00 GMT 2018, Replication Lag=201495
      ...
      d) Source started, target stopped, add edit on source, restart source, add another edit on source:
      ...
      SOURCE: PeerID=1
               Normal Queue: 1
                 No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:02:28 GMT 2018, Replication Lag=6349
               Recovered Queue: 1-hbase01.home,16020,1542782758742
                 No Ops shipped since last restart, SizeOfLogQueue=0, TimeStampOfLastArrivedInSource=Wed Nov 21 06:53:05 GMT 2018, Replication Lag=569394
      ...
      e) Source started, target stopped, add edit on source, restart source, add another edit on source, start target:
      ...
             SOURCE: PeerID=1
               Normal Queue: 1
                 AgeOfLastShippedOp=30000, TimeStampOfLastShippedOp=Wed Nov 21 07:07:58 GMT 2018, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:02:28 GMT 2018, Replication Lag=0
      ...
      f) Source started, target stopped, add edit on source, restart source, restart target:
      ...
      SOURCE: PeerID=1
               Normal Queue: 1
                 No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0
      ...
      Show
      This modifies "status 'replication'" output, fixing inconsistencies on the reporting times and ages of last shipped edits, as well as wrong calculation of replication lags. It also introduces additional info for each recovery queue, which was not accounted by this command before. The new output for "status 'replication'" command is explained in details below: a) Source started, target stopped, no edits arrived on source yet: ...  SOURCE: PeerID=1          Normal Queue: 1            No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0 ... b) Source started, target stopped, add edit on source: ... Normal Queue: 1            No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:21:00 GMT 2018, Replication Lag=2459 ... c) Source started, target stopped, edit added on source, restart source: ... SOURCE: PeerID=1          Normal Queue: 1            No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0          Recovered Queue: 1-hbase01.home,16020,1542784524057            No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:23:00 GMT 2018, Replication Lag=201495 ... d) Source started, target stopped, add edit on source, restart source, add another edit on source: ... SOURCE: PeerID=1          Normal Queue: 1            No Ops shipped since last restart, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:02:28 GMT 2018, Replication Lag=6349          Recovered Queue: 1-hbase01.home,16020,1542782758742            No Ops shipped since last restart, SizeOfLogQueue=0, TimeStampOfLastArrivedInSource=Wed Nov 21 06:53:05 GMT 2018, Replication Lag=569394 ... e) Source started, target stopped, add edit on source, restart source, add another edit on source, start target: ...        SOURCE: PeerID=1          Normal Queue: 1            AgeOfLastShippedOp=30000, TimeStampOfLastShippedOp=Wed Nov 21 07:07:58 GMT 2018, SizeOfLogQueue=1, TimeStampOfLastArrivedInSource=Wed Nov 21 07:02:28 GMT 2018, Replication Lag=0 ... f) Source started, target stopped, add edit on source, restart source, restart target: ... SOURCE: PeerID=1          Normal Queue: 1            No Ops shipped since last restart, SizeOfLogQueue=1, No edits for this source since it started, Replication Lag=0 ...

    Description

      While reviewing hbase shell status 'replication' command, noticed the following issues related to replication source section:

      1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when no new edits were added to source, so nothing was really shipped. Test steps performed:
      1.1) Source cluster with only one table targeted to replication;
      1.2) Added a new row, confirmed the row appeared in Target cluster;
      1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp shows current timestamp T1.
      1.4) Waited 30 seconds, no new data added to source. Issued status 'replication' command, now shows timestamp T2.

      2) When replication is stuck due some connectivity issues or target unavailability, if new edits are added in source, reported AgeOfLastShippedOp is wrongly showing same value as "Replication Lag". This is incorrect, AgeOfLastShippedOp should not change until there's indeed another edit shipped to target. Test steps performed:
      2.1) Source cluster with only one table targeted to replication;
      2.2) Stopped target cluster RS;
      2.3) Put a new row on source. Running status 'replication' command does show lag increasing. TimeStampsOfLastShippedOp seems correct also, no further updates as described on bullet #1 above.
      2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even though there's no new edit shipped to target:

      ...
       SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
      ...
      
      ...
      SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
      ...
      

      3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some time before it got finally shipped to target. Test steps performed:
      3.1) Source cluster with only one table targeted to replication;
      3.2) Stopped target cluster RS;
      3.3) Put a new row on source.
      3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even though there's no new edit shipped to target:

      T1:
      ...
       SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
      ...
      T2:
      ...
      SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
      ...
      

      3.5) Restart target cluster RS and verified the new row appeared there. No new edit added, but status 'replication' command reports AgeOfLastShippedOp as 0, while it should be the diff between the time it concluded shipping at target and the time it was added in source:

      SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
      

      4) When replication is stuck due some connectivity issues or target unavailability, if RS is restarted, once recovered queue source is started, TimeStampsOfLastShippedOp is set to initial java date (Thu Jan 01 01:00:00 GMT 1970, for example), thus "Replication Lag" also gives a complete inaccurate value.
      Tests performed:
      4.1) Source cluster with only one table targeted to replication;
      4.2) Stopped target cluster RS;
      4.3) Put a new row on source, restart RS on source, waited a few seconds for recovery queue source to startup, then it gives:

      SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication Lag=9223372036854775807
      

      Also, we should report status to all sources running, current output format gives the impression there’s only one, even when there are recovery queues, for instance.

      Here is a list of ideas on how the command should report under different states of replication:
      a) Source started, target stopped, no edits arrived on source yet:
      Status replication should not show any lags, no edits shipped, no edits arrived;

      b) Source started, target stopped, add edit on source:
      Status replication should report following info -> lag, time of edit arrival on source, additional message saying no edits had been shipped to target;

      c) Source started, target stopped, edit added on source, restart source:
      Status replication should list two sources, one normal, other recovered. Normal source should show no lags, no edits shipped, no edits arrived. Recovered should show no edits shipped, but should have edits arrived in source and lag > 0;

      d) Source started, target stopped, add edit on source, restart source, add another edit on source:
      Status replication should list two sources, one normal, other recovered. Both sources should show no edits shipped, but should have edits arrived in source and lag > 0;

      e) Source started, target stopped, add edit on source, restart source, add another edit on source, start target:
      Status replication should list normal source only (after some short period), with proper times for last shipped, last arrived in source and no replication lag.

      f) Source started, target stopped, add edit on source, restart source, restart target:
      Status replication should list normal source only, with no shipped, nor arrived edits, and lag should be 0;

      Attachments

        1. HBASE-21505-master.011.patch
          72 kB
          Wellington Chevreuil
        2. HBASE-21505-master.011.patch
          72 kB
          Jingyun Tian
        3. HBASE-21505-master.010.patch
          73 kB
          Wellington Chevreuil
        4. HBASE-21505-master.009.patch
          72 kB
          Wellington Chevreuil
        5. HBASE-21505-master.008.patch
          71 kB
          Wellington Chevreuil
        6. HBASE-21505-master.007.patch
          69 kB
          Wellington Chevreuil
        7. HBASE-21505-master.006.patch
          66 kB
          Wellington Chevreuil
        8. HBASE-21505-master.005.patch
          66 kB
          Wellington Chevreuil
        9. HBASE-21505-master.004.patch
          65 kB
          Wellington Chevreuil
        10. HBASE-21505-master.003.patch
          65 kB
          Wellington Chevreuil
        11. HBASE-21505-master.002.patch
          66 kB
          Wellington Chevreuil
        12. HBASE-21505-master.001.patch
          67 kB
          Wellington Chevreuil
        13. HBASE-21505-branch-2.002.patch
          73 kB
          Wellington Chevreuil
        14. HBASE-21505-branch-2.002.patch
          73 kB
          Jingyun Tian
        15. HBASE-21505-branch-2.001.patch
          73 kB
          Wellington Chevreuil
        16. HBASE-21505-branch-2.001.patch
          73 kB
          Jingyun Tian
        17. 0001-HBASE-21505-initial-version-for-more-detailed-report.patch
          41 kB
          Wellington Chevreuil

        Issue Links

          Activity

            People

              wchevreuil Wellington Chevreuil
              wchevreuil Wellington Chevreuil
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: