Qpid Dispatch · DISPATCH-1256

Throughput gets worse as more routers are added

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Creating a chain of routers gives worse throughput as the chain gets longer.

      Reproducer:

      Install quiver from here: https://github.com/alanconway/quiver (extended with --send-url)

      Run the attached script like this:

      qdr-chain.sh quiver --arrow c --count 1000 --send-url //:10001/testme //:10004/testme

      This sets up a chain of 4 routers, each connected to the previous one, and waits for qdstat -nv at router 1 to show 4 routers in the topology. It then receives messages from address 'testme' at router 4 while sending to address 'testme' at router 1.
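The real setup lives in the attached qdr-chain.sh. As a rough illustration only (not the attached script), configs for such a chain could be generated like this; the port layout follows the quiver command above (client listeners 10001..1000N), while the inter-router ports (20001..) are assumptions:

```python
# Illustrative sketch, NOT the attached qdr-chain.sh. Client listener
# ports match the quiver command above; inter-router ports (20000+n)
# are made up for this example.

def chain_config(n, n_routers):
    """Return a qdrouterd config for router `n` in a chain of `n_routers`."""
    parts = [
        "router {\n    mode: interior\n    id: Router.%d\n}" % n,
        # Listener for clients (quiver attaches here).
        "listener {\n    host: 0.0.0.0\n    port: %d\n    role: normal\n}" % (10000 + n),
    ]
    if n < n_routers:
        # Accept the inter-router connection from the next router in the chain.
        parts.append(
            "listener {\n    host: 0.0.0.0\n    port: %d\n    role: inter-router\n}"
            % (20000 + n))
    if n > 1:
        # Connect to the previous router in the chain.
        parts.append(
            "connector {\n    host: localhost\n    port: %d\n    role: inter-router\n}"
            % (20000 + n - 1))
    return "\n".join(parts)

if __name__ == "__main__":
    for n in range(1, 5):
        with open("qdrouterd-%d.conf" % n, "w") as f:
            f.write(chain_config(n, 4))
```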

      This runs the routers with enough tracing to run the dispatch scraper tool, which may affect the results. However, 1-million-message runs without tracing show a drop-off in throughput going from 1m0 (no router) to 1m4 (4-router chain): the more routers, the worse it gets. This seems consistent with the credit delays seen in the traced case.

      run   routers   end-to-end rate    50% latency   99.9% latency
      1m0   0         296,824 msg/s      2 ms           6 ms
      1m1   1         133,851 msg/s      1 ms           3 ms
      1m2   2          59,011 msg/s      2 ms           8 ms
      1m3   3          42,146 msg/s      3 ms          11 ms
      1m4   4          32,415 msg/s      4 ms          13 ms

      The output (config, logs and scrapes) from a 1000-message, 4-router tracing run is attached, along with the script to generate new data. Runs of 10k messages are feasible, but the HTML gets large and slow to view; in my experiments the results are similar.

      From file:scrape.html#c_connections, the sender's connection is starved of credit far more than any other connection: 20x more starvation instances, 40x longer total starvation, and 100x longer wait for initial credit.

      Overall the flow of credit is slower than the flow of messages, and each additional router is reducing the throughput of credits, which reduces message throughput.

      Observations:

      • Initial credit delay is excessive (100ms in this case, versus a worst-case message latency of 33ms); something is wrong there.
      • Inter-router link credit is small (250) compared to application credit (1000). Giving full credit to a sender creates a credit-sharing problem,
        but we could give full credit to the inter-router links.
      • Re-evaluate our credit-sharing strategy to perform better in the common unshared case.
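A toy model (an assumption about the mechanism, not measured router behavior) shows why per-hop credit round trips would cap throughput the way the table above suggests: if each hop adds a fixed one-way delay d to both messages going forward and flow frames returning, a credit window W limits the sender to roughly W messages per credit round trip:

```python
# Toy model: sustainable rate ~= window / credit_RTT, where the RTT
# grows linearly with the number of hops. The hop delay (0.1 ms) and
# the linear-delay assumption are illustrative only.

def max_rate(window, hops, hop_delay_s):
    """Upper bound on msgs/s when credit must round-trip the whole chain."""
    rtt = 2.0 * (hops + 1) * hop_delay_s
    return window / rtt

if __name__ == "__main__":
    # window=250 is the inter-router link credit noted above.
    for hops in range(5):
        print(hops, round(max_rate(250, hops, 0.0001)))
```

The model predicts rate falling as 1/(hops+1), which is the same qualitative shape as the measured 1m0..1m4 drop-off, though the real numbers also reflect forwarding cost per router.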

      Extracts from scrape:

      ◊ 2019-01-23 13:55:16.524477 D0#141 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (0,1000)
      ◊ 2019-01-23 13:55:16.633700 A0#287 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (0,250)
      Initial delay 109ms, much worse than worst-case message latency of 33ms

      ◊ 2019-01-23 13:55:16.639758 D0#153 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (3,1000)
      ◊ 2019-01-23 13:55:16.642682 A0#543 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (168,85)
      ◊ 2019-01-23 13:55:16.645205 D0#163 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (5,1000)
      ◊ 2019-01-23 13:55:16.645448 D0#167 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (6,1000)
      ◊ 2019-01-23 13:55:16.646510 A0#713 1 [A0_6] <- [] quiver-sender-3db6ad3f transfer [0,0] (252) transfer_281 - 158 bytes no more credit (accepted settled 0.028323 S)
      ◊ 2019-01-23 13:55:16.647142 D0#193 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (7,1000)
      ◊ 2019-01-23 13:55:16.647198 D0#195 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (10,1000)
      ◊ 2019-01-23 13:55:16.647915 D0#214 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (16,1000)
      ◊ 2019-01-23 13:55:16.648060 D0#217 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (25,1000)
      ◊ 2019-01-23 13:55:16.651294 D0#239 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (28,1000)
      ◊ 2019-01-23 13:55:16.652235 D0#257 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (41,1000)
      ◊ 2019-01-23 13:55:16.654294 A0#802 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (253,9) credit restored

      Here we see that the receiver issued 9 credit updates in the time the front router issued only 2. Either the router is slow to forward messages, or it is not returning credit promptly when it does (or both). See flow.txt for the complete history of flow events at the sender and receiver, and scrape.html for the complete history of everything.
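For anyone repeating this count on a larger run, a quick tally of flow frames per link endpoint can be pulled from text shaped like the extract above (the field layout is assumed from those lines, not from the scraper's documented output format):

```python
# Hedged sketch: count flow frames per endpoint from scrape-style lines
# like the extract above. The regex assumes the endpoint name is the
# token immediately before the word "flow".
import re

FLOW_RE = re.compile(
    r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+) .*? (\S+) flow \[\d+,\d+\] \((\d+),(\d+)\)"
)

def tally_flows(lines):
    """Map endpoint name -> number of flow frames seen (transfers ignored)."""
    counts = {}
    for line in lines:
        m = FLOW_RE.search(line)
        if m:
            # m.group(1) is the timestamp, usable for delay measurements;
            # groups 3 and 4 are the (delivery-count, link-credit) pair.
            name = m.group(2)
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Run over the full flow.txt, this gives the per-endpoint counts behind the 9-vs-2 observation above.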

        Attachments

        1. reproducer.tgz
          7.36 MB
          Alan Conway


            People

            • Assignee: Ted Ross (tedross)
            • Reporter: Alan Conway (aconway)
            • Votes: 0
            • Watchers: 3
