Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Distributed Exec

    Description

      In large-scale experiments on a 6-rack cluster with an aggregator-core network topology, overall cluster bandwidth utilization was limited.

      In aggregator-core networks, nodes and racks are not equidistant, so a broadcast operation can be inefficient: for a remote rack with N nodes, the broadcasting node has to send the same data N times across the core, once per node.

      Ideally, row batches should be sent once per remote rack; a node on each remote rack would then broadcast them within its own rack.
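
      The two-level scheme described above can be sketched as follows. This is only an illustration of the idea, not Impala code: the function name plan_rack_aware_broadcast, the node names, and the choice of the first node on each rack as the relay are all assumptions made for the example.

```python
from collections import defaultdict


def plan_rack_aware_broadcast(sender, node_to_rack):
    """Plan a two-level broadcast (hypothetical helper, not an Impala API).

    Returns (direct, relayed):
      direct  - nodes the sender transmits to itself
      relayed - maps each relay node to the rack-local nodes it forwards to
    """
    # Group nodes by rack in a deterministic order.
    racks = defaultdict(list)
    for node, rack in sorted(node_to_rack.items()):
        racks[rack].append(node)

    sender_rack = node_to_rack[sender]
    direct, relayed = [], {}
    for rack, nodes in racks.items():
        if rack == sender_rack:
            # Same rack: send directly, no core traffic involved.
            direct.extend(n for n in nodes if n != sender)
        else:
            # Remote rack: one cross-rack send to a relay (assumed to be
            # the first node), which re-broadcasts inside its rack.
            relay, rest = nodes[0], nodes[1:]
            direct.append(relay)
            relayed[relay] = rest
    return direct, relayed


def cross_rack_sends(sender, node_to_rack):
    """Compare cross-rack transfers: flat broadcast vs. rack-aware plan."""
    sender_rack = node_to_rack[sender]
    flat = sum(1 for rack in node_to_rack.values() if rack != sender_rack)
    rack_aware = len(set(node_to_rack.values()) - {sender_rack})
    return flat, rack_aware


# Hypothetical 3-rack, 9-node layout; rack names borrowed from the table.
topology = {
    "a1": "va", "a2": "va", "a3": "va",
    "c1": "vc", "c2": "vc", "c3": "vc",
    "e1": "ve", "e2": "ve", "e3": "ve",
}
direct, relayed = plan_rack_aware_broadcast("a1", topology)
flat, rack_aware = cross_rack_sends("a1", topology)
print(direct, relayed, flat, rack_aware)
```

      In this toy layout the flat broadcast crosses the core 6 times, while the rack-aware plan crosses it twice, once per remote rack. How a relay is chosen and how failures are handled is left open here.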

      The table below shows rack-to-rack latency for 90% of operations (90th percentile); the ratio between the best and worst case is 7.3x.

               va      vc     vd1     vd3      ve
      va    4,238   4,290   9,692   8,897   8,208
      vc    9,290   4,396  30,952  13,529  14,578
      vd1   9,131  29,066   4,346  17,265  16,849
      vd3   7,409  15,517  17,265   4,370   4,687
      ve    4,914  16,894  16,430   4,713   4,472
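
      As a check on the 7.3x figure, the best and worst cells can be pulled out of the table directly. The values below are copied from the table; the latency unit is not stated in the issue, and the variable and function names are made up for this sketch.

```python
# Rack labels and latency values copied from the table above
# (row label first, column label second; unit not stated in the issue).
RACKS = ["va", "vc", "vd1", "vd3", "ve"]
LATENCY = [
    [4238, 4290, 9692, 8897, 8208],
    [9290, 4396, 30952, 13529, 14578],
    [9131, 29066, 4346, 17265, 16849],
    [7409, 15517, 17265, 4370, 4687],
    [4914, 16894, 16430, 4713, 4472],
]


def latency_spread(racks, matrix):
    """Return the best cell, the worst cell, and the worst/best ratio."""
    cells = [
        (matrix[i][j], racks[i], racks[j])
        for i in range(len(racks))
        for j in range(len(racks))
    ]
    best, worst = min(cells), max(cells)
    return best, worst, worst[0] / best[0]


best, worst, ratio = latency_spread(RACKS, LATENCY)
print(best, worst, round(ratio, 1))
```

      The best case is the va/va cell (4,238) and the worst is the vc/vd1 cell (30,952), which gives the ratio of about 7.3 quoted above.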

            People

              Assignee: Unassigned
              Reporter: Mostafa Mokhtar (mmokhtar)
              Votes: 1
              Watchers: 11
