  1. Ignite
  2. IGNITE-11871

[ML] IP resolver in TensorFlow cluster manager doesn't work properly

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.7, 2.8
    • Fix Version/s: None
    • Component/s: ml
    • Labels: None

    Description

      The TensorFlow cluster manager has to resolve each NodeId into an IP address or hostname so that it can pass that address or name to the TensorFlow worker. Currently it uses a "return first" strategy and returns the first available address or name. As a result, when a server has more than one network interface, the resolver may work incorrectly and return different addresses or names for the same server.

      To fix this problem, TensorFlowServerAddressSpec needs to be updated so that it always returns the same address or name for a given server. If a server has multiple network interfaces, we need to find a "GCD": a network that contains all Ignite nodes.
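
      The "GCD" idea above can be sketched as follows. This is only an illustration under stated assumptions, not the actual TensorFlowServerAddressSpec code: the class name CommonSubnetResolver, its methods, the string node IDs, and the single fixed prefix length are all hypothetical, and only IPv4 literals are handled. It intersects the networks seen on every node, picks one shared network deterministically, and then returns each node's lowest address inside that network, so the result does not depend on the order in which addresses are listed.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: picks one stable address per node from a network shared by all nodes. */
final class CommonSubnetResolver {
    /** Parses a dotted IPv4 literal such as "192.168.1.10" into a 32-bit value. */
    static int toInt(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4)
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        int res = 0;
        for (String p : parts) {
            int octet = Integer.parseInt(p);
            if (octet < 0 || octet > 255)
                throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
            res = (res << 8) | octet;
        }
        return res;
    }

    /** Returns the network part of the address for the given prefix length (0..32). */
    static int network(String ipv4, int prefixLen) {
        if (prefixLen < 0 || prefixLen > 32)
            throw new IllegalArgumentException("Bad prefix length: " + prefixLen);
        // A shift by 32 is a no-op in Java, so prefix length 0 is handled separately.
        int mask = prefixLen == 0 ? 0 : -1 << (32 - prefixLen);
        return toInt(ipv4) & mask;
    }

    /** Loopback (127.0.0.0/8) exists on every host, so it must not count as a shared network. */
    static boolean isLoopback(String ipv4) {
        return (toInt(ipv4) >>> 24) == 127;
    }

    /**
     * Maps every node ID to one address taken from a network that all nodes share.
     * Assumption: every interface uses the same prefix length.
     */
    static Map<String, String> resolve(Map<String, List<String>> addrsByNode, int prefixLen) {
        Set<Integer> common = null;
        for (List<String> addrs : addrsByNode.values()) {
            Set<Integer> nets = new TreeSet<>();
            for (String a : addrs)
                if (!isLoopback(a))
                    nets.add(network(a, prefixLen));
            if (common == null)
                common = nets;
            else
                common.retainAll(nets); // Keep only networks seen on every node so far.
        }
        if (common == null || common.isEmpty())
            throw new IllegalStateException("No network is shared by all nodes");

        // Several networks may be shared; take the numerically lowest one so the choice is stable.
        int chosen = Collections.min(common, Integer::compareUnsigned);

        Map<String, String> res = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : addrsByNode.entrySet()) {
            String best = null;
            for (String a : e.getValue()) {
                if (isLoopback(a) || network(a, prefixLen) != chosen)
                    continue;
                // Lowest address wins, independent of the order of the input list.
                if (best == null || Integer.compareUnsigned(toInt(a), toInt(best)) < 0)
                    best = a;
            }
            res.put(e.getKey(), best);
        }
        return res;
    }
}
```

      For example, with a /24 prefix and three nodes whose addresses are {10.0.0.5, 192.168.1.10}, {172.16.0.3, 192.168.1.11} and {10.0.0.7, 192.168.1.12}, the only network shared by all of them is 192.168.1.0/24, so each node resolves to its 192.168.1.x address. A real fix would more likely read per-interface prefix lengths (for instance via java.net.InterfaceAddress#getNetworkPrefixLength) rather than assume one fixed value.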

          People

            Assignee: zaleslaw Alexey Zinoviev
            Reporter: zaleslaw Alexey Zinoviev