Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      In our Hadoop cluster, a reducer of a big job can use all of a machine's bandwidth during the shuffle phase. Any task reading data from the machine that runs that reducer then becomes extremely slow.
      It would be better to move DataTransferThrottler from hadoop-hdfs to hadoop-common, and to create a throttler for Shuffle that throttles each Fetcher. A sketch of what that could look like follows.
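      For illustration, a minimal sketch of a throttled fetch copy loop, assuming DataTransferThrottler has been made visible to MapReduce. Only the DataTransferThrottler constructor and throttle(long) are the existing HDFS API; the class name, the loop, and the 10 MB/s cap are hypothetical:

          import java.io.IOException;
          import java.io.InputStream;
          import java.io.OutputStream;

          import org.apache.hadoop.hdfs.util.DataTransferThrottler;

          public class ThrottledFetchCopy {
            // Hypothetical per-fetcher cap; not an existing configuration value.
            private static final long BYTES_PER_SEC = 10L * 1024 * 1024;

            public static void copy(InputStream in, OutputStream out) throws IOException {
              DataTransferThrottler throttler = new DataTransferThrottler(BYTES_PER_SEC);
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
                // Blocks this thread until it is back under its bandwidth budget.
                throttler.throttle(n);
              }
            }
          }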

        Activity

        Liyin Liang created issue -
        Sandy Ryza added a comment -

        Would the throttling go on the server (NodeManager) side or the client (reducer) side?

        Liyin Liang added a comment -

        Attached a ganglia network metrics picture. The reducer utilizes all of the input bandwidth, so the throttling should go on the reducer side.

        Liyin Liang made changes -
        Field         Original Value   New Value
        Attachment                     ganglia-slave.jpg [ 12620122 ]
        Sandy Ryza added a comment -

        In that case, adding network IO as a YARN resource and limiting it using cgroups might be a way to solve this problem as well.

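        As a rough sketch of that mechanism (a generic net_cls cgroup shaped by tc; the cgroup path, device, class id, and 100mbit rate are examples, and none of this is an existing YARN feature):

            # Tag traffic from a container's cgroup with a class id.
            mkdir /sys/fs/cgroup/net_cls/container_01
            echo 0x00100001 > /sys/fs/cgroup/net_cls/container_01/net_cls.classid

            # Shape that class to a fixed rate on the NodeManager host.
            tc qdisc add dev eth0 root handle 10: htb
            tc class add dev eth0 parent 10: classid 10:1 htb rate 100mbit
            tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup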
        Gera Shegalov added a comment -

        Liyin Liang As short-term relief, is it possible for you to reduce mapreduce.reduce.shuffle.parallelcopies for this job?

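        For reference, such a per-job override can be passed at submission time through the standard -D generic option (the jar and class names below are placeholders; 2 is an arbitrary value under the default of 5):

            hadoop jar myjob.jar MyJob -D mapreduce.reduce.shuffle.parallelcopies=2 <args>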
        Liyin Liang added a comment -

        Sandy Ryza As long-term work, limiting network IO using cgroups is a better way to solve this problem.
        Gera Shegalov Our cluster users run thousands of jobs every day. It's difficult for them to set parameters for a specific job.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Liyin Liang
          • Votes:
            0
          • Watchers:
            5

            Dates

            • Created:
            • Updated:
