Traffic Server / TS-1222

A single TCP connection limits cluster throughput

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.3
    • Fix Version/s: 3.1.4
    • Component/s: Clustering
    • Labels:
      None

      Description

      Kuotai is working around the limit that a single TCP connection places on cluster throughput. We hope to fix it before 3.2 is out; for now it is targeted at v3.3.

      For more detail, see the Clustering page under Projects on the cwiki.

        Activity

        Zhao Yongming added a comment -

        We can expect Kuotai to release the first version of the patch for review this weekend, and possibly an improved version next week.

        Leif Hedstrom added a comment -

        We might want to consider moving this to 3.3.0. I'm hoping 3.1.4 will be the last dev release, and then on to 3.2. 3.1.5 is only for "crasher" bugs (and hopefully we will have none), not new features.

        Zhao Yongming added a comment -

        This is the first patch implementing multiple TCP connections in the cluster. It gives you an option to balance the connections between ET_CLUSTER threads, so more CPUs can be used for cluster traffic, e.g. when you have 10 boxes and each box has 32+ cores.

        This patch has been tested under heavy cluster traffic in our test bed, and it rocks.

        The ongoing tasks we are working on:
        1. Find out how to balance ET_NET and ET_CLUSTER:
        1.1 Should each ET_CLUSTER thread keep one connection to every cluster member?
        1.2 Why do the highest and lowest IPs in the cluster differ so much in CPU usage? E.g. the ET_CLUSTER threads on the highest IP use no CPU at all. Is that a problem related to steal threading?

        2. How do we deal with pre-cluster content caching?

        We hope to resolve task #1 before v3.2.

        I need your review of this patch, as it changes the interface used for cluster operations, i.e. from ClusterMachine to ClusterHandler, because ClusterHandler:ClusterMachine is now M:1. Thanks.
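
        To make the M:1 relationship above concrete, here is a minimal Python sketch that spreads several connections per cluster member across ET_CLUSTER threads. It is not the patch's code: the function name `assign_connections`, the `conns_per_machine` parameter, and the round-robin policy are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import cycle


def assign_connections(machines, num_threads, conns_per_machine):
    """Spread conns_per_machine connections per machine over threads round-robin.

    Returns {thread_index: [(machine, connection_index), ...]}.
    Hypothetical helper for illustration; not part of Traffic Server.
    """
    if num_threads <= 0 or conns_per_machine <= 0:
        raise ValueError("num_threads and conns_per_machine must be positive")
    assignment = defaultdict(list)
    threads = cycle(range(num_threads))
    for machine in machines:
        # Several connections (handlers) map to one machine: the M:1 relation.
        for conn in range(conns_per_machine):
            assignment[next(threads)].append((machine, conn))
    return dict(assignment)


if __name__ == "__main__":
    plan = assign_connections(["10.0.0.1", "10.0.0.2"], num_threads=4, conns_per_machine=4)
    for thread, conns in sorted(plan.items()):
        print(f"ET_CLUSTER {thread}: {conns}")
```

        With one connection per thread per member, every ET_CLUSTER thread ends up holding a connection to every member, which is the arrangement question 1.1 asks about.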

        Zhao Yongming added a comment -

        The new patch from Kuotai fixes the CPU balance issue in the cluster: requests from the cluster now use ET_CLUSTER for cache reads and writes and no longer interfere with ET_NET.

        From our point of view, this is a working patch that can be used in heavy cluster environments. Please review it if you are interested in clustering; I will commit it in a few days if I don't hear of any problems.

        Zhao Yongming added a comment -

        Here is the top report, with 20 ET_NET + 10 ET_CLUSTER threads, at 13k qps * 2 nodes:

        [root@test223 tmp]# top -b -H -n 1 |more
        top - 02:36:23 up  8:52,  1 user,  load average: 7.73, 7.26, 7.07
        Tasks: 530 total,  12 running, 518 sleeping,   0 stopped,   0 zombie
        Cpu(s): 20.3%us,  5.5%sy,  0.0%ni, 66.8%id,  4.9%wa,  0.0%hi,  2.6%si,  0.0%st
        Mem:  24570208k total,  8605744k used, 15964464k free,     7780k buffers
        Swap:  2097144k total,     7432k used,  2089712k free,    28452k cached
        
          PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        16075 ats       20   0 11.9g 7.6g 5016 D 53.6 32.5   9:36.31 [ET_CLUSTER 3]
        16076 ats       20   0 11.9g 7.6g 5016 D 44.3 32.5   7:39.06 [ET_CLUSTER 4]
        16073 ats       20   0 11.9g 7.6g 5016 R 40.6 32.5   7:26.82 [ET_CLUSTER 1]
        16080 ats       20   0 11.9g 7.6g 5016 D 40.6 32.5   7:21.94 [ET_CLUSTER 8]
        16020 ats       20   0 11.9g 7.6g 5016 R 38.8 32.5   6:25.31 [ET_NET 2]
        16074 ats       20   0 11.9g 7.6g 5016 R 38.8 32.5   7:03.35 [ET_CLUSTER 2]
        16028 ats       20   0 11.9g 7.6g 5016 S 36.9 32.5   6:16.60 [ET_NET 10]
        16033 ats       20   0 11.9g 7.6g 5016 R 35.1 32.5   6:23.01 [ET_NET 15]
        16037 ats       20   0 11.9g 7.6g 5016 R 33.2 32.5   6:24.99 [ET_NET 19]
        16026 ats       20   0 11.9g 7.6g 5016 S 31.4 32.5   6:24.66 [ET_NET 8]
        16022 ats       20   0 11.9g 7.6g 5016 S 29.6 32.5   6:33.72 [ET_NET 4]
        16072 ats       20   0 11.9g 7.6g 5016 D 27.7 32.5   4:41.45 [ET_CLUSTER 0]
        16024 ats       20   0 11.9g 7.6g 5016 R 25.9 32.5   6:19.70 [ET_NET 6]
        16025 ats       20   0 11.9g 7.6g 5016 S 25.9 32.5   6:22.40 [ET_NET 7]
        16027 ats       20   0 11.9g 7.6g 5016 S 25.9 32.5   6:20.17 [ET_NET 9]
        16032 ats       20   0 11.9g 7.6g 5016 S 25.9 32.5   6:26.90 [ET_NET 14]
        16034 ats       20   0 11.9g 7.6g 5016 R 25.9 32.5   6:33.76 [ET_NET 16]
        16035 ats       20   0 11.9g 7.6g 5016 S 25.9 32.5   6:35.37 [ET_NET 17]
        16077 ats       20   0 11.9g 7.6g 5016 R 25.9 32.5   4:22.31 [ET_CLUSTER 5]
        16078 ats       20   0 11.9g 7.6g 5016 R 25.9 32.5   4:24.50 [ET_CLUSTER 6]
        16079 ats       20   0 11.9g 7.6g 5016 D 25.9 32.5   4:25.54 [ET_CLUSTER 7]
        16081 ats       20   0 11.9g 7.6g 5016 S 25.9 32.5   4:38.39 [ET_CLUSTER 9]
        16019 ats       20   0 11.9g 7.6g 5016 S 24.0 32.5   6:25.65 [ET_NET 1]
        16029 ats       20   0 11.9g 7.6g 5016 S 24.0 32.5   6:22.07 [ET_NET 11]
        16031 ats       20   0 11.9g 7.6g 5016 R 24.0 32.5   6:27.21 [ET_NET 13]
        16036 ats       20   0 11.9g 7.6g 5016 S 24.0 32.5   6:28.33 [ET_NET 18]
        16021 ats       20   0 11.9g 7.6g 5016 S 22.2 32.5   6:18.55 [ET_NET 3]
        16017 ats       20   0 11.9g 7.6g 5016 S 20.3 32.5   6:41.56 [ET_NET 0]
        16030 ats       20   0 11.9g 7.6g 5016 R 20.3 32.5   6:26.94 [ET_NET 12]
        16023 ats       20   0 11.9g 7.6g 5016 S 18.5 32.5   6:21.96 [ET_NET 5]
        16608 root      20   0 15344 1460  832 R  9.2  0.0   0:00.08 top          
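
        One way to compare the two thread pools in a report like this is to average %CPU per thread type. The sketch below is an assumption-laden helper, not part of Traffic Server or top: the name `cpu_by_thread_type` is hypothetical, and it assumes the column layout shown above (%CPU in the ninth column, thread name in brackets at the end).

```python
from collections import defaultdict


def cpu_by_thread_type(top_output):
    """Return {thread_type: (thread_count, mean_cpu_percent)} for ET_* rows.

    Assumes `top -b -H` rows with 11 columns followed by "[ET_xxx N]",
    which splits into 13 whitespace-separated fields.
    """
    samples = defaultdict(list)
    for line in top_output.splitlines():
        fields = line.split()
        # Skip the summary, header, and non-ET rows.
        if len(fields) != 13 or not fields[11].startswith("[ET_"):
            continue
        samples[fields[11].lstrip("[")].append(float(fields[8]))
    return {name: (len(v), sum(v) / len(v)) for name, v in samples.items()}


if __name__ == "__main__":
    # Three rows copied from the report above, as a small sample.
    sample = """\
16075 ats       20   0 11.9g 7.6g 5016 D 53.6 32.5   9:36.31 [ET_CLUSTER 3]
16076 ats       20   0 11.9g 7.6g 5016 D 44.3 32.5   7:39.06 [ET_CLUSTER 4]
16020 ats       20   0 11.9g 7.6g 5016 R 38.8 32.5   6:25.31 [ET_NET 2]
"""
    for name, (count, mean) in sorted(cpu_by_thread_type(sample).items()):
        print(f"{name}: {count} threads, mean {mean:.1f}% CPU")
```

        Feeding it the full output of `top -b -H -n 1` gives one line per thread type, which makes an imbalance between ET_NET and ET_CLUSTER easier to spot than scanning the rows by eye.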
        
        Leif Hedstrom added a comment -

        please land this bad boy already.

        Zhao Yongming added a comment -

        Committed in e67b0e4038d9144c02cf41d6bf06e42083295a52.

        I will file more issues for later improvements.


          People

          • Assignee: Bin Chen
          • Reporter: Zhao Yongming
          • Votes: 0
          • Watchers: 2
