Aurora / AURORA-458

Web interface has become slow, especially the job page

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: UI
    • Labels: None
    • Sprint: Aurora Q3 Sprint 2

      Description

      The web interface is noticeably more sluggish since the revamp. This is most noticeable for large jobs, where the job page may display a blank page for several seconds before showing anything useful. We need to adapt the API to reduce the amount of data fetched to render these pages.

      1. Screen Shot 2014-05-22 at 11.42.24 AM.png
        227 kB
        David McLaughlin
      2. Screen Shot 2014-05-22 at 11.44.27 AM.png
        238 kB
        David McLaughlin
      3. scheduler-profile.csv
        334 kB
        David McLaughlin
      4. scheduler-profile.png
        138 kB
        David McLaughlin
      5. scheduler-profile-curl.csv
        243 kB
        David McLaughlin
      6. scheduler-profile-curl.png
        145 kB
        David McLaughlin

        Activity

        David McLaughlin added a comment -

        This should now be fixed. The solution in the end was to go with the tasks endpoint without configs.
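
        For reference, a rough sketch of exercising a configs-free tasks call directly with curl, in the same style as the profiling commands further down. The method name getTasksWithoutConfigs, and the assumption that it accepts the same TaskQuery payload as getTasksStatus, are illustrative rather than confirmed here:

        $ # Same query as the profiling runs below, but asking for tasks without their configs.
        $ time curl -s 'http://localhost:8081/api' --data-binary '[1,"getTasksWithoutConfigs",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' > /tmp/results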

        David McLaughlin added a comment - edited

        So these profile runs show conclusively that GzipStream is the cause.

        This is timed output from a local run with no network latency:

        $ time curl -s 'http://localhost:8081/api' -H 'Accept-Encoding: gzip,deflate,sdch' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' --compressed > /tmp/results
        real  0m1.530s
        user  0m0.014s
        sys 0m0.011s
        
        
        $ time curl -s 'http://localhost:8081/api' -H 'Origin: http://localhost:8081' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' > /tmp/blah
        
        real  0m0.297s
        user  0m0.007s
        sys 0m0.015s
        

        As you can see, without compression it is 5x faster.

        With actual network latency (and a real production job with a much bigger payload - 10MB vs 3MB on local):

        $ time curl 'https://internal-scheduler/api' -H 'Accept-Encoding: gzip,deflate,sdch' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"test"}}},"9":{"str":"prod"},"2":{"str":"bigJob"}}}}]' --compressed > /tmp/results
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
        100  305k  100  305k  100   124  63172     25  0:00:04  0:00:04 --:--:-- 81652
        
        real  0m4.957s
        user  0m0.038s
        sys 0m0.024s
        
        
        $ time curl 'https://internal-scheduler/api' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"test"}}},"9":{"str":"prod"},"2":{"str":"bigJob"}}}}]' > /tmp/results
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
        100 10.3M  100 10.3M  100   124  3670k     42  0:00:02  0:00:02 --:--:-- 3684k
        
        real  0m2.904s
        user  0m0.192s
        sys 0m0.083s
        

        Still nearly twice as fast. So we should remove on-the-fly gzip compression for dynamic content.
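
        Once that change lands, it can be sanity-checked from the command line. A minimal sketch (assuming a local scheduler on :8081): dump only the response headers and confirm that /api no longer answers with Content-Encoding: gzip even when the client advertises gzip support.

        $ curl -s -D - -o /dev/null 'http://localhost:8081/api' -H 'Accept-Encoding: gzip,deflate,sdch' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' | grep -i 'content-encoding'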

        David McLaughlin added a comment -

        Attached a second profile run. This time I isolated the server from the browser by just running the same curl command that the UI does.

        David McLaughlin added a comment -

        Attached jvisualvm profiling of Scheduler for a request to "BigJob" (2000 active tasks).

        David McLaughlin added a comment -

        I've posted a proposal for fixing this on the mailing list. Archive link: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201405.mbox/%3CCAOOJoEwgXx2ZRUwNqC2augOyCxqm_97niccW2kUe36HPdWJ%2BeQ%40mail.gmail.com%3E
        David McLaughlin added a comment -

        Just realised that I took a performance profile of the role page, not the job page. The job page will need API changes in addition to the asynchronous thrift client.

        David McLaughlin added a comment - edited

        Attaching screenshots of the network debug tab for a large job. The thrift API response takes around 1.25s, which we could certainly improve on, but that doesn't explain the 4-5s delay in rendering the page.

        I noticed that the thrift requests are happening serially, which would suggest the XMLHttpRequest operations are synchronous. That blocks the entire browser and most likely accounts for the bulk of the performance issues.
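
        A rough way to confirm that the serialization happens in the browser rather than in the scheduler (a sketch, assuming a local scheduler on :8081): fire two of the API calls concurrently from a shell; the combined wall-clock time should be close to that of a single request rather than their sum.

        $ # Two concurrent requests; if the scheduler were serializing them, the times would add up.
        $ time ( curl -s 'http://localhost:8081/api' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' > /tmp/r1 & curl -s 'http://localhost:8081/api' --data-binary '[1,"getTasksStatus",1,0,{"1":{"rec":{"8":{"rec":{"1":{"str":"mesos"}}},"9":{"str":"test"},"2":{"str":"bigJob"}}}}]' > /tmp/r2 & wait )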


          People

          • Assignee: David McLaughlin
          • Reporter: Bill Farner
          • Votes: 0
          • Watchers: 2

            Dates

            • Created:
              Updated:
              Resolved:
