Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Labels:
      None

      Description

      Note: this ticket has become a general ticket about improving the performance of the platform.

      In order to track performance improvements, we need some reproducible performance benchmarks. Here are some ideas of what we'd need:

      • use PEs that do nothing but create a new message and forward it; this lets us focus on the overhead of the platform
      • measure the maximum throughput without dropping messages on a given host (in a setup with 1 adapter node and 1 or 2 app nodes)
      • measure the latency of end-to-end processing (average, median, etc.)
      • use a very simple app, with only 1 PE prototype
      • vary the number of keys
      • use a slightly more complex app (at least 2 communicating prototypes), in order to take inter-PE communications and related optimizations into account
      • start measurements after a warmup phase
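      As a sketch of the latency measurement idea above, the code below discards a warmup prefix of the recorded latencies and reports average and median. The class and method names are illustrative, not part of S4.

```java
import java.util.Arrays;

public class LatencyStats {

    // Given end-to-end latencies in recording order, drop the first `warmup`
    // samples (the warmup phase), then return { average, median }.
    public static double[] avgAndMedian(long[] latencies, int warmup) {
        long[] measured = Arrays.copyOfRange(latencies, warmup, latencies.length);
        Arrays.sort(measured);
        double avg = Arrays.stream(measured).average().orElse(Double.NaN);
        int mid = measured.length / 2;
        double median = (measured.length % 2 == 1)
                ? measured[mid]
                : (measured[mid - 1] + measured[mid]) / 2.0;
        return new double[] { avg, median };
    }
}
```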

      Some of these tests could be part of the test suite (enabled by a dedicated option for performance-related tests). That would allow tracking performance over time.

      We could also add a simple injection mechanism that would work out of the box with the example bundled with new S4 apps (through the "s4 newApp" command).


          Activity

          Matthieu Morel added a comment -

          Merged with dev branch in commit [2a95d35]

          Thanks for the feedback Aimee and Daniel!

          Note that the configuration of emitter and listener policies will be made easier when S4-59 is integrated.

          Daniel Gómez Ferro added a comment -

          +1, great work!

          Matthieu Morel added a comment -

          Just uploaded some more improvements in commit 6fe7ea8

          By default, PEs are executed using a load-shedding stream executor (i.e. events are dropped when the work queue is full) and a blocking sender.

          Other executors are provided and can be used as replacements: for instance, a memory-aware executor (from Netty) and a throttling executor (which limits the maximum rate of task submission). The throttling executor can be appropriate for throttling event sources, for instance.

          It is also possible to use blocking stream executors, in order to prevent event loss, though depending on the design of the app, using both a blocking sender and a blocking stream executor may lead to deadlocks. When using a load-shedding executor, event loss during event bursts can be minimized by using large work queues.
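          The load-shedding vs. blocking trade-off can be sketched with plain JDK executors. This illustrates the two policies only; it is not S4's actual implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamExecutors {

    // Load-shedding policy: bounded work queue; excess events are dropped (and counted).
    public static ThreadPoolExecutor loadShedding(int queueSize, AtomicInteger dropped) {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                (task, executor) -> dropped.incrementAndGet());
    }

    // Blocking policy: the submitting thread waits for queue space, so no event is
    // lost, at the risk of deadlock when combined with a blocking sender upstream.
    public static ThreadPoolExecutor blocking(int queueSize) {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                (task, executor) -> {
                    try {
                        executor.getQueue().put(task);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
    }

    // Deterministic demo: block the single worker, fill the queue (capacity 1),
    // submit two more events, and return how many were shed.
    public static int demoDropCount() {
        AtomicInteger dropped = new AtomicInteger();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor exec = loadShedding(1, dropped);
        try {
            exec.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();        // the single worker is now busy; the queue is empty
            exec.execute(() -> {}); // fills the queue
            exec.execute(() -> {}); // queue full -> shed
            exec.execute(() -> {}); // queue full -> shed
            release.countDown();
            exec.shutdown();
            exec.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return dropped.get();
    }
}
```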

          Matthieu Morel added a comment -

          Uploaded more improvements for stability (sender backpressure) and configurability (pluggable deserialization executors, more overridable parameters).

          Also improved the benchmark applications (updated README file).

          I also experimented with Netty 4, but some features are still missing in that version (buffer pooling, memory-aware executors), so this version of S4 still uses Netty 3.

          These updates are available in commit 8117de6

          Matthieu Morel added a comment -

          Patch update in commit cc2c4e8 fixes pending classloading issues.

          (There is still transport-layer configuration to improve: currently the listener implementation is hardcoded as TCPListener.)

          Matthieu Morel added a comment - edited

          I just uploaded more improvements in commit f9689ea.

          In addition to improvements from previous commits, we further decoupled the processing pipeline into asynchronous stages with configurable executors: reception, deserialization, processing, serialization, and event emission.
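          A minimal sketch of such staged decoupling, using the JDK's CompletableFuture with one executor per stage. This is illustrative only; S4's pipeline is not implemented this way verbatim, and the stage functions here are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

public class StagedPipeline {

    // Each stage runs on its own executor, so stages proceed asynchronously and a
    // slow stage can be tuned (threads, queueing) independently of the others.
    public static CompletableFuture<String> process(byte[] raw,
            Executor deserializer, Executor processor, Executor emitter) {
        return CompletableFuture
                .supplyAsync(() -> new String(raw, StandardCharsets.UTF_8), deserializer) // deserialization stage
                .thenApplyAsync(String::toUpperCase, processor)                           // processing stage
                .thenApplyAsync(s -> "emitted:" + s, emitter);                            // serialization/emission stage
    }
}
```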

          We can reach 250,000+ remote events/s per node with a sufficient number of injectors, on hardware and an OS that are not the most recent; that translates into 1M+ remote events per second with only 4 S4 nodes.

          Since remote events require going through the communication layer and the network, we expect even higher rates when combining them with internal events, as in a standard application. (Internal events are passed by reference.)

          See the README.md file in the s4-benchmarks subproject for instructions to reproduce the tests.

          Matthieu Morel added a comment -

          I slightly updated the benchmarking framework, and added several optimizations. They are included in this branch so that current performance work is centralized here.

          Current optimizations mainly focus on messaging:

          • serialization of messages now uses Kryo 2 and, in particular, avoids a previously identified deserialization bottleneck
          • serialized messages are passed around as byte buffers in order to limit copies (and there is room for further improvement here)
          • intermediate EventMessage instances have been removed

          My throughput measurements (messages/s), reproducible with the benchmarking framework, show improvements of up to 300% with respect to the previous commit.
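          The copy-limiting idea above can be illustrated with java.nio.ByteBuffer views, which let a consumer read a slice of a serialized message without duplicating its bytes. This is a generic sketch, not S4's actual message classes.

```java
import java.nio.ByteBuffer;

public class BufferViews {

    // Returns a read-only view of [offset, offset + length) of the message buffer.
    // The view shares the backing storage: no bytes are copied, and each consumer
    // gets independent position/limit bookkeeping.
    public static ByteBuffer payloadView(ByteBuffer message, int offset, int length) {
        ByteBuffer view = message.asReadOnlyBuffer();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }
}
```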

          Matthieu Morel added a comment -

          Uploaded a patch in branch S4-95. Includes scripts for running out of the box, on a local host or a cluster, and a very simple test application (currently the benchmark only evaluates inter-cluster communication overhead).

          Matthieu Morel added a comment -

          The benchmarking framework adds metrics across the platform codebase.


            People

            • Assignee:
              Matthieu Morel
              Reporter:
              Matthieu Morel
            • Votes:
              0
              Watchers:
              4
