Flink / FLINK-20103

Improve test coverage with chaos testing & side-by-side tests

Details

    Description

      This is a follow-up ticket after FLINK-20097.

      With the current setup (UnalignedITCase):

      • race conditions are not detected reliably (roughly 1 failure per tens of runs)
      • detecting them requires changing the configuration (a low checkpoint timeout)
      • adding a new job graph often reveals a new bug

      An additional issue with the current setup is that it is difficult to use git bisect over long commit ranges.
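      To make the bisect cost concrete, here is a rough back-of-the-envelope sketch. It assumes "1 per tens of runs" means a failure probability of about 1/30 per run, a 95% confidence target per bisect step, and a range of 500 commits; all three numbers are illustrative assumptions, not measurements, and the function names are made up for this sketch.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Runs required to see at least one failure with the given confidence."""
    # P(no failure in n runs) = (1 - p)^n, so solve (1 - p)^n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


def bisect_cost(commits: int, failure_rate: float, confidence: float) -> int:
    """Worst-case number of test runs for a bisect over `commits` commits."""
    steps = math.ceil(math.log2(commits))  # bisect halves the range each step
    return steps * runs_needed(failure_rate, confidence)


if __name__ == "__main__":
    # Illustrative values only (see the assumptions stated above).
    print("runs per bisect step:", runs_needed(1 / 30, 0.95))
    print("runs for 500 commits:", bisect_cost(500, 1 / 30, 0.95))
```

      Under these assumptions a single step needs roughly 90 runs and the whole bisect about 800. A "good" verdict is also only probabilistic, so one false "good" sends the bisect down the wrong half.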

      Changes that might hide the bugs:

      • Preconditions in ChannelStatePersister (they slow down processing)
      • some Preconditions that may mask errors by causing a job restart
      • timings in tests (UnalignedITCase)

      Some options to consider:

      1. chaos monkey tests, including induced latency and/or CPU bursts, on different workloads/configs
      2. side-by-side tests with randomized inputs/configs
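      Option (2) could look roughly like the sketch below. It is not Flink code: `reference_count` and `chunked_count` are hypothetical stand-ins for a trusted baseline and the system under test, and `chunk_size` stands in for a randomized config. Results are compared as multisets, so output order does not matter, and each failing case records the seed so it can be replayed.

```python
import random
from collections import Counter


def reference_count(records):
    """Trusted baseline: one pass, no chunking."""
    return Counter(records)


def chunked_count(records, chunk_size):
    """Stand-in for the system under test: processes the input in chunks."""
    total = Counter()
    for start in range(0, len(records), chunk_size):
        total.update(records[start:start + chunk_size])
    return total


def side_by_side(under_test, seed, trials=100):
    """Run both implementations on random inputs/configs; return failing cases."""
    rng = random.Random(seed)  # seeded so that a failure can be reproduced
    failures = []
    for trial in range(trials):
        records = [rng.choice("abcde") for _ in range(rng.randint(0, 200))]
        chunk_size = rng.randint(1, 50)  # the randomized "config"
        if reference_count(records) != under_test(records, chunk_size):
            failures.append((seed, trial, chunk_size, len(records)))
    return failures


def drops_last_record(records, chunk_size):
    """Deliberately buggy variant, used to check that the harness notices."""
    return chunked_count(records[:-1], chunk_size)


if __name__ == "__main__":
    print("correct impl failures:", len(side_by_side(chunked_count, seed=42)))
    print("buggy impl failures:", len(side_by_side(drops_last_record, seed=42)))
```

      The design choice is that no expected output is written by hand: the baseline is the oracle, so new inputs and configs can be added without writing new assertions.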

      Extending Jepsen coverage further (validating output) does not seem promising in the context of Flink because its output isn't linearisable.
       

      Some tools for (1) that could be used:

      1. https://github.com/chaosblade-io/chaosblade (docs need translation)
      2. https://github.com/Netflix/chaosmonkey (requires Spinnaker (CD))
      3. JVM agent: https://github.com/mrwilson/byte-monkey
      4. https://vmware.github.io/mangle/ (supports Java method latency; UI-oriented?; not actively maintained?)
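      Independent of which tool is picked, the effect of induced latency can be shown with a small sketch. It does not use any of the tools above; `RacyCounter` and `make_latency_injector` are hypothetical names, and the delay range is an arbitrary assumption. The injected sleep sits between a read and a write, which widens the race window enough that lost updates show up in a single run.

```python
import random
import threading
import time


def make_latency_injector(seed, max_delay_s):
    """Return a hook that sleeps for a random, seeded delay."""
    rng = random.Random(seed)
    rng_lock = threading.Lock()  # guard the shared random generator

    def inject():
        with rng_lock:
            delay = rng.uniform(max_delay_s / 2, max_delay_s)
        time.sleep(delay)

    return inject


class RacyCounter:
    """Counter whose increment is a separate read and write."""

    def __init__(self, inject_latency):
        self.value = 0
        self.lock = threading.Lock()
        self._inject_latency = inject_latency

    def increment_racy(self):
        current = self.value
        self._inject_latency()  # chaos hook: widens the read/write window
        self.value = current + 1

    def increment_locked(self):
        with self.lock:
            self.increment_racy()


def run(increment_name, threads=4, per_thread=20, max_delay_s=0.002):
    """Run concurrent increments and return the final counter value."""
    counter = RacyCounter(make_latency_injector(seed=1, max_delay_s=max_delay_s))
    increment = getattr(counter, increment_name)

    def worker():
        for _ in range(per_thread):
            increment()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print("expected:", 4 * 20)
    print("locked:  ", run("increment_locked"))
    print("racy:    ", run("increment_racy"))
```

      Without the injected sleep the racy variant would usually pass, which matches the observation that the current tests catch such bugs only occasionally.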

       
       

            People

              Assignee: Unassigned
              Reporter: Roman Khachatryan
              Votes: 0
              Watchers: 5
