Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-33538

Create a "best match" option for CEP when using the optional operator in the Pattern API

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Library / CEP
    • None

    Description

      I’ll try to detail my issue and the initial experimentation I’ve done in the CEP source code to show the potential of making this nearly as fast as having no optional events in the pattern.

       

      Using Flink CEP I have implemented a pattern that detects a sequence of events. Each event has internal attributes that are checked and inter-event time deltas are also checked as part of the pattern.

       

      When I require all events in my sequence (say 10 elements long) Flink CEP works well and is super fast at detecting the “perfect” sequences. Should one or more of the events not get recorded, I would still like to detect the partial sequence. Since I don’t know which events might be missing I must make all of the events optional. While this worked, it was significantly slower, to the point that it is an unworkable solution.

       

      CEP is set to alert as fast as possible – so when everything is optional, once it passes a single event the path to final in ComputationState is immediately found and the potentialMatch is emitted. If a perfect match is coming, I will not find the pattern unless the skip strategy is set to noSkip(). Because all are optional, in the instance of a perfect match, CEP wants to emit all 2^10 possible matches when noSkip() is used. This is what causes the extreme performance drop off.

       

      I would like to add an option to Flink CEP that allows the “best match” when the optional() attribute is used in the Pattern API. The default can still be “fastest match” which would operate as it does today. In the “best match” scenario, the potentialMatches would be held back (not emitted) until the full pattern time has passed.

       

      I have made some modifications to the source code (primarily org.apache.flink.cep.nfa.NFA) that show this can work and be nearly as fast as the perfect match option. My basic strategy is the following:

      • Once a longer match is found, discard all of the shorter length sub-matches from partialMatches and potentialMatches.
      • Eg –  [E1, $endState$]
      • [E1, E2, $endState$] - discard the [E1, $endState$] and [E2, $endState$] and those partialMatch paths as well
      • [E1, E2, E3, $endState$] – discard the [E1, E2, $endState$], etc.
      • DeweyNumber allows a way to quickly assess match lengths, by comparing DeweyNumber.lengths.
      • DeweyNumber provides a way to assess matches, with its versioning – matches tend to use a version of zero (eg. 1.0.0.0.0.0 is a perfect match to the first 6 events of the pattern)
      • Delay emitting the “best match” until the options are exhausted, eliminating the shorter matches as quickly as possible.

       

      I’m at a place in my modifications where it would be very beneficial to work with an expert that would understand how best to accomplish this without impacting current functionality/performance. I’m happy to collaborate and share my work if that helps. My current modifications are a combination of mods and logs to allow me to see what is going on internally in the CEP processing.

      Attachments

        Activity

          People

            Unassigned Unassigned
            wblehma Brian Lehman
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: