Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-7471

Improve bounded OVER support non-retract method AGG

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Table API & SQL
    • Labels:
      None

      Description

      Currently BOUNDED OVER WINDOW only support have retract method AGG. In this JIRA. will add non-retract method support.
      What do you think? Fabian Hueske

        Issue Links

          Activity

          Hide
          fhueske Fabian Hueske added a comment -

          Are you proposing to compute all aggregates for each record completely from scratch?
          How would we separate retracting and non-retracting aggregation functions from each other?
          Does this require a major refactoring of the code-generation and aggregation infrastructure that we implemented some months ago?

          Show
          fhueske Fabian Hueske added a comment - Are you proposing to compute all aggregates for each record completely from scratch? How would we separate retracting and non-retracting aggregation functions from each other? Does this require a major refactoring of the code-generation and aggregation infrastructure that we implemented some months ago?
          Hide
          sunjincheng121 sunjincheng added a comment -

          We know it's poor compute if all aggregates for each record completely from scratch performance. that's why we using retract method to implement BOUNDED OVER WINDOW. I think that's the best choice.
          We also have add the comment in AggregateFunction for retract method.

          ...
          This(retract) function must be implemented for
          datastream bounded over aggregate.
          

          So, I think we did the right thing when we do the aggregate design and implementation of the bounded over.

          But I think using the retract method is an optimized implementation of bounded over window, but we can not force the user to implement the retract method. especially the business scenario is an append only table that never generates retract information, so the user may define a aggregate function that does not have a retract method. we can not force the user must implement the retract method because of the way of realization for bounded over window. And I think we may not need a lot of refactoring the original code, but on the basis of the original code to increase the non-retract method support.

          Fabian Hueske I think your worry is necessary, but I do not really like the implementation of the OVER window will force the user to implement the AggregateFunction # retract method.

          Show
          sunjincheng121 sunjincheng added a comment - We know it's poor compute if all aggregates for each record completely from scratch performance. that's why we using retract method to implement BOUNDED OVER WINDOW. I think that's the best choice. We also have add the comment in AggregateFunction for retract method. ... This(retract) function must be implemented for datastream bounded over aggregate. So, I think we did the right thing when we do the aggregate design and implementation of the bounded over. But I think using the retract method is an optimized implementation of bounded over window, but we can not force the user to implement the retract method. especially the business scenario is an append only table that never generates retract information, so the user may define a aggregate function that does not have a retract method. we can not force the user must implement the retract method because of the way of realization for bounded over window. And I think we may not need a lot of refactoring the original code, but on the basis of the original code to increase the non-retract method support. Fabian Hueske I think your worry is necessary, but I do not really like the implementation of the OVER window will force the user to implement the AggregateFunction # retract method.
          Hide
          jark Jark Wu added a comment -

          Hi Fabian Hueske, I think sunjincheng mean that retraction for over aggregates is an optimization in most cases but not all cases. For example, min/max with retraction needs to store all records of the group in the min/max's accumulator. It's hard to say the overload of retraction mode (store all records) is certainly less than non-retraction mode (compute all records of a group). That's why the over window can support non-retract AGG. The over window mode (retract/non-retract) is depends on whether all the aggregates implement the retract methods. And the retract method is not mandatory anymore for over window.

          Show
          jark Jark Wu added a comment - Hi Fabian Hueske , I think sunjincheng mean that retraction for over aggregates is an optimization in most cases but not all cases. For example, min/max with retraction needs to store all records of the group in the min/max's accumulator. It's hard to say the overload of retraction mode (store all records) is certainly less than non-retraction mode (compute all records of a group). That's why the over window can support non-retract AGG. The over window mode (retract/non-retract) is depends on whether all the aggregates implement the retract methods. And the retract method is not mandatory anymore for over window.
          Hide
          fhueske Fabian Hueske added a comment -

          I understand the motivation for this issue and agree that it would be good to support non-retractable UDAGGs for bounded OVER aggregations.
          My question is rather what happens if we have a mix of retractable and non-retractable aggregation functions? Are the retractable functions efficiently computed or not?

          Treating all aggregation functions as non-retractable might introduce a significant and unnecessary performance penalty. Think a of a query with several retractable aggregations and then adding one non-retractable UDAGG.

          Handling retractable and non-retractable aggregation functions separately would mean to add quite a bit of additional logic. For example, we would need to extend the GeneratedAggregations interface to account for "empty" fields in the output record which would be filled by the other aggregation logic.

          Show
          fhueske Fabian Hueske added a comment - I understand the motivation for this issue and agree that it would be good to support non-retractable UDAGGs for bounded OVER aggregations. My question is rather what happens if we have a mix of retractable and non-retractable aggregation functions? Are the retractable functions efficiently computed or not? Treating all aggregation functions as non-retractable might introduce a significant and unnecessary performance penalty. Think a of a query with several retractable aggregations and then adding one non-retractable UDAGG. Handling retractable and non-retractable aggregation functions separately would mean to add quite a bit of additional logic. For example, we would need to extend the GeneratedAggregations interface to account for "empty" fields in the output record which would be filled by the other aggregation logic.

            People

            • Assignee:
              sunjincheng121 sunjincheng
              Reporter:
              sunjincheng121 sunjincheng
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development