Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6589

ListSerializer should deserialize as ArrayList with size + 1

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.3.0, 1.4.0
    • Fix Version/s: 1.3.0, 1.4.0
    • Component/s: Core
    • Labels:
      None

      Description

      The ListSerializer deserializes a list as ArrayList with exactly the required capacity, i.e., number of serialized objects.

      Several operators in the Table API have a MapState<Long, List<X>> to store received elements in a list per timestamp. Hence, retrieving the list and adding one element to the list is a very common operation.

      Since the list which is deserialized has no room left for adding elements, the first insertion into the list will result in growing the ArrayList which is expensive.

      I propose to initialize the ArrayList returned by the ListSerializer with numberOfSerializedElements + 1. This will only marginally increase the size of the list and allow for one insertion without growing the list.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user fhueske opened a pull request:

          https://github.com/apache/flink/pull/3912

          FLINK-6589 [core] Deserialize ArrayList with capacity of size+1 to prevent growth.

          Several Table API / SQL operators hold records in a `MapState[Long, List[X]]` keyed on a timestamp.
          When a new record arrives, the corresponding list is fetched and the record is added to the list.

          Currently, the `ListSerializer` deserializes lists as `ArrayList` with capacity exactly equal to the number of serialized elements. Hence, the `ArrayList` will grow when a new element is added which is an expensive operation.

          This PR changes the capacity of the deserialized `ArrayList` to #elements + 1 to avoid the growing the list when a single element is added.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/fhueske/flink listSer

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3912.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3912


          commit 1fa86b4e282ab0594d8dc768840de4c798e29591
          Author: Fabian Hueske <fhueske@apache.org>
          Date: 2017-05-15T19:41:51Z

          FLINK-6589 [core] Deserialize ArrayList with capacity of size+1 to prevent growth.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user fhueske opened a pull request: https://github.com/apache/flink/pull/3912 FLINK-6589 [core] Deserialize ArrayList with capacity of size+1 to prevent growth. Several Table API / SQL operators hold records in a `MapState[Long, List [X] ]` keyed on a timestamp. When a new record arrives, the corresponding list is fetched and the record is added to the list. Currently, the `ListSerializer` deserializes lists as `ArrayList` with capacity exactly equal to the number of serialized elements. Hence, the `ArrayList` will grow when a new element is added which is an expensive operation. This PR changes the capacity of the deserialized `ArrayList` to #elements + 1 to avoid the growing the list when a single element is added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/fhueske/flink listSer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3912.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3912 commit 1fa86b4e282ab0594d8dc768840de4c798e29591 Author: Fabian Hueske <fhueske@apache.org> Date: 2017-05-15T19:41:51Z FLINK-6589 [core] Deserialize ArrayList with capacity of size+1 to prevent growth.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user sunjincheng121 commented on the issue:

          https://github.com/apache/flink/pull/3912

          +1

          Show
          githubbot ASF GitHub Bot added a comment - Github user sunjincheng121 commented on the issue: https://github.com/apache/flink/pull/3912 +1
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on the issue:

          https://github.com/apache/flink/pull/3912

          Any concerns regarding this change @tzulitai @StephanEwen @StefanRRichter ?

          Show
          githubbot ASF GitHub Bot added a comment - Github user fhueske commented on the issue: https://github.com/apache/flink/pull/3912 Any concerns regarding this change @tzulitai @StephanEwen @StefanRRichter ?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StefanRRichter commented on the issue:

          https://github.com/apache/flink/pull/3912

          No concern, just curious if there is any reason for the 1 additional. The problem also still exists now for a second element. So is it, in general, unlikely that new elements are added? Or should we add some more additional space, like 10%?

          Show
          githubbot ASF GitHub Bot added a comment - Github user StefanRRichter commented on the issue: https://github.com/apache/flink/pull/3912 No concern, just curious if there is any reason for the 1 additional. The problem also still exists now for a second element. So is it, in general, unlikely that new elements are added? Or should we add some more additional space, like 10%?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tzulitai commented on the issue:

          https://github.com/apache/flink/pull/3912

          Share the same question as Stefan. Otherwise LGTM.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tzulitai commented on the issue: https://github.com/apache/flink/pull/3912 Share the same question as Stefan. Otherwise LGTM.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on the issue:

          https://github.com/apache/flink/pull/3912

          At least in the Table API / SQL operators, we use a ProcessFunction add elements in the `processElement()` method. So, in the case that I want to improve just adds a single element.

          My rational was that this is the most common use case (I might be wrong here). I would be fine with adding more, but the question is how often is that actually used.

          Show
          githubbot ASF GitHub Bot added a comment - Github user fhueske commented on the issue: https://github.com/apache/flink/pull/3912 At least in the Table API / SQL operators, we use a ProcessFunction add elements in the `processElement()` method. So, in the case that I want to improve just adds a single element. My rational was that this is the most common use case (I might be wrong here). I would be fine with adding more, but the question is how often is that actually used.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tzulitai commented on the issue:

          https://github.com/apache/flink/pull/3912

          If the only case we currently know of right now is in the Table API with 1 additional element to be added, then I think this is good for now.

          I think the improvement should be bounded to specific cases anyway, because the problem will always exist no matter how much extra space we add. For example the `ArrayListSerializer` used for list state could have any arbitrary amounts of elements added.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tzulitai commented on the issue: https://github.com/apache/flink/pull/3912 If the only case we currently know of right now is in the Table API with 1 additional element to be added, then I think this is good for now. I think the improvement should be bounded to specific cases anyway, because the problem will always exist no matter how much extra space we add. For example the `ArrayListSerializer` used for list state could have any arbitrary amounts of elements added.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StefanRRichter commented on the issue:

          https://github.com/apache/flink/pull/3912

          Yes, there is a tradeoff, and probably the nicest way would be to restore the internal array size as it was when we serialized the original. However, I guess that is too much effort for the effect. Since the sweetspot also depends on the access pattern, I think everything from +1 to some percentage could be justified and as there is no clear "best strategy" I would leave this up to your judgment call.

          +1 from me then.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StefanRRichter commented on the issue: https://github.com/apache/flink/pull/3912 Yes, there is a tradeoff, and probably the nicest way would be to restore the internal array size as it was when we serialized the original. However, I guess that is too much effort for the effect. Since the sweetspot also depends on the access pattern, I think everything from +1 to some percentage could be justified and as there is no clear "best strategy" I would leave this up to your judgment call. +1 from me then.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on the issue:

          https://github.com/apache/flink/pull/3912

          Thank you, I'll merge this then.

          Show
          githubbot ASF GitHub Bot added a comment - Github user fhueske commented on the issue: https://github.com/apache/flink/pull/3912 Thank you, I'll merge this then.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3912

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3912
          Hide
          fhueske Fabian Hueske added a comment -

          Implemented for 1.3.0 with 98f4fad93263b05f0e55562ddd81385430546225
          Implemented for 1.4.0 with 2bfead7d9bef51713ed203fa7979f71f23525733

          Show
          fhueske Fabian Hueske added a comment - Implemented for 1.3.0 with 98f4fad93263b05f0e55562ddd81385430546225 Implemented for 1.4.0 with 2bfead7d9bef51713ed203fa7979f71f23525733

            People

            • Assignee:
              fhueske Fabian Hueske
              Reporter:
              fhueske Fabian Hueske
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development