Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.2, 1.4.0, 1.3.2
    • Fix Version/s: 1.4.0, 1.3.2
    • Component/s: Documentation
    • Labels:
      None

      Description

      The CombineHint documentation applies to DataSet#reduce, not DataSet#reduceGroup, and should also be noted for DataSet#distinct. The hint is set with .setCombineHint(CombineHint) rather than alongside the UDF parameter.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user greghogan opened a pull request:

          https://github.com/apache/flink/pull/4372

          FLINK-7234 [docs] Fix CombineHint documentation

            What is the purpose of the change

          Update and correct the documentation for the use of CombineHint with `reduce` or `distinct`. The documentation for 1.2/1.3/1.4 has CombineHint described under `groupReduce`.

            Brief change log

          Documentation was moved and slightly reworded.

          This change is a trivial rework / code cleanup without any test coverage.

            Does this pull request potentially affect one of the following parts:

          Dependencies (does it add or upgrade a dependency): (yes / *no*)
          The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / *no*)
          The serializers: (yes / *no* / don't know)
          The runtime per-record code paths (performance sensitive): (yes / *no* / don't know)
          Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / *no* / don't know):

            Documentation

          Does this pull request introduce a new feature? (yes / *no*)
          If yes, how is the feature documented? (not applicable / *docs* / JavaDocs / not documented)

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/greghogan/flink 7234_fix_combinehint_documentation

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/4372.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #4372


          commit ec143c7ad08a47de96b418878efd540ec7328636
          Author: Greg Hogan <code@greghogan.com>
          Date: 2017-07-19T19:24:20Z

          FLINK-7234 [docs] Fix CombineHint documentation

          The CombineHint documentation applies to DataSet#reduce not
          DataSet#reduceGroup and should also be note for DataSet#distinct. It is
          also set with .setCombineHint(CombineHint) rather than alongside the UDF
          parameter.


          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on a diff in the pull request:

          https://github.com/apache/flink/pull/4372#discussion_r128351347

          --- Diff: docs/dev/batch/index.md ---
          @@ -205,20 +205,25 @@ data.filter(new FilterFunction<Integer>() {
          <td><strong>Reduce</strong></td>
          <td>
          <p>Combines a group of elements into a single element by repeatedly combining two elements
          - into one. Reduce may be applied on a full data set, or on a grouped data set.</p>
          + into one. Reduce may be applied on a full data set or on a grouped data set.</p>
          {% highlight java %}
          data.reduce(new ReduceFunction<Integer> {
            public Integer reduce(Integer a, Integer b) { return a + b; }
          });
          {% endhighlight %}
          + <p>If the reduce was applied to a grouped data set then you can specify the way that the
          --- End diff --

          `If the reduce was ...` -> `If the reduce transformation was ...` or `If the reduce function was ...`?

          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on the issue:

          https://github.com/apache/flink/pull/4372

          One minor comment. Otherwise +1 to merge.

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/4372

          Thanks for trying out the discussed template @greghogan

          I think for docs, you can invoke the "(The sections below can be removed for hotfixes of typos)" clause.

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/4372

          @StephanEwen I like the new template. I much prefer free form over checkboxes.

          @fhueske I'm questioning my understanding of the heuristic for using a hash-combine. For a fixed number of keys the hash-combine can be beneficial independent of the size of the data set. Basing the decision on the ratio of keys to values: as the size of the data set increases, the likelihood of matching keys and values occurring in the same combine operation (before filling and being flushed to the reducer) decreases.

          This is often the case for graphs. I'm thinking that the improvement for using hash-combine on larger data sets may have been due to hashing performing better than sort when we wanted to disable the combiner.

          githubbot ASF GitHub Bot added a comment -

          Github user fhueske commented on the issue:

          https://github.com/apache/flink/pull/4372

          I think you are right @greghogan.

          It's not about the ratio of #distinct keys to size of the dataset. But it's also not only the ratio of #distinct keys to size of the memory. The skew of the key distribution has an effect as well (hash-based combiners should better handle skew than sort-based combiners).
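
The argument in the two comments above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Flink's combiner: the class `CombineSketch`, the method `emittedByHashCombiner`, the flush-everything-when-full policy, and the round-robin key order are all hypothetical choices made for the illustration. It counts how many records a memory-bounded hash combiner would pass on to the reducer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a memory-bounded hash combiner (not Flink code).
final class CombineSketch {

    /**
     * Feeds numRecords (key, 1) pairs through a hash table that can hold at most
     * `capacity` distinct keys and sums the values per key. When a new key arrives
     * and the table is full, the whole table is flushed downstream (assumed policy).
     * Returns the number of records emitted to the reducer.
     */
    static long emittedByHashCombiner(long numRecords, int numKeys, int capacity) {
        Map<Integer, Long> table = new HashMap<>();
        long emitted = 0;
        for (long i = 0; i < numRecords; i++) {
            // Assumption: keys arrive round-robin, the least favourable order
            // for combining once numKeys exceeds the table capacity.
            int key = (int) (i % numKeys);
            if (!table.containsKey(key) && table.size() == capacity) {
                emitted += table.size(); // flush partial aggregates
                table.clear();
            }
            table.merge(key, 1L, Long::sum); // combine in place
        }
        return emitted + table.size(); // final flush at end of input
    }

    public static void main(String[] args) {
        int capacity = 1_000;
        for (long n : new long[] {10_000L, 100_000L}) {
            long fixedKeys = emittedByHashCombiner(n, 100, capacity);
            long proportionalKeys = emittedByHashCombiner(n, (int) (n / 2), capacity);
            System.out.println("records=" + n
                    + " emitted(fixed 100 keys)=" + fixedKeys
                    + " emitted(keys = records/2)=" + proportionalKeys);
        }
    }
}
```

With a fixed set of 100 keys the table never fills, so only 100 records are emitted regardless of input size. When the number of keys grows with the input and exceeds the table capacity, no key repeats before a flush in this model and every input record is emitted, so the combiner does no useful work. The model ignores key skew, which the comment above notes also matters.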

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/4372

          @fhueske I started running some benchmarks on HITS with each of the combiners (sort, hash, none) and at small scales hash is fastest, followed by none, with sort in last. None is somewhat faster than hash for very small graphs. I've gotten stuck on scale 23 where the job is deadlocking on the first iteration for all three types. Interestingly I also see the deadlock on my laptop running from IntelliJ. I need to look further because I'm not seeing what is releasing memory buffers when all operators are seemingly waiting on acquiring a buffer from `LocalBufferPool`.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/4372

          greghogan Greg Hogan added a comment -

          master: 4a88f6587fdfadd5749188a76e6b38a3585cd31b
          release-1.3: d0a9fe013e0dccf7ff329aaedf085e8c1c133ae0


            People

            • Assignee:
              greghogan Greg Hogan
              Reporter:
              greghogan Greg Hogan
