Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Labels: None
Description
The Flink ML benchmark framework estimates throughput by having a source operator generate a given number of input records (e.g. 10^7) with random values, letting the given AlgoOperator process these records, and dividing the number of records by the total execution time.
The overhead of generating random values for every input record has an observable impact on the estimated throughput. We would like to minimize the source operator's overhead so that the benchmark result reflects the throughput of the AlgoOperator as closely as possible.
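For concreteness, the estimate boils down to the sketch below. This is illustrative only; runBenchmark is a hypothetical stand-in for executing the source-to-AlgoOperator pipeline, not the framework's actual API.

```java
// Illustrative sketch of the throughput estimate (not the framework's code).
public class ThroughputSketch {

    // Hypothetical stand-in: build the source, run the AlgoOperator over
    // numRecords inputs, and block until the job completes.
    static void runBenchmark(long numRecords) {}

    public static void main(String[] args) {
        long numRecords = 10_000_000L; // e.g. 10^7 input records
        long start = System.nanoTime();
        // The timed region includes the source operator's per-record value
        // generation, which is why that overhead skews the estimate.
        runBenchmark(numRecords);
        double elapsedSeconds = (System.nanoTime() - start) / 1e9;
        double throughput = numRecords / elapsedSeconds; // records per second
        System.out.println("Estimated throughput: " + throughput + " records/s");
    }
}
```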
Note that spark-sql-perf generates all input records in memory before running the benchmark. This allows the Spark ML benchmark to read records from memory instead of generating values for them during the run.
Instead, we can generate a value once and re-use it for all input records. This approach minimizes the source operator overhead and allows us to fairly compare Flink ML benchmark results with Spark ML benchmark results (from spark-sql-perf).
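Below is a minimal sketch of that idea using the vanilla Flink DataStream API. The class and method names are illustrative, and the record type is simplified to double[]; the actual benchmark framework's generator API may differ.

```java
import java.util.Random;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative sketch: generate the random values exactly once per task and
// re-use the same array for every emitted record, so the per-record cost of
// the source is reduced to emitting a reference.
public class ReusedValueSourceSketch {

    public static DataStream<double[]> randomVectors(
            StreamExecutionEnvironment env, long numRecords, int dims, long seed) {
        return env.fromSequence(0, numRecords - 1)
                .map(new RichMapFunction<Long, double[]>() {
                    private transient double[] reused;

                    @Override
                    public void open(Configuration parameters) {
                        // One-time random generation, instead of once per record.
                        Random random = new Random(seed);
                        reused = new double[dims];
                        for (int i = 0; i < dims; i++) {
                            reused[i] = random.nextDouble();
                        }
                    }

                    @Override
                    public double[] map(Long ignored) {
                        return reused; // same value for all input records
                    }
                });
    }
}
```

Emitting the same object for every record is safe here because Flink copies or serializes records between operators by default; the point is simply that the random-number generation cost is paid once rather than 10^7 times.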