Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Labels: None
Description
The Flink ML benchmark framework estimates throughput by having a source operator generate a given number of input records (e.g. 10^7) with random values, letting the given AlgoOperator process these records, and dividing the number of records by the total execution time.
The overhead of generating random values for every input record has an observable impact on the estimated throughput. We would like to minimize the source operator's overhead so that the benchmark result reflects the throughput of the AlgoOperator as closely as possible.
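For concreteness, the estimate boils down to the sketch below. This is illustrative only; runBenchmark is a hypothetical stand-in for executing the source-to-AlgoOperator pipeline, not the framework's actual API.

```java
// Illustrative sketch of the throughput estimate (not the framework's code).
public class ThroughputSketch {

    // Hypothetical stand-in: build the source, run the AlgoOperator over
    // numRecords inputs, and block until the job completes.
    static void runBenchmark(long numRecords) {}

    public static void main(String[] args) {
        long numRecords = 10_000_000L; // e.g. 10^7 input records
        long start = System.nanoTime();
        // The timed region includes the source operator's per-record value
        // generation, which is why that overhead skews the estimate.
        runBenchmark(numRecords);
        double elapsedSeconds = (System.nanoTime() - start) / 1e9;
        double throughput = numRecords / elapsedSeconds; // records per second
        System.out.println("Estimated throughput: " + throughput + " records/s");
    }
}
```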
Note that spark-sql-perf generates all input records in memory before running the benchmark. This allows the Spark ML benchmark to read records from memory instead of generating values for them during the run.
Instead, we can generate a value once and re-use it for all input records. This approach minimizes the source operator overhead and allows us to fairly compare Flink ML benchmark results with Spark ML benchmark results (from spark-sql-perf).
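Below is a minimal sketch of that idea using the vanilla Flink DataStream API. The class and method names are illustrative, and the record type is simplified to double[]; the actual benchmark framework's generator API may differ.

```java
import java.util.Random;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative sketch: generate the random values exactly once per task and
// re-use the same array for every emitted record, so the per-record cost of
// the source is reduced to emitting a reference.
public class ReusedValueSourceSketch {

    public static DataStream<double[]> randomVectors(
            StreamExecutionEnvironment env, long numRecords, int dims, long seed) {
        return env.fromSequence(0, numRecords - 1)
                .map(new RichMapFunction<Long, double[]>() {
                    private transient double[] reused;

                    @Override
                    public void open(Configuration parameters) {
                        // One-time random generation, instead of once per record.
                        Random random = new Random(seed);
                        reused = new double[dims];
                        for (int i = 0; i < dims; i++) {
                            reused[i] = random.nextDouble();
                        }
                    }

                    @Override
                    public double[] map(Long ignored) {
                        return reused; // same value for all input records
                    }
                });
    }
}
```

Emitting the same object for every record is safe here because Flink copies or serializes records between operators by default; the point is simply that the random-number generation cost is paid once rather than 10^7 times.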