Spark / SPARK-37062

Introduce a new data source that provides a consistent set of rows per micro-batch




      The "rate" data source is commonly used to benchmark streaming queries.

      While this helps push a query to its limit (how many rows the query can process per second), the rate data source does not feed a consistent number of rows into each batch, which makes two environments hard to compare.

      For example, you often want to compare per-batch metrics between test environments (such as running the same streaming query with different options). These metrics are strongly affected when the distribution of input rows across batches changes, especially when a micro-batch lags (for any reason) and the rate data source produces more input rows for the next batch.
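      The effect described above can be illustrated with a simplified model. This is only a sketch, not Spark's actual rate source code: it assumes the number of rows in a batch is the configured rows per second multiplied by the seconds elapsed since the previous batch ended. The function name and parameters are hypothetical.

```python
from typing import List


def rate_source_batch_sizes(rows_per_second: int, batch_end_times_s: List[int]) -> List[int]:
    """Simplified model (an assumption, not Spark internals): each batch
    receives every row generated between the end of the previous batch
    and the end of the current one."""
    sizes = []
    previous_end = 0
    for end in batch_end_times_s:
        # A batch that ends late covers a longer interval, so it gets more rows.
        sizes.append(rows_per_second * (end - previous_end))
        previous_end = end
    return sizes


# Batches ending at 1s, 2s, 5s, 6s: the third batch lagged by two seconds.
print(rate_source_batch_sizes(100, [1, 2, 5, 6]))
```

      Under this model, the lagged batch absorbs three seconds' worth of rows, so its metrics cannot be compared directly with a run in which no batch lagged.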

      Also, when testing streaming aggregation, you may want the data source to produce the same set of input rows per batch (deterministic), so that you can plan how these input rows will be aggregated and how state rows will be evicted, and craft the test query based on that plan.

      The requirements for the new data source are as follows:

      • it should produce a specific number of input rows as requested
      • it should also include a timestamp (event time) in each row
        • to make the input rows fully deterministic, the timestamp should be configurable as well (for example, a start timestamp and the amount of advance per batch)
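      A minimal sketch of how these requirements could fit together, written in plain Python rather than as a Spark data source. All names here (`BatchSpec`, `rows_for_batch`, and the three fields) are hypothetical and chosen for illustration; the assumption that every row in a batch shares the batch's timestamp is also ours, not something the issue specifies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BatchSpec:
    """Hypothetical configuration for a deterministic per-batch source."""
    rows_per_batch: int        # exact number of rows emitted in every batch
    start_timestamp_ms: int    # event time assigned to batch 0
    advance_ms_per_batch: int  # how far event time moves for each later batch


def rows_for_batch(spec: BatchSpec, batch_id: int) -> List[Tuple[int, int]]:
    """Return (value, event_time_ms) rows for the given batch.

    The output depends only on the spec and the batch id, never on wall-clock
    time, so a lagging batch cannot change what later batches contain.
    """
    first_value = batch_id * spec.rows_per_batch
    # Assumption: all rows in one batch carry the same event time.
    event_time_ms = spec.start_timestamp_ms + batch_id * spec.advance_ms_per_batch
    return [(first_value + i, event_time_ms) for i in range(spec.rows_per_batch)]


spec = BatchSpec(rows_per_batch=3, start_timestamp_ms=1000, advance_ms_per_batch=500)
for batch_id in range(3):
    print(batch_id, rows_for_batch(spec, batch_id))
```

      Because each batch is a pure function of the configuration and the batch id, two test environments running the same query see identical input per batch, and the expected aggregation and state eviction can be worked out in advance.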




            kabhwan Jungtaek Lim