Flow control (or rate control) of input data is important for any streaming system, and especially for Spark Streaming, to keep it stable and up to date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase processing delay. With Spark Streaming's job generation and processing pattern, this delay accumulates across batches and eventually leads to unacceptable delays and failures.
Currently, Spark Streaming's receiver-based input stream has a RateLimiter in BlockGenerator that controls the ingestion rate of input data, but the current implementation has several limitations:
- The maximum ingestion rate is set by the user through configuration before the application starts, but users may lack the experience to pick an appropriate value in advance.
- This configuration is fixed for the lifetime of the application, which means you have to size it for the worst-case scenario.
- Input streams like DirectKafkaInputStream need to maintain a separate solution to achieve the same functionality.
- The lack of slow-start control means the whole system can easily fall into large processing and scheduling delays right at the beginning.
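Today the fixed limit comes from a static configuration such as `spark.streaming.receiver.maxRate` (records per second per receiver). As a minimal sketch of what such a fixed-rate limiter does, roughly in the spirit of the Guava RateLimiter that BlockGenerator wraps (the class and method names below are illustrative, not Spark's actual code): each permit is spaced a fixed interval apart, and a caller that arrives too early must wait.

```java
// Illustrative fixed-rate limiter: one permit every 1/rate seconds.
// Names are hypothetical; this is not Spark's or Guava's implementation.
class FixedRateLimiter {
    private final long intervalNanos; // spacing between permits
    private long nextFreeNanos = 0L;  // earliest time the next permit is free

    FixedRateLimiter(double permitsPerSecond) {
        this.intervalNanos = (long) (1e9 / permitsPerSecond);
    }

    /** Returns how long (ns) the caller arriving at nowNanos must wait for a permit. */
    synchronized long acquire(long nowNanos) {
        long wait = Math.max(0L, nextFreeNanos - nowNanos);
        nextFreeNanos = Math.max(nowNanos, nextFreeNanos) + intervalNanos;
        return wait;
    }
}
```

Because the rate is baked in at construction time, nothing here can react to how fast downstream processing actually is, which is exactly the limitation this proposal addresses.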
We therefore propose a new dynamic RateLimiter, together with a new RateLimiter interface, to improve the stability of the whole system. The goals are:
- Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs.
- Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and future ones.
- Apply a slow-start rate to avoid congestion when a job starts.
- Provide a pluggable framework to make extensions easier to maintain.
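To make the proposed direction concrete, here is a hypothetical sketch of a pluggable rate estimator with slow start; the actual interface lives in the design doc and working branch linked below, and every name here is invented for illustration. The idea: after each batch, feed the batch's record count and processing time back into an estimator, which ramps up aggressively at first (like TCP slow start) and then tracks the observed processing rate, capped by a configured maximum.

```java
// Hypothetical pluggable interface: implementations turn per-batch statistics
// into a new ingestion bound (records/sec). Not the interface from the branch.
interface RateEstimator {
    double computeRate(long processedRecords, long processingDelayMs);
}

// Slow start, then track the rate the cluster actually sustains.
class SlowStartRateEstimator implements RateEstimator {
    private double currentRate;
    private final double maxRate;
    private boolean rampingUp = true;

    SlowStartRateEstimator(double initialRate, double maxRate) {
        this.currentRate = initialRate;
        this.maxRate = maxRate;
    }

    @Override
    public double computeRate(long processedRecords, long processingDelayMs) {
        // Observed throughput of the last finished batch, records/sec.
        double processingRate = processedRecords * 1000.0 / Math.max(1L, processingDelayMs);
        if (rampingUp && currentRate < processingRate) {
            // Slow-start phase: double the bound until we reach what the cluster can do.
            currentRate = Math.min(currentRate * 2, Math.min(processingRate, maxRate));
        } else {
            // Steady state: follow the observed processing rate, capped by maxRate.
            rampingUp = false;
            currentRate = Math.min(processingRate, maxRate);
        }
        return currentRate;
    }
}
```

Because the estimator sees only batch statistics, the same implementation could back both the receiver-side BlockGenerator and a direct stream that sizes its per-batch offset ranges, which is how a single pluggable interface could cover both cases.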
Here are the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and the working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
Any comments would be greatly appreciated.