bc Wong, thanks for your interest in the timeline server and for sharing your idea. Here are some of my opinions and our earlier rationale.
Let's have reliability before speed. I think one of the requirement of ATS is: The channel for writing events should be reliable.
I agree that reliability is an important requirement of the timeline server, but the other requirements, such as scalability and efficiency, are orthogonal to it, so there is no fixed order in which they must be addressed. We can pursue both enhancements, can't we?
I'm using reliable here in a strong sense, not the TCP-best-effort style reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and robust.)
IMHO, it may be unfair to compare the reliability of TCP with that of HDFS or Kafka, because they sit on different layers of the communication stack. HDFS and Kafka are also built on top of TCP for communication, right? In my previous comments, I mentioned that we need to clearly define *reliability*, and I'd like to highlight it here again:
1. Server is reliable: once timeline entities are passed to the timeline server, it should prevent them from being lost. After YARN-2032, we're going to have an HBase timeline store to ensure this.
2. Client is reliable: once timeline entities are handed over to the timeline client, and before the timeline client has successfully put them into the timeline server, it should prevent them from being lost on the client side. We may use some techniques to cache the entities locally. I opened YARN-2521 to track the discussion along this direction.
Between client and server, TCP is a trustworthy protocol. If the client gets an ACK from the server, we should be confident that the server has already received the entities.
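To make the "client is reliable" point concrete, here is a minimal sketch of the idea, not the design being discussed in YARN-2521. The class `BufferingClientSketch`, its methods, and the `sender` callback are all hypothetical names, entities are plain strings, and the buffer is kept in memory for brevity, whereas a real client would more likely persist it locally so that it survives a client restart.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/**
 * Hypothetical client-side buffer (not a YARN API): an entity is removed
 * from the buffer only after the server has acknowledged it.
 */
public class BufferingClientSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    // Assumed transport hook: returns true only when the server ACKs the put.
    private final Predicate<String> sender;

    public BufferingClientSketch(Predicate<String> sender) {
        this.sender = sender;
    }

    /** Accept an entity; it stays buffered until a flush delivers it. */
    public synchronized void put(String entity) {
        pending.addLast(entity);
    }

    /**
     * Try to deliver buffered entities in order and stop at the first
     * failure. Returns the number of entities acknowledged in this call.
     */
    public synchronized int flush() {
        int delivered = 0;
        while (!pending.isEmpty()) {
            String next = pending.peekFirst();
            if (!sender.test(next)) {
                break; // no ACK: keep this entity and the rest for a later retry
            }
            pending.removeFirst();
            delivered++;
        }
        return delivered;
    }

    /** Number of entities still waiting for an ACK. */
    public synchronized int pendingCount() {
        return pending.size();
    }
}
```

With this shape, a server outage only delays delivery: entities accumulate in the buffer and are sent in their original order once `flush()` starts getting ACKs again.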
A normal RPC connection is not. I don't want the ATS to be able to slow down my writes, and therefore, my applications, at all.
I'm not sure there is a direct relationship between reliability and non-blocking writes. For example, submitting an app via YarnClient to an HA RM is reliable, but the user is still likely to be blocked until the app submission is acknowledged. Whether writing events is blocking or non-blocking depends on how the user uses the client. In YARN-2033, I make the RM put the entities on a separate thread, to avoid blocking the dispatcher that manages the YARN app lifecycle. I can see that non-blocking writes are a useful optimization, so I've opened YARN-2517 to implement TimelineClientAsync for general usage.
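As an illustration of putting entities on a separate thread, here is a minimal sketch. It is not the actual TimelineClientAsync API from YARN-2517 nor the YARN-2033 code: `AsyncPublisherSketch`, `putAsync`, and the `blockingPut` callback are hypothetical names, and entities are plain strings.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/**
 * Hypothetical wrapper (not a YARN API) that moves a blocking put onto a
 * single background thread, so the caller's thread is not held up.
 */
public class AsyncPublisherSketch implements AutoCloseable {
    // One worker thread keeps the entities in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Assumed hook for the existing blocking put to the timeline server.
    private final Consumer<String> blockingPut;

    public AsyncPublisherSketch(Consumer<String> blockingPut) {
        this.blockingPut = blockingPut;
    }

    /** Queue the entity and return immediately; the Future completes after the put. */
    public Future<?> putAsync(String entity) {
        Runnable task = () -> blockingPut.accept(entity);
        return worker.submit(task);
    }

    /** Stop accepting new work; already queued puts are still executed. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A caller that does not care about the outcome can ignore the returned `Future`, while a caller that needs confirmation can wait on it. Either way, blocking becomes the caller's choice rather than a property of the channel.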
Yes, you could make a distributed reliable scalable "ATS service" to accept writing events. But that seems a lot of work, while we can leverage existing technologies.
AFAIK, the timeline server is stateless, so it should not be difficult to use ZooKeeper to manage a number of instances that write to the same HBase cluster. We may need to pay attention to load balancing and concurrent writes, but I'm not sure it will really be a lot of work. Please let me know if I've neglected some important pieces. Within YARN, we already accumulated similar experience when making the RM highly available, and it turned out to be a practical solution. Again, this is about scalability, which is orthogonal to reliability. Even if we pass the timeline entities to the timeline server via Kafka/HDFS, a single server is still going to be the bottleneck when processing a large volume of requests, no matter how big the Kafka/HDFS cluster is.
If the channel itself is pluggable, then we have lots of options. Kafka is a very good choice, for sites that already deploy Kafka and know how to operate it. Using HDFS as a channel is also a good default implementation, for people already know how to scale and manage HDFS.
I don't object to having different entity publishing channels, but my concern is that the effort of maintaining the timeline client will multiply with the number of channels. As the timeline server is going to be a long-term project, we cannot neglect the additional workload of evolving all the channels. This is similar to the concern behind our wish to remove the FS-based history store (see YARN-2320). Cooperatively improving the current channel may be the more cost-efficient choice. It's good to think more before opening a new channel.
In addition, the default solution should be simple and self-contained. A heavy solution with complex configuration and large dependencies is likely to steepen the learning curve, keep new adopters away, and complicate fast, small-scale deployment.