Guys, are you ok with answers provided above, except those still under discussion as described below?
Yep, I think so.
GroupIntoNSets is confusing/not very useful as defined. Chris had a good idea to set the N there to be the same as the number of containers. This is likely its most common use case. This will move the code into some package that's aware of YARN since it requires a YARN-specific config (or we need to change the yarn container count config to be more generic, but that's out of scope here).
I still feel like the strategy is just a bit weird. As I understand it, the reason that it exists is just to reduce the number of TIs (vs. just having one TI per-partition, and spreading the TIs evenly amongst containers). TIs are cheap, though. The only issue we've seen (that I can recall) with having a lot of TIs is that this can cause an explosion of the number of metrics that are reported. In the case of the JmxReporter and MetricsSnapshotReporter, there isn't any problem with this. Some other systems can have problems with the number of metrics being reported, but I think there are better solutions to that problem than introducing a grouping strategy to reduce the number of TIs. Better solutions to this problem would be to filter (either with a debug-level, or a black list) the number of metrics, or just use metrics infrastructure that can handle 10k-100k metrics/job. I don't think we need this strategy.
The bookkeeping necessary to support state and these new features should be kept away from the state log itself and moved into some central location with other per-job info (whatever form that takes)
I'm not wild about shard, as I think that's a pretty key term for databases and may lead people to think we this grouping functions the same way.
Task Name has a pretty big flaw it conflates these tasks with map-reduce tasks, when in fact our closest analog is the Samza container.
The issue I take with adding any new word is that we already have the word "task" deeply ingrained in Samza (StreamTask, TaskInstance, TaskLifecycleListener, TaskContext, TaskCoordinator, TaskStorageManager, etc). Some of these are public user-facing highly visible APIs. The fact is, we have a task, and it means a certain thing. It's documented. Talks have been given on it. Adding a new word to describe a task, while still keeping the word" task" everywhere else, just makes things worse, not better.
If we want to open up a separate discussion about refactoring everything to change the task stuff to something else, we can discuss that, but I see it as an orthogonal discussion to this issue. Simply adding "cohort" for one thing in this ticket, which diverges from the entire rest of the code-base, seems wrong to me. I remain pretty strongly convinced that taskName is in keeping with the rest of the code base as it exists today.
Essentially, it's a chicken-egg thing - people won't use this feature and find its limitations/build cooler things with Samza, if it's not a feature we publicize and make available.
I can live with this. (edit: by "I can live with this," I mean I can live with making the feature pluggable ).
This approach seems like the smallest necessary change to implement the current JIRA and doesn't expose anything that can't be changed in the future, absent more discussion.
Broadly, I agree with you. I think the ConfigLog is totally something we should explore in a separate ticket to figure out what stuff should be added to it, how it would work, etc. My main concern here is about the sequencing of things. The smallest necessary change to implement the current JIRA is actually just to add the information to the existing checkpoint.
Here's my concern with adding the ConfigLog as part of this JIRA: we don't think the implications of the ConfigLog all the way through because we want to limit the scope of this JIRA and finish it faster (with fewer code changes/less risk). As a result, we now have two topics: ConfigLog and the checkpoint topic. Then, we open up a second ConfigLog JIRA, and discover some flaw with it, or decide we want a different design. Now we have two topics to migrate (checkpoint log, and config log), as well as two different chunks of code to update/delete, and three different versions of a job (one with just checkpoints, one with checkpoint and config log from this jira, and one with checkpoint and config log from new jira). This is complicated.
This concern leads me to want to just put the cohort/SSP and state mapping stuff into the existing Checkpoint class and use our existing CheckpointManager. As a follow on JIRA, we can do a nice detailed design doc on the ConfigLog, and think through things in more detail. Since we'll have to migrate the checkpoint no matter what, sticking the cohort/SSP stuff into the checkpoint shouldn't introduce any additional work, and not committing to a ConfigLog now will mean we'll feel more comfortable thinking openly about how exactly it should be implemented (rather than being tied to the implementation that falls out of this JIRA, which might or might not be the best approach when we dig into it in more detail).