1. Agreed. Will remove (per-RB comments).
2. I have added this segment to the wiki:
Since the MessageChooser only receives one message at a time per SSP, it can be used to order messages between different SSPs, but it can't be used to re-order messages within a single SSP (a buffered sort). This must be done within a StreamTask.
I have also updated the Javadocs for MessageChooser to show this. (not yet pushed to a new patch, though)
3. We could do random. I think I'll probably do whatever is easiest to implement, whether that's random, round robin, or whatever. Just as long as it's not a strategy that can lead to starvation.
4.1 Agreed. I do, however, want to make the config as simple as possible. The default when no configs are set would be all that all streams are set to an equal priority of 0, and no stream is bootstrapped.
To just bootstrap a stream:
This would set currencyCodes' priority to Int.MAX and make the SystemConsumers class force messages for the currencyCodes stream.
To override a priority you could then do:
To bootstrap and prioritize simultaneously:
This would bootstrap currencyCodes, and then favor messages from liveStream over messages fro batchComputedStream.
Bootstrapping multiple streams in a specific order could then be done using:
This would bootstrap currencyCodes first, then currencyConversions, and then read messages from all streams as usual.
4.2 I think there are a lot of ways to implement the bootstrapping feature. Here are a few:
1. Give the StreamTask some way to tell the container which stream it wants next.
2. Make the container register only the bootstrap streams with the SystemContainers class first, then register the rest of the streams when the bootstrap is complete.
3. The wiki proposal.
4. StreamTask buffers all non-boostrap-stream messages that it can't process yet because the bootstrap streams haven't been fully bootstrapped.
5. Declare that all bootstrap streams must be fed directly into a StorageEngine.
All of these except (4) require code changes in the SamzaContainer/SystemConsumers code, and (4) is a non-workable because it could force a StreamTask to buffer an unbounded amount of data. Even (4) would require a slight code change so that the StreamTask can know when a stream has been fully bootstrapped (envelope.getOffset == lastOffsetInStreamPartition).
(5) is kind of interesting because the StorageEngine.restore phase is handled at a different point in the lifecycle, before the StreamTask handles any messages. One knock on this approach is that it forces a bootstrap stream to be materialized to a StorageEngine without any transformations. For example, if you had a profile stream, where messages were ~5kb each (due to plain text in profile description), but all you wanted was the user's name, the StorageEngine would still be restored with the full profile messages in the store, which is wasteful. It also wouldn't allow for any aggregation or filtering of a bootstrap stream's messages.
5. Totally agree. I think we'll just punt on the whole time-alignment thing now. Developers can either implement a MessageChooser that does approximate time alignment, or a StreamTask, which does a time-aligned buffered sort (or they can do both). If we decide to add stronger time support in the future, I can't see any backwards incompatible changes that would need to be made, so just making this a feature that developers have to implement seems safest and most general for now.