Navina Ramesh Thanks so much for your valuable feedback. Please take a look at the updated RB when you are free. Regarding your comments on the design doc, I have updated it as well; here are my answers to your questions:
Q: Is the “End of Stream” feature a pre-requisite for the HDFS consumer? If yes, link the corresponding JIRA and design document. Providing a high-level description of how that feature will be leveraged to solve this problem will lay more groundwork for readers who are not familiar with it.
A: Yes. Updated the doc and the JIRA to reflect that SAMZA-974 is a pre-requisite.
Q: One of the goals and one of the non-goals overlap slightly: "(Goal) The system consumer should support a variety of folder structures and filename conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and filename formats". Can you specifically call out which structures and conventions you are supporting, or call out which ones you are not supporting? Just to add more clarity to the document.
A: Updated the doc to be more specific.
Q: Along with the 3rd point under Assumptions, you can call out "write-once, read-many" as the underlying usage pattern.
A: Updated the doc.
Q: What do the whitelist and blacklist here consist of? Why do we need both? Can you provide an example of how this config will look?
A: As pointed out in the design doc, this is to simplify the regex by having two regexes instead of one. Many systems, including Kafka, do this. You can always craft a single regex that combines the whitelist and blacklist, but that regex would get complicated. Updated the doc to give examples.
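To illustrate the point about two simple regexes versus one complicated one, here is a minimal sketch (not the actual Samza implementation; the class name, patterns, and paths are hypothetical) of how a whitelist/blacklist pair can filter HDFS file paths: a file is consumed only if it matches the whitelist and does not match the blacklist.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of whitelist/blacklist path filtering.
// Combining both rules into a single regex would require negative
// lookaheads, e.g. "(?!.*/tmp/).*\.avro", which is much harder to read.
public class HdfsPathFilter {
    private final Pattern whitelist;
    private final Pattern blacklist;

    public HdfsPathFilter(String whitelistRegex, String blacklistRegex) {
        this.whitelist = Pattern.compile(whitelistRegex);
        this.blacklist = Pattern.compile(blacklistRegex);
    }

    // Accept a path only if the whitelist matches and the blacklist does not.
    public boolean accept(String path) {
        return whitelist.matcher(path).matches()
            && !blacklist.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Whitelist all .avro files; blacklist anything under a tmp directory.
        HdfsPathFilter f = new HdfsPathFilter(".*\\.avro", ".*/tmp/.*");
        System.out.println(f.accept("/data/events/part-00000.avro")); // true
        System.out.println(f.accept("/data/tmp/part-00000.avro"));    // false
        System.out.println(f.accept("/data/events/part-00000.json")); // false
    }
}
```

Each pattern stays trivial on its own, which is the motivation for keeping them as two separate configs.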
Q: "In case of the repartitioner, multiple Samza tasks cannot write to the same file. Hence, each task can write to a separate file within the partition directory" -> what defines the ordering among these files when the downstream job is consuming? Is it based on timestamp?
A: In this case there is no ordering among these files. Let's imagine that, instead of writing to HDFS, we write to Kafka: there is likewise no guaranteed ordering within a Kafka topic partition among events coming from different upstream producers.
Q: When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it done by the job coordinator or by each container?
A: This is more of an implementation detail, so I didn't provide specifics in the doc. You are right, it's done by the job coordinator. In the current implementation it happens when getSystemStreamMetadata is called.
Q: Is the PartitionDescriptor file expected to follow any convention? Or is it simply going to contain a map?
A: It's simply a map in JSON format.
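For concreteness, a sketch of what such a JSON map might look like (the keys, value shape, and paths here are hypothetical, not a spec): partition IDs mapping to the files that make up each partition, consistent with the point above that one partition directory may contain multiple task-written files.

```json
{
  "0": ["/data/output/partition-0/task-0.avro", "/data/output/partition-0/task-1.avro"],
  "1": ["/data/output/partition-1/task-0.avro"]
}
```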