Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels: None

      Issue Links

        Activity

        lhaiesp Hai added a comment - https://reviews.apache.org/r/51142/
        navina Navina Ramesh added a comment -

        Hai Sorry for the delay in my review. I strongly urge you to post a [DISCUSS] or [RFC] email on the dev mailing list to get more eyes on your work and, potentially, more feedback.

        Overall, the design document looks good. I have a couple of questions:

        • Is the “End of Stream” feature a pre-requisite for the HDFS consumer? If yes, link the corresponding JIRA and design document. Providing a high-level description of how that feature will be leveraged to solve this problem will lay more groundwork for readers who are not familiar with it.
        • One of the goals and one of the non-goals slightly overlap: "(Goal) The system consumer should support a variety of folder structures and filename conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and filename formats". Can you specifically call out which structures and conventions you are supporting, or call out which ones you are not supporting? Just to add more clarity to the document.
        • Along with the 3rd point under Assumptions, you can call out "write-once, read-many" as the underlying usage pattern.
        • What do the whitelist and blacklist here consist of? Why do we need both? Can you provide an example of what this config will look like?
        • "In case of repartitioner, multiple samza tasks cannot write to the same file. Hence, each task can write in a separate file within the partition directory" -> What defines the ordering among these files when the downstream job is consuming? Is it based on timestamp?
        • When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it done by the job coordinator or by each container?
        • Is the PartitionDescriptor file expected to follow any convention? Or is it simply going to contain a map?

        Cheers!

        PS: I am looking at your RB now

        navina Navina Ramesh added a comment -

        Hai Left some comments in the RB. Thanks!

        lhaiesp Hai added a comment -

        Navina Ramesh Thanks so much for your valuable feedback. Please take a look at the updated RB when you are free. As for your comments on the design doc, I have updated the document as well; here are my answers to your questions:

        Q: Is the “End of Stream” feature a pre-requisite for the HDFS consumer? If yes, link the corresponding JIRA and design document. Providing a high-level description of how that feature will be leveraged to solve this problem will lay more groundwork for readers who are not familiar with it.
        A: Yes. I updated the doc and the JIRA to reflect that SAMZA-974 is a pre-requisite.

        Q: One of the goals and one of the non-goals slightly overlap: "(Goal) The system consumer should support a variety of folder structures and filename conventions" and "(Non-Goal) Support ALL kinds of HDFS folder structures and filename formats". Can you specifically call out which structures and conventions you are supporting, or call out which ones you are not supporting? Just to add more clarity to the document.
        A: Updated the doc to be more specific.

        Q: Along with the 3rd point under Assumptions, you can call out "write-once, read-many" as the underlying usage pattern.
        A: Done

        Q: What do the whitelist and blacklist here consist of? Why do we need both? Can you provide an example of what this config will look like?
        A: As pointed out in the design doc, this is to keep the regexes simple by having two instead of one. Many systems, including Kafka, do this. You could always craft one regex that combines the whitelist and blacklist, but it would look complicated. Updated the doc to give examples.
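For illustration, the whitelist/blacklist filtering described above could be sketched as follows. This is a hypothetical sketch, not code from the actual patch; the class name `PathFilter` and its API are made up here, and the real Samza config keys and matching semantics may differ.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: keep paths that match the whitelist regex
// and do not match the blacklist regex.
public class PathFilter {
    private final Pattern whitelist;
    private final Pattern blacklist;

    public PathFilter(String whitelistRegex, String blacklistRegex) {
        this.whitelist = Pattern.compile(whitelistRegex);
        this.blacklist = Pattern.compile(blacklistRegex);
    }

    public List<String> filter(List<String> paths) {
        return paths.stream()
            .filter(p -> whitelist.matcher(p).matches())   // must match whitelist
            .filter(p -> !blacklist.matcher(p).matches())  // must not match blacklist
            .collect(Collectors.toList());
    }
}
```

The advantage of the two-regex form is readable intent: "all Avro files" as the whitelist plus "anything still being written" as the blacklist, instead of one negative-lookahead regex that encodes both.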

        Q: "In case of repartitioner, multiple samza tasks cannot write to the same file. Hence, each task can write in a separate file within the partition directory" -> What defines the ordering among these files when the downstream job is consuming? Is it based on timestamp?
        A: In this case there is no ordering among these files. Imagine that, instead of writing to HDFS, we write to Kafka: you would also have no ordering within the topic partition when the events come from different upstream producers.

        Q: When does the HDFSSystemAdmin write the PartitionDescriptor to HDFS? Is it done by the job coordinator or by each container?
        A: This is more of an implementation detail, so I didn't provide specifics in the doc. You are right, it's done by the job coordinator. In the current implementation, it happens when getSystemStreamMetadata is called.

        Q: Is the PartitionDescriptor file expected to follow any convention? Or is it simply going to contain a map?
        A: It's simply a map in JSON format.
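To make the idea concrete, a descriptor of this kind could map each partition id to the files backing that partition. The paths and layout below are purely illustrative, not the exact format used by the patch:

```json
{
  "0": ["hdfs://nn/data/mystream/part-00000.avro", "hdfs://nn/data/mystream/part-00003.avro"],
  "1": ["hdfs://nn/data/mystream/part-00001.avro"],
  "2": ["hdfs://nn/data/mystream/part-00002.avro"]
}
```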

        navina Navina Ramesh added a comment -

        Thanks for updating the design doc, Hai, and clarifying some of my questions.

        > In this case there is no ordering among these files. Imagine that, instead of writing to HDFS, we write to Kafka: you would also have no ordering within the topic partition when the events come from different upstream producers.
        >> Ok. Let's say my repartitioner writes to a partition directory. If there is no implicit ordering defined in the output itself, how does a downstream HDFS consumer guarantee deterministic consumption? That is what I am not clear about.

        lhaiesp Hai added a comment -

        You brought up a good point. There is no guarantee of deterministic consumption if repartitioning happens. But my point is that we are not able to solve this problem for Kafka either. Let's say we do repartitioning for a job that reads from Kafka and writes to Kafka: how do you guarantee a consistent result now? Well, you could argue that a deterministic repartitioning result is not needed in the case of Kafka (a stream processing job) but is relevant for HDFS (essentially a batch processing job). I have to admit that I don't have a good solution to your question as of now.

        navina Navina Ramesh added a comment - edited

        > Let's say we do repartitioning for a job that reads from Kafka and writes to Kafka: how do you guarantee a consistent result now?
        >> You are right. The ordering is not guaranteed. But it is consistent every time I replay a partition. That's what is missing here.

        > I have to admit that I don't have a good solution to your question as of now.
        >> That's alright. I don't see a good solution here without imposing some assumption on the ordering of files. However, this will affect how downstream processors behave and is worth calling out in the code and documentation. Thanks!
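One simple convention along the lines discussed above would be to impose a total order on the files within a partition directory, for example by sorting the discovered paths lexicographically before consumption, so every replay visits them in the same sequence. This is only a sketch of the idea under that assumption, not the actual Samza implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: derive a deterministic consumption order for the
// files of a partition directory. Sorting makes replays consistent even
// though the listing order returned by the filesystem may vary.
public class DeterministicOrder {
    public static List<String> orderedFiles(List<String> discoveredPaths) {
        List<String> sorted = new ArrayList<>(discoveredPaths);
        // Lexicographic sort: the same input set yields the same order
        // on every replay.
        Collections.sort(sorted);
        return sorted;
    }
}
```

Note this only makes replay consistent for a fixed set of files; as the thread points out, it does not recover the original inter-file event ordering, which is lost at write time.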

        xinyu Xinyu Liu added a comment -

        Merged and committed. Thanks!


  People

    • Assignee: lhaiesp Hai
    • Reporter: lhaiesp Hai
    • Votes: 0
    • Watchers: 3