Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.5.0
    • Labels:
      None

      Description

      After running into issues related to the comm layer, I took a deeper look and realized parts of the code are really brittle or simply wrong, in particular in TCPEmitter (e.g. sync on a Boolean).

      We need a more thorough review of that code so we can improve its robustness.

        Activity

        Hide
        Matthieu Morel added a comment -

        Since all participants agreed, I have just added some logging and cleaned the javadoc, and merged this in the piper branch.

        https://git-wip-us.apache.org/repos/asf?p=incubator-s4.git;a=commit;h=19b689ed53273ac825adb70f4874b84553efa4d3

        Show
        Matthieu Morel added a comment - Since all participants agreed, I have just added some logging and cleaned the javadoc, and merged this in the piper branch. https://git-wip-us.apache.org/repos/asf?p=incubator-s4.git;a=commit;h=19b689ed53273ac825adb70f4874b84553efa4d3
        Hide
        Karthik Kambatla added a comment -

        I completely agree with your suggestion. Let us push the release with the patch here, but not forget to fix S4-41 and S4-45 for the next release.

        Show
        Karthik Kambatla added a comment - I completely agree with your suggestion. Let us push the release with the patch here, but not forget to fix S4-41 and S4-45 for the next release.
        Hide
        Matthieu Morel added a comment -

        Thanks Karthik for the feedback!

        It would be nice indeed to have more guarantees, but as Daniel suggested, the first requirement for the coming release is that things work as expected under normal conditions.

        Performance and optimizations are really tasks that we should leave for a subsequent release. I will suggest to retarget most related tickets for 0.6, in order to avoid further delay.

        So my proposal is to incorporate Daniel's patch, keep working on the release (fixes, docs, testing), and leave S4-41 and S4-45 for later.

        Of course, you may still work on the issues you mentioned, but in that case I think you should try a different approach, and they probably won't be in the coming release (not a problem).

        For info, I just uploaded in branch S4-75 a modification of the producer/consumer test that usually fails because messages get lost, even though there is no disconnection. When Daniel's patch is applied, all messages are correctly processed.

        Show
        Matthieu Morel added a comment - Thanks Karthik for the feedback! It would be nice indeed to have more guarantees, but as Daniel suggested, the first requirement for the coming release is that things work as expected under normal conditions. Performance and optimizations are really tasks that we should leave for a subsequent release. I will suggest to retarget most related tickets for 0.6, in order to avoid further delay. So my proposal is to incorporate Daniel's patch, keep working on the release (fixes, docs, testing), and leave S4-41 and S4-45 for later. Of course, you may still work on the issues you mentioned, but in that case I think you should try a different approach, and they probably won't be in the coming release (not a problem). For info, I just uploaded in branch S4-75 a modification of the producer/consumer test that usually fails because messages get lost, even though there is no disconnection. When Daniel's patch is applied, all messages are correctly processed.
        Hide
        Karthik Kambatla added a comment -

        Hi Daniel/Matthieu,

        Firstly, thanks for finding the issues with the TCPEmitter. However, I believe we need to support guaranteed event delivery with TCP. In the absence of guaranteed delivery, one might as well use UDP as it is faster. Also, it was this requirement that led to the complication of TCP code.

        That said, I feel we can take one of the two approaches –
        1. Find the issues with S4-7 and fix them; I wonder if we can take any other approach but maintain message queues for failed deliveries in the presence of network partitions.
        2. Push in Daniel's change, and work on S4-41 for guaranteed event delivery. Use Netty with UDP for performance - S4-45.

        I am sorry for not being able to work on the JIRAs assigned to me. Please re-assign if needed. Meanwhile, I ll try to work on S4-45 over the weekend.

        Show
        Karthik Kambatla added a comment - Hi Daniel/Matthieu, Firstly, thanks for finding the issues with the TCPEmitter. However, I believe we need to support guaranteed event delivery with TCP. In the absence of guaranteed delivery, one might as well use UDP as it is faster. Also, it was this requirement that led to the complication of TCP code. That said, I feel we can take one of the two approaches – 1. Find the issues with S4-7 and fix them; I wonder if we can take any other approach but maintain message queues for failed deliveries in the presence of network partitions. 2. Push in Daniel's change, and work on S4-41 for guaranteed event delivery. Use Netty with UDP for performance - S4-45 . I am sorry for not being able to work on the JIRAs assigned to me. Please re-assign if needed. Meanwhile, I ll try to work on S4-45 over the weekend.
        Hide
        Daniel Gómez Ferro added a comment -

        This patch simplifies TCPEmitter removing queues, retrying of failed events, etc. and fixing the loss of events under a normal run.

        It still handles the 2 first failure scenarios of S4-7 but not the third.

        Show
        Daniel Gómez Ferro added a comment - This patch simplifies TCPEmitter removing queues, retrying of failed events, etc. and fixing the loss of events under a normal run. It still handles the 2 first failure scenarios of S4-7 but not the third.
        Hide
        Daniel Gómez Ferro added a comment -

        I've been having issues with the TCPEmitter. Lots of events were lost under high loads, here are some example runs:

        Successful: 47806  Lost: 241024
        Successful: 220771 Lost: 68127
        Successful: 53524  Lost: 235283
        

        I tried a simpler implementation of the TCPEmitter (no queues, doesn't resend if an event is lost...) and I didn't lose any event.

        I propose to substitute the current TCPEmitter, introduced in S4-7, with this simpler implementation. It fixes these problems from the original JIRA:

        1. If the underlying topology changes, the channels and the associated connections are not updated.
        2. If a connection gets disconnected, it stays disconnected.

        But not this one:

        3. If for any reason, a connection can't be made, send() drops the message to be sent.

        I think it is worth it to lose some events on a connection failure instead of losing them when all connections are working properly.

        We can open a JIRA to reincorporate a queuing mechanism that fixes problem 3. in the future.

        Show
        Daniel Gómez Ferro added a comment - I've been having issues with the TCPEmitter. Lots of events were lost under high loads, here are some example runs: Successful: 47806 Lost: 241024 Successful: 220771 Lost: 68127 Successful: 53524 Lost: 235283 I tried a simpler implementation of the TCPEmitter (no queues, doesn't resend if an event is lost...) and I didn't lose any event. I propose to substitute the current TCPEmitter, introduced in S4-7 , with this simpler implementation. It fixes these problems from the original JIRA: 1. If the underlying topology changes, the channels and the associated connections are not updated. 2. If a connection gets disconnected, it stays disconnected. But not this one: 3. If for any reason, a connection can't be made, send() drops the message to be sent. I think it is worth it to lose some events on a connection failure instead of losing them when all connections are working properly. We can open a JIRA to reincorporate a queuing mechanism that fixes problem 3. in the future.

          People

          • Assignee:
            Unassigned
            Reporter:
            Matthieu Morel
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development