Flume
  1. Flume
  2. FLUME-1124

Lifecycle supervisor can cause thread contention, sometimes causing components to not startup.

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v1.1.0
    • Fix Version/s: v1.2.0
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Based on Brock's review of my patch, the following issues have been added to this jira:

      This isn't your code but it looks like there are a few problems here.

      1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

      java.lang.NullPointerException
      at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:662)

      2) The NPE is not being logged.
      3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.
      4) Calling first supervise() and then start() creates two threads monitoring this service.

      1. FLUME-1124-1.patch
        2 kB
        Hari Shreedharan
      2. FLUME-1124-3.patch
        4 kB
        Hari Shreedharan

        Activity

        Hari Shreedharan created issue -
        Hide
        Hari Shreedharan added a comment -

        If there are several sinks and sources, the lifecycle supervisor for the node can cause thread contention. There is one single thread pool with 5 threads in the pool. Each component's monitorRunnable is scheduled to run immediately, each of which schedules itself to run after 3 seconds. This can cause issues if after 3 seconds some components remain to be started. They may not get started for a long time because the executor service's threads may be busy running the monitorRunnables which are scheduled by the initial component start runs.

        I am separating the starts to a different service - which is a SingleThreadExecutor.

        Show
        Hari Shreedharan added a comment - If there are several sinks and sources, the lifecycle supervisor for the node can cause thread contention. There is one single thread pool with 5 threads in the pool. Each component's monitorRunnable is scheduled to run immediately, each of which schedules itself to run after 3 seconds. This can cause issues if after 3 seconds some components remain to be started. They may not get started for a long time because the executor service's threads may be busy running the monitorRunnables which are scheduled by the initial component start runs. I am separating the starts to a different service - which is a SingleThreadExecutor.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/
        -----------------------------------------------------------

        Review request for Flume.

        Summary
        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.
        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs


        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing
        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hari Shreedharan made changes -
        Field Original Value New Value
        Attachment FLUME-1124-1.patch [ 12522919 ]
        Hari Shreedharan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6979
        -----------------------------------------------------------

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java
        <https://reviews.apache.org/r/4753/#comment15531>

        Should componentStartService be shutdown?

        • Brock

        On 2012-04-17 07:06:54, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 07:06:54)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6979 ----------------------------------------------------------- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java < https://reviews.apache.org/r/4753/#comment15531 > Should componentStartService be shutdown? Brock On 2012-04-17 07:06:54, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 07:06:54) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/
        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58.103483)

        Review request for Flume.

        Changes
        -------

        Shutting down the componentStartService on stop call.

        Summary
        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.
        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs (updated)


        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing
        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58.103483) Review request for Flume. Changes ------- Shutting down the componentStartService on stop call. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs (updated) flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-04-17 14:34:44, Brock Noland wrote:

        > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 83

        > <https://reviews.apache.org/r/4753/diff/2/?file=102361#file102361line83>

        >

        > Should componentStartService be shutdown?

        It isn't really required, because each of the components are shutdown in the lines that immediately follow. But I have submitted an update, with a shutdown call.

        • Hari

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6979
        -----------------------------------------------------------

        On 2012-04-17 15:06:58, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - On 2012-04-17 14:34:44, Brock Noland wrote: > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 83 > < https://reviews.apache.org/r/4753/diff/2/?file=102361#file102361line83 > > > Should componentStartService be shutdown? It isn't really required, because each of the components are shutdown in the lines that immediately follow. But I have submitted an update, with a shutdown call. Hari ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6979 ----------------------------------------------------------- On 2012-04-17 15:06:58, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6983
        -----------------------------------------------------------

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java
        <https://reviews.apache.org/r/4753/#comment15535>

        This isn't your code but it looks like there are a few problems here.

        1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

        java.lang.NullPointerException
        at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

        2) The NPE is not being logged.
        3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.
        4) Calling first supervise() and then start() creates two threads monitoring this service.

        • Brock

        On 2012-04-17 15:06:58, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6983 ----------------------------------------------------------- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java < https://reviews.apache.org/r/4753/#comment15535 > This isn't your code but it looks like there are a few problems here. 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown: java.lang.NullPointerException at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2) The NPE is not being logged. 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself. 4) Calling first supervise() and then start() creates two threads monitoring this service. Brock On 2012-04-17 15:06:58, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-04-17 15:44:12, Brock Noland wrote:

        > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68

        > <https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68>

        >

        > This isn't your code but it looks like there are a few problems here.

        >

        > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

        >

        > java.lang.NullPointerException

        > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)

        > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

        > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

        > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        > at java.lang.Thread.run(Thread.java:662)

        >

        > 2) The NPE is not being logged.

        > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.

        > 4) Calling first supervise() and then start() creates two threads monitoring this service.

        Hi Brock,

        None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation:
        1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop.
        3) Agree this is an issue, but because of the explanation to (1 and 2) above.
        4) Agreed, we should not be doing this. I will update the relevant code.

        • Hari

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6983
        -----------------------------------------------------------

        On 2012-04-17 15:06:58, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - On 2012-04-17 15:44:12, Brock Noland wrote: > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68 > < https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68 > > > This isn't your code but it looks like there are a few problems here. > > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown: > > java.lang.NullPointerException > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > > 2) The NPE is not being logged. > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself. > 4) Calling first supervise() and then start() creates two threads monitoring this service. Hi Brock, None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation: 1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop. 3) Agree this is an issue, but because of the explanation to (1 and 2) above. 4) Agreed, we should not be doing this. I will update the relevant code. Hari ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6983 ----------------------------------------------------------- On 2012-04-17 15:06:58, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hari Shreedharan made changes -
        Description Based on Brock's review of my patch, the following issues have been added to this jira:

        This isn't your code but it looks like there are a few problems here.

        1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

        java.lang.NullPointerException
        at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

        2) The NPE is not being logged.
        3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.
        4) Calling first supervise() and then start() creates two threads monitoring this service.

        Hari Shreedharan made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-04-17 15:44:12, Brock Noland wrote:

        > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68

        > <https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68>

        >

        > This isn't your code but it looks like there are a few problems here.

        >

        > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

        >

        > java.lang.NullPointerException

        > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)

        > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

        > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

        > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        > at java.lang.Thread.run(Thread.java:662)

        >

        > 2) The NPE is not being logged.

        > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.

        > 4) Calling first supervise() and then start() creates two threads monitoring this service.

        Hari Shreedharan wrote:

        Hi Brock,

        None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation:

        1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop.

        3) Agree this is an issue, but because of the explanation to (1 and 2) above.

        4) Agreed, we should not be doing this. I will update the relevant code.

        I did not mean to imply this was caused by your change, but if we are "fixing" the class we might as well fix these issues as well.

        • Brock

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6983
        -----------------------------------------------------------

        On 2012-04-17 15:06:58, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - On 2012-04-17 15:44:12, Brock Noland wrote: > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68 > < https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68 > > > This isn't your code but it looks like there are a few problems here. > > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown: > > java.lang.NullPointerException > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > > 2) The NPE is not being logged. > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself. > 4) Calling first supervise() and then start() creates two threads monitoring this service. Hari Shreedharan wrote: Hi Brock, None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation: 1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop. 3) Agree this is an issue, but because of the explanation to (1 and 2) above. 4) Agreed, we should not be doing this. I will update the relevant code. I did not mean to imply this was caused by your change, but if we are "fixing" the class we might as well fix these issues as well. Brock ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6983 ----------------------------------------------------------- On 2012-04-17 15:06:58, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2012-04-17 15:44:12, Brock Noland wrote:

        > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68

        > <https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68>

        >

        > This isn't your code but it looks like there are a few problems here.

        >

        > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown:

        >

        > java.lang.NullPointerException

        > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255)

        > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

        > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

        > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)

        > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        > at java.lang.Thread.run(Thread.java:662)

        >

        > 2) The NPE is not being logged.

        > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself.

        > 4) Calling first supervise() and then start() creates two threads monitoring this service.

        Hari Shreedharan wrote:

        Hi Brock,

        None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation:

        1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop.

        3) Agree this is an issue, but because of the explanation to (1 and 2) above.

        4) Agreed, we should not be doing this. I will update the relevant code.

        Brock Noland wrote:

        I did not mean to imply this was caused by your change, but if we are "fixing" the class we might as well fix these issues as well.

        Yep, updated the jira.

        • Hari

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review6983
        -----------------------------------------------------------

        On 2012-04-17 15:06:58, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-17 15:06:58)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - On 2012-04-17 15:44:12, Brock Noland wrote: > flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java, line 68 > < https://reviews.apache.org/r/4753/diff/3/?file=102382#file102382line68 > > > This isn't your code but it looks like there are a few problems here. > > 1) monitorService on MonitorRunnable is not set so an NPE (which is not being logged is thrown: > > java.lang.NullPointerException > at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:255) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) > at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > > 2) The NPE is not being logged. > 3) If the NPE did not occur we'd have a bigger issue as scheduleAtFixedRate is called here but the runnable reschedules itself. > 4) Calling first supervise() and then start() creates two threads monitoring this service. Hari Shreedharan wrote: Hi Brock, None of the issues are caused by this change. I can look into these and fix them, it seems like it is because of the way the start method is written. Here is a brief explanation: 1 and 2) This is caused by the start() function being called without supervise being called. The reason we have not been seeing this seems to be that when start is called, the different components have not been added. So it never goes into the loop. 3) Agree this is an issue, but because of the explanation to (1 and 2) above. 4) Agreed, we should not be doing this. I will update the relevant code. Brock Noland wrote: I did not mean to imply this was caused by your change, but if we are "fixing" the class we might as well fix these issues as well. Yep, updated the jira. Hari ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review6983 ----------------------------------------------------------- On 2012-04-17 15:06:58, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-17 15:06:58) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/
        -----------------------------------------------------------

        (Updated 2012-04-18 08:14:42.844120)

        Review request for Flume.

        Changes
        -------

        Using ScheduledThreadPoolExecutor now. No longer rescheduling from the monitorRunnable. When an unsupervise happens, the monitor is removed from the thread pool queue.

        Summary
        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.
        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs (updated)


        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing
        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-18 08:14:42.844120) Review request for Flume. Changes ------- Using ScheduledThreadPoolExecutor now. No longer rescheduling from the monitorRunnable. When an unsupervise happens, the monitor is removed from the thread pool queue. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs (updated) flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4753/#review7002
        -----------------------------------------------------------

        Ship it!

        Hari, thank you very much! Please attach the patch to the JIRA.

        • Brock

        On 2012-04-18 08:14:42, Hari Shreedharan wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4753/

        -----------------------------------------------------------

        (Updated 2012-04-18 08:14:42)

        Review request for Flume.

        Summary

        -------

        All components are started by the same executor, which is different from the executor service that checks the status of the components.

        This addresses bug FLUME-1124.

        https://issues.apache.org/jira/browse/FLUME-1124

        Diffs

        -----

        flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6

        Diff: https://reviews.apache.org/r/4753/diff

        Testing

        -------

        Thanks,

        Hari

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/#review7002 ----------------------------------------------------------- Ship it! Hari, thank you very much! Please attach the patch to the JIRA. Brock On 2012-04-18 08:14:42, Hari Shreedharan wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4753/ ----------------------------------------------------------- (Updated 2012-04-18 08:14:42) Review request for Flume. Summary ------- All components are started by the same executor, which is different from the executor service that checks the status of the components. This addresses bug FLUME-1124 . https://issues.apache.org/jira/browse/FLUME-1124 Diffs ----- flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java a7407e6 Diff: https://reviews.apache.org/r/4753/diff Testing ------- Thanks, Hari
        Hari Shreedharan made changes -
        Attachment FLUME-1124-3.patch [ 12523210 ]
        Hari Shreedharan made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Brock Noland added a comment -

        Committed in 1327622. Thank you for your patch Hari!

        Show
        Brock Noland added a comment - Committed in 1327622. Thank you for your patch Hari!
        Brock Noland made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Hudson added a comment -

        Integrated in flume-trunk #175 (See https://builds.apache.org/job/flume-trunk/175/)
        FLUME-1124: Lifecycle supervisor can cause thread contention, sometimes causing components to not startup.

        (Hari Shreedharan via Brock Noland) (Revision 1327622)

        Result = SUCCESS
        brock : http://svn.apache.org/viewvc/?view=rev&rev=1327622
        Files :

        • /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java
        Show
        Hudson added a comment - Integrated in flume-trunk #175 (See https://builds.apache.org/job/flume-trunk/175/ ) FLUME-1124 : Lifecycle supervisor can cause thread contention, sometimes causing components to not startup. (Hari Shreedharan via Brock Noland) (Revision 1327622) Result = SUCCESS brock : http://svn.apache.org/viewvc/?view=rev&rev=1327622 Files : /incubator/flume/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Patch Available Patch Available Open Open
        11h 56m 1 Hari Shreedharan 17/Apr/12 20:43
        Open Open Patch Available Patch Available
        1d 4h 21m 2 Hari Shreedharan 18/Apr/12 17:43
        Patch Available Patch Available Resolved Resolved
        2h 3m 1 Brock Noland 18/Apr/12 19:47

          People

          • Assignee:
            Hari Shreedharan
            Reporter:
            Hari Shreedharan
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development