CouchDB
  1. CouchDB
  2. COUCHDB-761

Timeouts in couch_log are masked, crashes callers

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.10.1, 0.10.2, 0.11
    • Fix Version/s: 0.11.1, 1.0, 1.0.3
    • Component/s: Database Core
    • Labels:
      None

      Description

      Several users have reported seeing crash reports stemming from a function_clause match on handle_info in various gen_servers. The offending message looks like

      {#Ref<>, <integer>}

      .

      After months of banter and sleuthing, I determined that the likely cause was a late reply to a gen_server:call that timed out, with the #Ref being the tag on the response. After it came up again today in IRC, kocolosk quickly discovered that the problem appears to be in couch_log.erl.

      The logging macros (?LOG_) call couch_log/_on which calls get_level_integer/0. When this call times out the timeout is eaten and a late reply arrives to the calling process later, triggering the crash.

      Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.

      1. improved-sync-logging.patch
        2 kB
        Randall Leeds
      2. improved-sync-logging-v2.patch
        5 kB
        Randall Leeds

        Activity

        Hide
        Adam Kocoloski added a comment -

        Thanks for filing this ticket, Randall. I'm bumping it to Blocker.

        I'd discourage fully async logging. I've tried it in the past; it's far too easy to overwhelm the error_logger process with debug messages. Eventually the error_logger mailbox exhausts the available memory and the VM dies a horrible death.

        Infinite timeouts are a viable option in my opinion. Another option is to spawn a function to log the message:

        Pros:

        • doesn't block the original process

        Cons:

        • spends extra CPU cycles copying data to new process heap
        • potential to exhaust process limit

        Personally, I don't think it's worth the risk. Here's what I'd propose:

        1) Reimplement debug_on(), info_on() to use ets table lookups. This is pretty easy because the log level is already stored in couch_config.

        2) If the log level is enabled, use an infinite timeout to log the message.

        This way we can suppress the LOG_DEBUG messages without slowing down request processing by more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger mailbox small.

        Show
        Adam Kocoloski added a comment - Thanks for filing this ticket, Randall. I'm bumping it to Blocker. I'd discourage fully async logging. I've tried it in the past; it's far too easy to overwhelm the error_logger process with debug messages. Eventually the error_logger mailbox exhausts the available memory and the VM dies a horrible death. Infinite timeouts are a viable option in my opinion. Another option is to spawn a function to log the message: Pros: doesn't block the original process Cons: spends extra CPU cycles copying data to new process heap potential to exhaust process limit Personally, I don't think it's worth the risk. Here's what I'd propose: 1) Reimplement debug_on(), info_on() to use ets table lookups. This is pretty easy because the log level is already stored in couch_config. 2) If the log level is enabled, use an infinite timeout to log the message. This way we can suppress the LOG_DEBUG messages without slowing down request processing by more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger mailbox small.
        Hide
        Adam Kocoloski added a comment -

        To clarify, the way to do an infinite timeout in step 2) is to use gen_event:sync_notify instead of error_logger:info_report.

        Show
        Adam Kocoloski added a comment - To clarify, the way to do an infinite timeout in step 2) is to use gen_event:sync_notify instead of error_logger:info_report.
        Hide
        Randall Leeds added a comment -

        Here's my first run at a patch.

        Show
        Randall Leeds added a comment - Here's my first run at a patch.
        Hide
        Adam Kocoloski added a comment -

        I haven't tested it, but the patch looks correct.

        Since we're moving away from the *_report functions we could also simplify the messages a bit. For example, instead of

        gen_event:sync_notify(error_logger, {info_report, group_leader(), {self(), couch_info,

        {Format, Args}

        }});

        we could do

        gen_event:sync_notify(error_logger,

        {couch, info, self(), Format, Args}

        );

        or, since we're doing sync logging, we might even partially format the message in the sending process and make it a little simpler to copy:

        gen_event:sync_notify(error_logger,

        {couch, info, self(), io_lib:format(Format, Args)}

        );

        Randall, what do you think?

        Show
        Adam Kocoloski added a comment - I haven't tested it, but the patch looks correct. Since we're moving away from the *_report functions we could also simplify the messages a bit. For example, instead of gen_event:sync_notify(error_logger, {info_report, group_leader(), {self(), couch_info, {Format, Args} }}); we could do gen_event:sync_notify(error_logger, {couch, info, self(), Format, Args} ); or, since we're doing sync logging, we might even partially format the message in the sending process and make it a little simpler to copy: gen_event:sync_notify(error_logger, {couch, info, self(), io_lib:format(Format, Args)} ); Randall, what do you think?
        Hide
        Randall Leeds added a comment -

        I'm not sure that we can change the format of those messages unless we go directly to couch_log and then we have to make sure couch_log understands two message formats or we're gonna miss the sasl log messages.

        error_logger reads those messages to figure out what kind of report to send out to the registered handlers. I'd have to double check the code by I feel like it might ignore us if we send it something else.

        Show
        Randall Leeds added a comment - I'm not sure that we can change the format of those messages unless we go directly to couch_log and then we have to make sure couch_log understands two message formats or we're gonna miss the sasl log messages. error_logger reads those messages to figure out what kind of report to send out to the registered handlers. I'd have to double check the code by I feel like it might ignore us if we send it something else.
        Hide
        Adam Kocoloski added a comment -

        Oh, we definitely have to update the handle_event function in couch_log. For example, instead of

        handle_event({error_report, _, {Pid, couch_error,

        {Format, Args}

        }},

        {Fd, _LogLevel}=State) -> ...

        it would be

        handle_event({couch, error, Msg, {Fd, _LogLevel}

        =State) -> ...

        couch_log already has a separate clause for SASL error messages, so no worries there.

        error_logger won't ignore messages in this format; it'll forward them to the various handlers (including couch_log) just fine.

        Show
        Adam Kocoloski added a comment - Oh, we definitely have to update the handle_event function in couch_log. For example, instead of handle_event({error_report, _, {Pid, couch_error, {Format, Args} }}, {Fd, _LogLevel}=State) -> ... it would be handle_event({couch, error, Msg, {Fd, _LogLevel} =State) -> ... couch_log already has a separate clause for SASL error messages, so no worries there. error_logger won't ignore messages in this format; it'll forward them to the various handlers (including couch_log) just fine.
        Hide
        Randall Leeds added a comment -

        Here's the second version of the patch.

        Changes:
        1) I took Adam's advice and simplified the message format for messages sent by the ?LOG_* macros.
        2) I re-ordered the handle_event clauses. The first few handle the new log macro message format, the next few handle messages coming from sasl
        3) I took this opportunity to add a config direction log:include_sasl which defaults to true. When include_sasl is false the couch_log process will ignore sasl messages. I thought about making this dependent on log level somehow but decided I prefer to give people options and have sensible defaults.

        What this accomplishes:
        1) synchronous logging - log messages are never missed since we use sync_notify now instead of the notify call error_logger uses
        2) log macros for log levels that are disabled should have less performance impact. I don't know and haven't tested how an ets lookup compares to a gen_server call but at least responses can be fast even when the log server is busy formatting a long message
        3) eliminating the gen:call for couch_log:log_level_integer/0 means no more unexpected

        {#Ref<...>, LogLevel}

        messages arriving after a timeout due to a slow or busy log server process
        4) sasl logging can be disabled in production environments where the entire state of crashing gen_servers is probably waaay TMI when we probably ?LOG_ERROR a more useful message anyway

        Open for re-review, comments, questions.

        Show
        Randall Leeds added a comment - Here's the second version of the patch. Changes: 1) I took Adam's advice and simplified the message format for messages sent by the ?LOG_* macros. 2) I re-ordered the handle_event clauses. The first few handle the new log macro message format, the next few handle messages coming from sasl 3) I took this opportunity to add a config direction log:include_sasl which defaults to true. When include_sasl is false the couch_log process will ignore sasl messages. I thought about making this dependent on log level somehow but decided I prefer to give people options and have sensible defaults. What this accomplishes: 1) synchronous logging - log messages are never missed since we use sync_notify now instead of the notify call error_logger uses 2) log macros for log levels that are disabled should have less performance impact. I don't know and haven't tested how an ets lookup compares to a gen_server call but at least responses can be fast even when the log server is busy formatting a long message 3) eliminating the gen:call for couch_log:log_level_integer/0 means no more unexpected {#Ref<...>, LogLevel} messages arriving after a timeout due to a slow or busy log server process 4) sasl logging can be disabled in production environments where the entire state of crashing gen_servers is probably waaay TMI when we probably ?LOG_ERROR a more useful message anyway Open for re-review, comments, questions.
        Hide
        Randall Leeds added a comment -

        Had this in production for about a week now on a bunch of servers. Seems to fix the timeout problem mentioned above. I'd appreciate if some brave soul would apply this patch and give their server a beating to be sure everything looks okay. Some of the servers where this is running here are under crazy load and nothing seems broken, but I like second opinions, especially on things that could have broad performance implications under stress.

        Show
        Randall Leeds added a comment - Had this in production for about a week now on a bunch of servers. Seems to fix the timeout problem mentioned above. I'd appreciate if some brave soul would apply this patch and give their server a beating to be sure everything looks okay. Some of the servers where this is running here are under crazy load and nothing seems broken, but I like second opinions, especially on things that could have broad performance implications under stress.
        Hide
        Adam Kocoloski added a comment -

        I say we commit. We've been running a similar sync_notify logger at Cloudant for a long time now without issues.

        Show
        Adam Kocoloski added a comment - I say we commit. We've been running a similar sync_notify logger at Cloudant for a long time now without issues.
        Hide
        Adam Kocoloski added a comment -

        I needed one more patch for get_level_integer() to get make check running, since some of the tests call couch code that tries to log when couch_log is not running. I've inlined it below. I've committed on trunk and backported to 0.10.x. Waiting on 0.11.x because Jan has a monster fix for that branch in the works.

        diff --git a/src/couchdb/couch_log.erl b/src/couchdb/couch_log.erl
        index 5c8a5e5..2d62cbb 100644
        — a/src/couchdb/couch_log.erl
        +++ b/src/couchdb/couch_log.erl
        @@ -81,7 +81,11 @@ get_level() ->
        level_atom(get_level_integer()).

        get_level_integer() ->

        • ets:lookup_element(?MODULE, level, 2).
          + try
          + ets:lookup_element(?MODULE, level, 2)
          + catch error:badarg ->
          + ?LEVEL_ERROR
          + end.

        set_level_integer(Int) ->
        gen_event:call(error_logger, couch_log,

        {set_level_integer, Int}

        ).

        Show
        Adam Kocoloski added a comment - I needed one more patch for get_level_integer() to get make check running, since some of the tests call couch code that tries to log when couch_log is not running. I've inlined it below. I've committed on trunk and backported to 0.10.x. Waiting on 0.11.x because Jan has a monster fix for that branch in the works. diff --git a/src/couchdb/couch_log.erl b/src/couchdb/couch_log.erl index 5c8a5e5..2d62cbb 100644 — a/src/couchdb/couch_log.erl +++ b/src/couchdb/couch_log.erl @@ -81,7 +81,11 @@ get_level() -> level_atom(get_level_integer()). get_level_integer() -> ets:lookup_element(?MODULE, level, 2). + try + ets:lookup_element(?MODULE, level, 2) + catch error:badarg -> + ?LEVEL_ERROR + end. set_level_integer(Int) -> gen_event:call(error_logger, couch_log, {set_level_integer, Int} ).
        Hide
        Randall Leeds added a comment - - edited

        Since there's a new config option added by this patch, should it be added to the default.ini? Probably my bad for not including it originally.

        Show
        Randall Leeds added a comment - - edited Since there's a new config option added by this patch, should it be added to the default.ini? Probably my bad for not including it originally.
        Hide
        Adam Kocoloski added a comment -

        Good idea, Randall. I added the include_sasl to default.ini in trunk, but I don't think we necessarily need to backport it.

        Show
        Adam Kocoloski added a comment - Good idea, Randall. I added the include_sasl to default.ini in trunk, but I don't think we necessarily need to backport it.

          People

          • Assignee:
            Unassigned
            Reporter:
            Randall Leeds
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development