Details

      Description

      One of our nodes generates a lot of Cassandra logs (in the 10 MB/s range) and its CPU usage has increased (by a factor of 2-3).
      This was most probably triggered by a "nodetool snapshot" while a cleanup was already running on this node.

      An example of those logs:
      2015-12-08 09:15:17,794 INFO [ValidationExecutor:689]ColumnFamilyStore.java:1923 Spinning trying to capture released readers [...]
      2015-12-08 09:15:17,794 INFO [ValidationExecutor:689]ColumnFamilyStore.java:1924 Spinning trying to capture all readers [...]
      2015-12-08 09:15:17,795 INFO [ValidationExecutor:689]ColumnFamilyStore.java:1923 Spinning trying to capture released readers [...]
      2015-12-08 09:15:17,795 INFO [ValidationExecutor:689]ColumnFamilyStore.java:1924 Spinning trying to capture all readers [...]
      (I removed SSTableReader information because it's rather long... I can share it privately if needed)
      Note that the timestamps have not been altered (there is only 1 ms between log lines).

      Cassandra should not generate such a gigantic amount of logs.

      This is probably linked to: https://issues.apache.org/jira/browse/CASSANDRA-9637

        Activity

        krummas Marcus Eriksson added a comment -

        I couldn't reproduce this with just a snapshot, but triggering a repair while a cleanup was running did reproduce it; working on that.

        krummas Marcus Eriksson added a comment - edited

        The problem is that we use the sstables marked as 'compacting' to create the canonical view (to get the original sstable instances). When running cleanup, scrub, upgradesstables and anticompaction we mark each sstable as compacted as we finish it, but we keep it marked as compacting until the entire operation is done. This means we have sstables which are marked as compacted in the CANONICAL_SSTABLES, we cannot reference them, and that makes us log the 'Spinning ...' message.
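
        A minimal sketch of that interaction, using hypothetical class and field names rather than Cassandra's actual ones: each sstable is flagged compacted as soon as its rewrite finishes, but the whole set stays marked compacting until the operation ends, so a concurrent reader building the canonical view keeps finding sstables it can no longer reference.

        {code:java}
import java.util.HashSet;
import java.util.Set;

class CleanupSketch
{
    static class SSTable
    {
        volatile boolean compacted;              // set per sstable as soon as its rewrite finishes
        boolean tryRef() { return !compacted; }  // a compacted sstable can no longer be referenced
    }

    final Set<SSTable> compacting = new HashSet<>(); // released only when the whole operation ends

    void cleanup(Set<SSTable> sstables)
    {
        compacting.addAll(sstables); // the whole set is marked compacting up front
        try
        {
            for (SSTable s : sstables)
            {
                rewriteWithoutLocalRanges(s);
                s.compacted = true; // marked compacted immediately...
                // ...but s stays in 'compacting', so a concurrent validation or snapshot
                // building the canonical view still picks it up, fails tryRef(),
                // and spins logging "Spinning trying to capture released readers".
            }
        }
        finally
        {
            compacting.removeAll(sstables); // only now are the sstables released
        }
    }

    private void rewriteWithoutLocalRanges(SSTable s) { /* elided */ }
}
        {code}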

        There is also an issue with the log rate limiting: it was meant to log only every 100 ms, but it logs on every iteration.
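
        A sketch of the intended rate limiting, again with illustrative names only: remember when the message was last emitted and stay silent unless at least 100 ms have passed.

        {code:java}
class SpinLogLimiter
{
    private static final long LOG_INTERVAL_NANOS = 100_000_000L; // 100 ms
    private long lastLoggedAt = 0;

    void maybeLog(String message)
    {
        long now = System.nanoTime();
        if (now - lastLoggedAt >= LOG_INTERVAL_NANOS)
        {
            lastLoggedAt = now;
            System.out.println(message); // stand-in for logger.info(...)
        }
        // otherwise keep spinning without logging
    }
}
        {code}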

        Patch that unmarks compacting once each sstable is finished and fixes the rate limiting:

        https://github.com/krummas/cassandra/commits/marcuse/10829
        and tests:
        https://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-10829-dtest/
        https://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-10829-testall/

        It seems only the rate limiting fix applies to 2.2+ (edit: the other issues are already fixed by LifecycleTransaction).

        frousseau Fabien Rousseau added a comment -

        If I understand correctly:

        • the log rate limit is now once every 100 ms
        • the rate limiting is already in 2.2+ (not related to this ticket)
        • compacting sstables are now released sooner (as soon as each has been compacted, rather than at the end of the whole operation)

        So, thanks a lot for the fix.

        krummas Marcus Eriksson added a comment -

        Correct, except:

        the rate limiting is already in 2.2+ (not related to this ticket)

        No, that is the only thing that is needed on 2.2+.

        carlyeks Carl Yeksigian added a comment -

        We need to ensure we only unmarkCompacting the sstables which we haven't already unmarked; otherwise a different compaction operation could already be running on the previously unmarked sstables, and we would unmark them again after this operation is finished.
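
        A minimal sketch of that idea, with hypothetical names (not the actual patch): track which sstables the operation has already released and, at the end, release only the remainder, so an sstable that another operation may have re-marked as compacting is never touched twice.

        {code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class UnmarkOnceSketch
{
    static class SSTable {}

    interface Tracker
    {
        void unmarkCompacting(Collection<SSTable> sstables);
    }

    void runOperation(Set<SSTable> sstables, Tracker tracker)
    {
        Set<SSTable> finished = new HashSet<>();
        try
        {
            for (SSTable s : sstables)
            {
                process(s);
                tracker.unmarkCompacting(Collections.singleton(s)); // release it as soon as it is done
                finished.add(s);                                    // remember it has already been released
            }
        }
        finally
        {
            Set<SSTable> remaining = new HashSet<>(sstables);
            remaining.removeAll(finished);       // skip sstables we already released;
            tracker.unmarkCompacting(remaining); // another operation may own them by now
        }
    }

    private void process(SSTable s) { /* elided */ }
}
        {code}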

        krummas Marcus Eriksson added a comment -

        Good point, pushed a new commit on top of the old one and triggered new builds on cassci.

        carlyeks Carl Yeksigian added a comment -

        Looks good, but the test failures suggest we need to use a ConcurrentHashSet for {{finished}}, since we're getting {{IllegalStateException}}s.

        krummas Marcus Eriksson added a comment -

        Right, I forgot to trigger a new build for testall; did that now (no code changes).

        carlyeks Carl Yeksigian added a comment -

        I think we need to make {{finished}} thread-safe, since we're submitting those tasks to the compaction executor, which may be multi-threaded.
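
        One standard way to get a thread-safe set in plain JDK terms (a sketch, not necessarily what the patch uses):

        {code:java}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class FinishedSetSketch
{
    // A concurrent set view backed by ConcurrentHashMap: safe to add to from
    // tasks submitted to a multi-threaded compaction executor.
    final Set<Object> finished =
        Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());
}
        {code}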

        krummas Marcus Eriksson added a comment -

        pushed a fix and reran tests

        carlyeks Carl Yeksigian added a comment -

        +1

        krummas Marcus Eriksson added a comment -

        Committed, with only the rate limiting fix merged up to 2.2/3.0/3.3/trunk


          People

          • Assignee:
            krummas Marcus Eriksson
            Reporter:
            frousseau Fabien Rousseau
            Reviewer:
            Carl Yeksigian
          • Votes:
            0
            Watchers:
            3
