Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-16406

Debug logging affects repair performance

    XMLWordPrintableJSON

Details

    Description

      While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB per node dataset was generating more than 2G of log with a total duration for the incremental repair scenarios of ~2h40m:
      https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
      ]
       
      The logs showed a lot of messages from the MerkleTree class at high pace:

      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#<TreeRange (-6738651767434294419,-6738622715859497972] depth=11>, #<TreeRange (-6738622715859497972,-6738593664284701525] depth=11>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) Inconsistent digest on right sub-range #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10>: [#<OffHeapInner -6738564612709905078 hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] children=[#<OffHeapLeaf [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> #<OffHeapLeaf [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, #<OffHeapInner -6738564612709905078 hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] children=[#<OffHeapLeaf [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> #<OffHeapLeaf [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) Hashing sub-ranges [#<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>, #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>] for #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10> divided by midpoint -6738564612709905078
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) Inconsistent digest on left sub-range #<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>: [#<OffHeapLeaf [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, #<OffHeapLeaf [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) Inconsistent digest on right sub-range #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>: [#<OffHeapLeaf [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, #<OffHeapLeaf [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>, #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully inconsistent range [#<TreeRange (-6738651767434294419,-6738593664284701525] depth=10>, #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10>]
      DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) Inconsistent digest on right sub-range #<TreeRange (-6738535561135108630,-6738419354835922841] depth=9>: [#<OffHeapInner -6738477457985515736 hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] children=[#<OffHeapInner -6738506509560312183 hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] children=[#]> #<OffHeapInner -6738448406410719289 hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] children=[#]>]>, #<OffHeapInner -6738477457985515736 hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] children=[#<OffHeapInner -6738506509560312183 hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] children=[#]> #<OffHeapInner -6738448406410719289 hash=[adb59f5313473b44dd3b7fa697d72caaaa7b23b3c0610f23670942e2c137878a] children=[#]>]>]

      When disabling debug logging, the duration dropped to ~2h05m with decent log sizes:

      https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/38/workflows/c91e6ceb-438b-4f00-b38a-d670f9afb4c3/jobs/51

      There's apparently too much logging for each inconsistency found in the Merkle tree comparisons and we should move this to TRACE level if we still want to allow debug logging to be turned on by default.

      I'll prepare a patch for the MerkleTree class and run the repair testing scenarios again to verify their duration.

      Attachments

        1. without_debug_logging.png
          83 kB
          Alexander Dejanovski
        2. with_debug_logging.png
          89 kB
          Alexander Dejanovski
        3. CASSANDRA-16406.png
          146 kB
          Alexander Dejanovski
        4. 16406-2-trunk.txt
          8 kB
          Alexander Dejanovski

        Activity

          People

            adejanovski Alexander Dejanovski
            adejanovski Alexander Dejanovski
            Alexander Dejanovski
            Marcus Eriksson
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 10m
                10m