Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
None
-
Degradation - Resource Management
-
Normal
-
Normal
-
Adhoc Test
-
All
-
None
-
Description
While working on the repair quality testing in CASSANDRA-16245, it appeared that the node coordinating repairs on a 20GB per node dataset was generating more than 2G of log with a total duration for the incremental repair scenarios of ~2h40m:
https://app.circleci.com/pipelines/github/riptano/cassandra-rtest/37/workflows/6a7a41c8-0fca-4080-b37e-3b38998b3fab/jobs/49/steps
]
The logs showed a lot of messages from the MerkleTree class at high pace:
DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#<TreeRange (-6738651767434294419,-6738622715859497972] depth=11>, #<TreeRange (-6738622715859497972,-6738593664284701525] depth=11>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (9) Inconsistent digest on right sub-range #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10>: [#<OffHeapInner -6738564612709905078 hash=[b8efd3d684474276f316b1bc9f23b463cda4f8d620a4b5cf4d2c7dbb101bbe85] children=[#<OffHeapLeaf [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]> #<OffHeapLeaf [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>]>, #<OffHeapInner -6738564612709905078 hash=[95334327a0b50b7d3c51d6588a5f3d57701e0b978f6ab28e85cda3cb5a094eb5] children=[#<OffHeapLeaf [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]> #<OffHeapLeaf [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:262 - (10) Hashing sub-ranges [#<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>, #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>] for #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10> divided by midpoint -6738564612709905078 DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:272 - (10) Inconsistent digest on left sub-range #<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>: [#<OffHeapLeaf [fff431a30da07558c5106897da9f3fc3bcc598fb59e67c6959f4e70b6d3a7722]>, #<OffHeapLeaf [0d2c5c21f6b04b098f71f43d0f3f3a5a823b372eb850261774847c456629bedd]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (10) Inconsistent digest on right sub-range #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>: [#<OffHeapLeaf [471be27589e7372e3606d92b45bc8ba07161602d7942c9a614d89ab07d21c9a7]>, #<OffHeapLeaf [981f1f0656054074b32022658560070df2253cb9373a9499f149df8e3c20f068]>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (10) Fully inconsistent range [#<TreeRange (-6738593664284701525,-6738564612709905078] depth=11>, #<TreeRange (-6738564612709905078,-6738535561135108630] depth=11>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:308 - (9) Fully inconsistent range [#<TreeRange (-6738651767434294419,-6738593664284701525] depth=10>, #<TreeRange (-6738593664284701525,-6738535561135108630] depth=10>] DEBUG [RepairJobTask:4] 2021-01-21 16:15:29,631 MerkleTree.java:292 - (8) Inconsistent digest on right sub-range #<TreeRange (-6738535561135108630,-6738419354835922841] depth=9>: [#<OffHeapInner -6738477457985515736 hash=[806ede1a35783bf5fafe8b8ccefe4d3ff48e8f0e1314f8a9ce4b23f13fed4bf9] children=[#<OffHeapInner -6738506509560312183 hash=[e6d133afbd8041266f8a1cfe456ff07c9e7debe8ff54279b579b252f2af78b6b] children=[#]> #<OffHeapInner -6738448406410719289 hash=[66bfedb588f87ad3957497728b91bd436af364e6ec40df3299d006de151ac092] children=[#]>]>, #<OffHeapInner -6738477457985515736 hash=[bf128431ddf7ad72417ce853f90abd0bc7a2a38088d5fbec864c02c516ed6458] children=[#<OffHeapInner -6738506509560312183 hash=[12a71b62ceb096369c4797f56edd91a16dd9803348b4f4cfe1454027d7dae3d2] children=[#]> #<OffHeapInner -6738448406410719289 hash=[adb59f5313473b44dd3b7fa697d72caaaa7b23b3c0610f23670942e2c137878a] children=[#]>]>]
When disabling debug logging, the duration dropped to ~2h05m with decent log sizes:
There's apparently too much logging for each inconsistency found in the Merkle tree comparisons and we should move this to TRACE level if we still want to allow debug logging to be turned on by default.
I'll prepare a patch for the MerkleTree class and run the repair testing scenarios again to verify their duration.