Details
-
Improvement
-
Status: Resolved
-
High
-
Resolution: Fixed
-
None
-
Operability
-
Normal
-
Java8, Linux
-
None
-
Description
While testing 4.0-rc1 on a 12 i3en.2xlarge x 2 region (AWS us-east-1 and eu-west-1) cluster I attempted to run nodetool repair while the cluster was taking moderate read/write load.
The first time it worked as expected, but when I ran an incremental run the second time multiple nodes got stuck trying to compact the unrepaired sstables. They are now spinning with:
$ nt compactionstats pending tasks: 827 - acceptance_josephl.acceptance_josephl_cass4: 827 $ nt tpstats Pool Name Active Pending Completed Blocked All time blocked RequestResponseStage 0 0 422359133 0 0 MutationStage 0 0 164540628 0 0 ReadStage 0 0 198857844 0 0 CompactionExecutor 0 0 60782 0 0 $ tail system.log DEBUG [CompactionExecutor:684] 2021-03-31 15:13:59,902 LeveledManifest.java:292 - L0 is too far behind, performing size-tiering there first DEBUG [CompactionExecutor:684] 2021-03-31 15:13:59,908 LeveledManifest.java:292 - L0 is too far behind, performing size-tiering there first WARN [CompactionExecutor:684] 2021-03-31 15:13:59,912 LeveledCompactionStrategy.java:154 - Could not acquire references for compacting SSTables [BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11eba fb40b81cbd6fb3d/na-4826-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4872-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acc eptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4849-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4874-big-Data.db'), BigTableReader(path='/mnt/dat a/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4841-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4897-big-D ata.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4924-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e79 0917c11ebafb40b81cbd6fb3d/na-4837-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4926-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_j osephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4729-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4723-big-Data.db'), BigTableReader(path ='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4875-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na- 4922-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4920-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cas s4-6144e790917c11ebafb40b81cbd6fb3d/na-4869-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4823-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/ac ceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4846-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4873-big-Data.db'), BigTableR eader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4840-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cb d6fb3d/na-4833-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4829-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_j osephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4726-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4923-big-Data.db'), BigTableReader(path='/mnt/data/cassand ra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4925-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4905-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4876-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11eb afb40b81cbd6fb3d/na-4901-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4732-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/ac ceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4909-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4915-big-Data.db'), BigTableReader(path='/mnt/da ta/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4921-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4860-big- Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4693-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e7 90917c11ebafb40b81cbd6fb3d/na-4694-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4692-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_ josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4691-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4696-big-Data.db'), BigTableReader(pat h='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4697-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na -4700-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4698-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_ca ss4-6144e790917c11ebafb40b81cbd6fb3d/na-4688-big-Data.db'), BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4689-big-Data.db')] which is not a problem per se,unless it happens frequently, in which case it must be reported. Will retry later.
I've attached some starting breadcrumbs. I believe the issue is a potential race in marking sstables for compaction getting null back from tryModify which I think can only happen under a small number of circumstances. From the initial investigation it does appear that only the unrepaired products get into this state.
I have a heap dump containing the View state but it contains potentially sensitive infrastructure details so if you're debugging just message me in slack and I can send it to you directly.
The following mitigation appears to unstick the nodes via a forced full compaction:
nodetool stop COMPACTION; nodetool compact <ks> <table>
I'm not confident in this mitigation though.
Attachments
Attachments
Issue Links
- fixes
-
CASSANDRA-16638 compactions/repairs hangs (backport CASSANDRA-16552)
- Resolved