[IGNITE-7165] Re-balancing is cancelled if client node joins - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.7
Component/s: None
Labels:
- rebalance

Description

Re-balancing is cancelled whenever a client node joins. Re-balancing can take hours, and each time a client node joins it starts over from the beginning:

[15:10:05,700][INFO]disco-event-worker-#61%statement_grid%[GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, /172.31.16.213:0], discPort=0, order=36, intOrder=24, lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, isClient=true]
[15:10:05,701][INFO]disco-event-worker-#61%statement_grid%[GridDiscoveryManager] Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
[15:10:05,702][INFO]exchange-worker-#62%statement_grid%[time] Started exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, customEvt=null, allowMerge=true]
[15:10:05,702][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], err=null]
[15:10:05,702][INFO]exchange-worker-#62%statement_grid%[time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], crd=false]
[15:10:05,703][INFO]exchange-worker-#62%statement_grid%[GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=36, minorTopVer=0], evt=NODE_JOINED, node=979cf868-1c37-424a-9ad1-12db501f32ef]
[15:10:08,706][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=35, minorTopVer=0]]
[15:10:08,707][INFO]exchange-worker-#62%statement_grid%[GridCachePartitionExchangeManager] Rebalancing scheduled [order=[statementp]]
[15:10:08,707][INFO]exchange-worker-#62%statement_grid%[GridCachePartitionExchangeManager] Rebalancing started [top=null, evt=NODE_JOINED, node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
[15:10:08,707][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
[15:10:08,707][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
[15:10:08,708][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
[15:10:08,708][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
[15:10:08,708][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]
[15:10:08,708][INFO]exchange-worker-#62%statement_grid%[GridDhtPartitionDemander] Starting rebalancing [mode=ASYNC, fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18, topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], updateSeq=-1754630006]

As a result, in clusters with a large amount of data and frequent client join/leave events, a new server node will never receive its partitions.
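
To check whether a given node log shows the same pattern, the lines can be scanned for a client join followed by a rebalance cancellation. The following is a minimal sketch, not part of Ignite: the function name and the two regular expressions are assumptions derived only from the log lines quoted above, and other Ignite versions may format these messages differently. The pairing is a heuristic, since a cancellation that follows a client join in the log may still have another cause.

```python
import re

# Patterns derived from the log excerpt in this ticket (assumption: the
# message format is the one shown for ver=2.3.1).
JOIN_RE = re.compile(
    r"Added new node to topology: TcpDiscoveryNode \[id=([0-9a-f-]+),"
    r".*isClient=(true|false)\]"
)
CANCEL_RE = re.compile(
    r"Cancelled rebalancing from all nodes "
    r"\[topology=AffinityTopologyVersion \[topVer=(\d+), minorTopVer=\d+\]\]"
)


def find_client_triggered_cancellations(lines):
    """Return (client_node_id, cancelled_top_ver) pairs.

    A pair is reported when a 'Cancelled rebalancing' line is the first
    cancellation seen after a node join whose descriptor has isClient=true.
    """
    hits = []
    last_client_join = None  # id of the most recent joined node, if it was a client
    for line in lines:
        join = JOIN_RE.search(line)
        if join:
            node_id, is_client = join.groups()
            # A server join resets the state: its cancellation is expected.
            last_client_join = node_id if is_client == "true" else None
            continue
        cancel = CANCEL_RE.search(line)
        if cancel and last_client_join is not None:
            hits.append((last_client_join, int(cancel.group(1))))
            last_client_join = None
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_rebalance.py < node.log
    for node_id, top_ver in find_client_triggered_cancellations(sys.stdin):
        print(f"client {node_id} joined, rebalancing for topVer={top_ver} cancelled")
```

Run against the excerpt above, this reports client node 979cf868-1c37-424a-9ad1-12db501f32ef together with the cancelled topology version 35.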

Attachments

- GridCacheRebalancingCancelTestNoReproduce.java (12 kB, 16/Aug/18 15:01, Dmitry Sherstobitov)
- node-2-jstack.log (142 kB, 15/Aug/18 11:36, Dmitry Sherstobitov)
- node-NO_REBALANCE-7165.log (133 kB, 14/Aug/18 09:56, Dmitry Sherstobitov)

Issue Links

is related to

- IGNITE-10374 Node doesn't own rebalanced partitions on rebalancing finished (Resolved)

relates to

- IGNITE-11803 Re-balancing is cancelled if client node joins, reinvestigate. (Open)
- IGNITE-9309 LocalNodeMovingPartitionsCount metrics may calculates incorrect due to processFullPartitionUpdate (Closed)
- IGNITE-11187 Additional documentation for re-balancing is canceled if client node joins. (Open)

links to

- GitHub Pull Request #3264
- GitHub Pull Request #4097
- GitHub Pull Request #4442
- IGNT-CR-699
