[SOLR-3685] Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.0-ALPHA
Fix Version/s: 4.0, 6.0
Component/s: replication (java), SolrCloud
Labels: None
Environment:

Debian GNU/Linux Squeeze 64bit
Solr 5.0-SNAPSHOT 1365667M - markus - 2012-07-25 19:09:43

Description

There's a serious problem with restarting nodes: old or unused index directories are not cleaned up, replication starts unexpectedly, and Java is killed by the OS due to excessive memory allocation. Since ~~SOLR-1781~~ was fixed, index directories get cleaned up when a node is restarted cleanly. However, old or unused index directories still pile up if Solr crashes or is killed by the OS, which is what happens here.

We have a six-node 64-bit Linux test cluster with each node having two shards. There's 512MB of RAM available and no swap. Each index is roughly 27MB, so about 50MB per node; this fits easily and works fine. However, if a node is restarted, Solr will consistently crash because it immediately eats up all RAM. If swap is enabled, Solr will eat an additional few hundred MB right after startup.

This cannot be solved by restarting Solr; it will just crash again and leave index directories in place until the disk is full. The only way I can restart a node safely is to delete the index directories and have it replicate from another node. If I then restart the node, it will crash almost consistently.
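
To make the manual cleanup less error-prone, a small script can first report which index directories look unused. The following is only a minimal sketch under stated assumptions: it assumes the core's data directory contains the default `index` directory and/or timestamped `index.*` directories, and that an `index.properties` file, when present, names the active one via an `index=` entry. The function name `find_stale_index_dirs` is hypothetical, and the script only lists candidates; it deletes nothing.

```python
import os


def find_stale_index_dirs(data_dir):
    """List index* directories in data_dir that are not the active index.

    Assumption: index.properties (if present) holds an 'index=<dirname>'
    entry naming the active directory; otherwise 'index' is taken as active.
    """
    active = "index"  # assumed default when no index.properties exists
    props = os.path.join(data_dir, "index.properties")
    if os.path.isfile(props):
        with open(props, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("index="):
                    active = line.split("=", 1)[1].strip()

    stale = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        # Only directories count; this skips index.properties itself.
        if os.path.isdir(path) and name.startswith("index") and name != active:
            stale.append(name)
    return stale
```

Calling `find_stale_index_dirs` on a core's data directory while Solr is stopped returns the directory names to review before removing anything by hand.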

I'll attach a log of one of the nodes.

Attachments

- pmap.log, 17/Aug/12 10:25, 49 kB, Markus Jelsma
- oom-killer.log, 16/Aug/12 18:31, 14 kB, Markus Jelsma
- info.log, 27/Jul/12 10:33, 438 kB, Markus Jelsma

People

Assignee: Yonik Seeley

Reporter: Markus Jelsma

Votes: 0

Watchers: 5

Dates

Created: 27/Jul/12 10:28

Updated: 09/May/16 18:47

Resolved: 22/Sep/12 13:12