Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: master (9.0)
After spending a bit of time away from SolrCloud, following a period of being deeply involved in trying to stabilize it and its tests, I came back in 2018 and went deep into the system with the Starburst upgrade.
What I found surprised me, though I guess it should not have. The system is slow, often silly, super buggy, and not good at connection reuse, thread safety, efficient ZooKeeper communication, or efficient startup and shutdown.
Often, the things we do to make tests pass make things worse, because you can't do things reasonably without some major code work, and so we fight for test passes, not correctness.
Twice now, I've seen the system in the shape it was supposed to take. FAST. Not bug free, but 100X more solid at least and much, much, much, much faster.
The current system is sick and actually getting worse under its own weight as more is shoveled on top. Even compared to 1.5 years ago, the problems are worse, not better. Tests will never pass. Yes, our tests were in pretty bad shape. But you can put them in the best shape possible and it won't matter. The system will still fail tests.
Sadly, I'm smart enough to know what has to be done, but not smart enough to keep my work around after addressing most of the problems twice.
Nonetheless, it's time to fix SolrCloud. It's not supposed to be this way. I've twice spent a week or two in a state with super fast SolrCloud. Super fast build system. Development is actually fun. You actually have a chance. I'm talking tests you have never seen take under 45-60 seconds taking 5. Consistently. A different world.
I spent a lot of time after Starburst making tests pass for me. Then a lot of time on a better build system that can help us improve development and good practices around the project. And then a lot of time making tests faster. These are important steps, but little itty bitty baby steps without addressing the core rot that is growing. We don't find a problem, fully understand what is up, and craft a careful solution. We find something that we can toss into the Grand Canyon, listen to it bounce around for a while, and if nobody screams, we move on to the next thing. That's not necessarily anyone's choice; there is little else you can do until the system is fixed. When that happens, we can start making smart changes instead of just shoving around the mess.
Twice I have made the current system fast. What happens first? Nothing works. The system doesn't know how to be fast. It doesn't have the thread safety or proper logic to be fast. And that is not a place I want to be.