Description
cp /r zk:/ fails with "cannot create //whatever".
Attachments
Attachments
- SOLR-10108.patch
- 22 kB
- Erick Erickson
- SOLR-10108.patch
- 22 kB
- Erick Erickson
- SOLR-10108.patch
- 24 kB
- Erick Erickson
Issue Links
- relates to
-
SOLR-10387 zkTransfer normalizes destination path incorrectly if source is a windows directory
- Closed
Activity
Jan:
Thanks for looking. I'm trying to go the other way (which I should have specified, just tried on 6x since I have it handy):
bin/solr zk cp -r zk:/ ~/eoezk -z localhost:2181
Fails with: ERROR: Invalid path string "//configs" caused by empty node name specified 1@c07.de
where
bin/solr zk cp -r zk:/whatever ~/eoezk -z localhost:2181
works.
The context here was a client who'd accidentally removed the zoo_data directory on all ZKs but still had them running. We hit on the bright idea "hey, we can just dump all of the ZK data since the data is still available until we restart the ZK nodes" but ran into this when trying to copy stuff down.
So I worked out the issues with the path and got the two-way copy to work, but also noticed another issue. Since ZK nodes can have data whether leaf nodes or not, the current process is lossy since non-leaf nodes don't get their data restored.
This makes it impossible to backup the collection node and restore it since the collection can have a configset name as data. My take is that copying back and forth should restore intermediate node's data, do you (and others) concur?
My first-attempt PoC is to create a very special file name something like node_zookeeper_solr.data to put any information associated with non-leaf nodes when the data is not empty as a PoC. That feels like a hack though as there's the possibility of collisions. Hmmm, maybe
{generated_guid}.znode.solr.data? Still possibly a collision if someone, somehow managed to have a znode with a GUID followed by .znode.solr.data I suppose, but that's seems unlikely enough that I'm not willing to worry about it. How about "erick.erickson.was.here.data"? Maybe not.
WDYT
Well, it works for me. If someone with a Windows setup could try the following I'd be grateful:
Set up a SolrCloud instance, just enough to get it running and see the cloud graph.
> 'bin/solr zk cp -r zk:/ file:/some/dir -z localhost:2181'
> Shut down Solr
> Re-initialize ZK (wipe all the data)
> 'bin/solr zk cp -r file:/some/dir/ zk:/ -z localhost:2181'
> start Solr again
If I'm lucky, that'll start Solr up seamlessly. The idea is that the above should save the entire ZK state to a local drive and then restore it. We had a case where the Solr nodes were still up but all the ZK data had been lost and this would have been simpler than what we had to do.
This is something of a side-effect of making the recursive copy just work to/from the root of ZK. There are problems with recursive copy in the first place when it's not from some node below the root without this patch.
This patch also handles the fact that non-leaf nodes in Solr can have data and properly saves/restores that data.
Still hoping someone will give this a whirl on Windows before committing... I'd like to get this in Solr 6.5 if/when.
Otherwise I'll commit it this week sometime.
So I took it for a spin. I started Solr master with patch applied i cloud mode, then created a collection.
Then I copied all content of internal zk:/ (9983) to my disk /tmp/zoo. That worked.
Then I stopped Solr, wiped zoo_data
I then started a local ZK and copied everything onto zk:/solr (another root)
Then started Solr with bin/solr start -z localhost:2181/solr/zoo. BINGO
So your patch seems to work. But I could not figure out how to copy all files from my /tmp/zoo folder into zk root. Specifying solr zk cp /tmp/zoo/* zk:/ did not work:
Could not complete the zk operation for reason: Path /tmp/zoo/* does not exist
Jan:
Thanks! try
solr zk cp /tmp/zoo zk:/ or solr zk cp /tmp/zoo/ zk:/
I put a couple of examples in the help section, did I mess those up?
to copy to local: 'bin/solr zk cp -r zk:/ /some/dir -z localhost:2181'
to restore to ZK: 'bin/solr zk cp -r /some/dir/ zk:/ -z localhost:2181
I struggled a bit with how much I wanted to recreate the Unix style cp commands, one of those cases where the differences will hurt you. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".
This is *nix though, right?
Still hoping for a Windows user to give it a spin too.
Thanks, putting the / at the end of the src path (no asterisk) did the trick.
The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".
To avoid the shell parsing the *, you can quote just that part, such as "/tmp/zoo/*" and it will be passed as-is to the script. But that's when you get the error
ERROR: Path /tmp/zoo/* does not exist
So we have an option to look for a local path ending with /*, and then simply remote the trailing asterisk - which will mimic the intention of the user. We could of course also try to detect whether there are more than two arguments after cp, i.e. something else than src and dst, and in such case print a help message that instead of calling out the 3rd unexpected argument as the problem, instead print a help like
ERROR: Too many arguments for command "cp". Note that the src and dst arguments do not support wildcard characters.
Here's a quick patch that allows the local source to have a trailing wildcard. I also reworked the help screen and the error message. The param checking barfs on the first param it doesn't recognize, there's no specific handling for "cp" so I put in a plea to look at the help screen when there a non-quoted trailing wildcard.
I also reworked the help for 'cp' a bit, I hope it's a little clearer. I really didn't like the trailing slash stuff. It's still there but not the only way now.
I also noticed that the windows command file had trailing quotes and a bunch of other stuff so it'd be really helpful if someone who 1> knows windows commands and 2> has a windows box could give it a try, especially the trailing slash and trailing asterisk bits. I don't have any clue whether Windows will require quoting the trailing wildcard case.
Thanks Jan! Your suggestion was a good one.
I put in a plea to look at the help screen when there a non-quoted trailing wildcard.
How can you check this? I thought the shell would expand that before handling it over to our script. And I'm not sure it behaves the same way for Windows? But I cannot test Windows right now.
bq: How can you check this?
Basically I don't. The query parsing in the script just fails when it encounters some of the file names since it can't recognize the parameter so I added the bit about wildcards to the generic failure message.
I've got a line on some other folks who can test Windows, thanks for your efforts here!
Erick
Final patch sans Windows testing. We can open new JIRAs if there are Windows issues.
Rats, typo in commit message.
Trunk commit: 0b3ca1bb61cb6768ef99a4ee7f4ac05a71d19f56
6x commit: c22c8bdebb9ab2a5cdb9951ccdec98a0f46f705e
What version? Seems to work here on master