[SOLR-10108] bin/solr script recursive copy broken - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.6, 7.0
Component/s: None
Labels:
None

Description

cp /r zk:/ fails with "cannot create //whatever".

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

SOLR-10108.patch
27/Mar/17 19:16
22 kB
Erick Erickson
SOLR-10108.patch
16/Mar/17 00:50
22 kB
Erick Erickson
SOLR-10108.patch
20/Feb/17 04:36
24 kB
Erick Erickson

Issue Links

relates to

SOLR-10387 zkTransfer normalizes destination path incorrectly if source is a windows directory

Closed

Activity

Ascending order - Click to sort in descending order

Jan Høydahl added a comment - 13/Feb/17 09:07

What version? Seems to work here on master

$ bin/solr start -c
$ bin/solr zk -z localhost:9983 cp -r bin zk:/

Connecting to ZooKeeper at localhost:9983 ...
Copying from 'bin' to 'zk:/'. ZooKeeper at localhost:9983
$ bin/solr zk -z localhost:9983 ls -r /bin

Connecting to ZooKeeper at localhost:9983 ...
Getting listing for Zookeeper node /bin from ZooKeeper at localhost:9983 recurse: true
/bin
    post
    solr.in.cmd
    solr
    solr-8983.pid
    init.d
           solr
    solr.in.sh
    install_solr_service.sh
    solr.cmd
    oom_solr.sh

Jan Høydahl added a comment - 13/Feb/17 09:07 What version? Seems to work here on master $ bin/solr start -c $ bin/solr zk -z localhost:9983 cp -r bin zk:/ Connecting to ZooKeeper at localhost:9983 ... Copying from 'bin' to 'zk:/' . ZooKeeper at localhost:9983 $ bin/solr zk -z localhost:9983 ls -r /bin Connecting to ZooKeeper at localhost:9983 ... Getting listing for Zookeeper node /bin from ZooKeeper at localhost:9983 recurse: true /bin post solr.in.cmd solr solr-8983.pid init.d solr solr.in.sh install_solr_service.sh solr.cmd oom_solr.sh

Erick Erickson added a comment - 13/Feb/17 16:40

Jan:

Thanks for looking. I'm trying to go the other way (which I should have specified, just tried on 6x since I have it handy):

bin/solr zk cp -r zk:/ ~/eoezk -z localhost:2181

Fails with: ERROR: Invalid path string "//configs" caused by empty node name specified 1@c07.de

where
bin/solr zk cp -r zk:/whatever ~/eoezk -z localhost:2181
works.

The context here was a client who'd accidentally removed the zoo_data directory on all ZKs but still had them running. We hit on the bright idea "hey, we can just dump all of the ZK data since the data is still available until we restart the ZK nodes" but ran into this when trying to copy stuff down.

So I worked out the issues with the path and got the two-way copy to work, but also noticed another issue. Since ZK nodes can have data whether leaf nodes or not, the current process is lossy since non-leaf nodes don't get their data restored.

This makes it impossible to backup the collection node and restore it since the collection can have a configset name as data. My take is that copying back and forth should restore intermediate node's data, do you (and others) concur?

My first-attempt PoC is to create a very special file name something like node_zookeeper_solr.data to put any information associated with non-leaf nodes when the data is not empty as a PoC. That feels like a hack though as there's the possibility of collisions. Hmmm, maybe

{generated_guid}

.znode.solr.data? Still possibly a collision if someone, somehow managed to have a znode with a GUID followed by .znode.solr.data I suppose, but that's seems unlikely enough that I'm not willing to worry about it. How about "erick.erickson.was.here.data"? Maybe not.

WDYT

Erick Erickson added a comment - 13/Feb/17 16:40 Jan: Thanks for looking. I'm trying to go the other way (which I should have specified, just tried on 6x since I have it handy): bin/solr zk cp -r zk:/ ~/eoezk -z localhost:2181 Fails with: ERROR: Invalid path string "//configs" caused by empty node name specified 1@c07.de where bin/solr zk cp -r zk:/whatever ~/eoezk -z localhost:2181 works. The context here was a client who'd accidentally removed the zoo_data directory on all ZKs but still had them running. We hit on the bright idea "hey, we can just dump all of the ZK data since the data is still available until we restart the ZK nodes" but ran into this when trying to copy stuff down. So I worked out the issues with the path and got the two-way copy to work, but also noticed another issue. Since ZK nodes can have data whether leaf nodes or not, the current process is lossy since non-leaf nodes don't get their data restored. This makes it impossible to backup the collection node and restore it since the collection can have a configset name as data. My take is that copying back and forth should restore intermediate node's data, do you (and others) concur? My first-attempt PoC is to create a very special file name something like node_zookeeper_solr.data to put any information associated with non-leaf nodes when the data is not empty as a PoC. That feels like a hack though as there's the possibility of collisions. Hmmm, maybe {generated_guid} .znode.solr.data? Still possibly a collision if someone, somehow managed to have a znode with a GUID followed by .znode.solr.data I suppose, but that's seems unlikely enough that I'm not willing to worry about it. How about "erick.erickson.was.here.data"? Maybe not. WDYT

Erick Erickson added a comment - 20/Feb/17 04:36

Well, it works for me. If someone with a Windows setup could try the following I'd be grateful:

Set up a SolrCloud instance, just enough to get it running and see the cloud graph.

> 'bin/solr zk cp -r zk:/ file:/some/dir -z localhost:2181'
> Shut down Solr
> Re-initialize ZK (wipe all the data)
> 'bin/solr zk cp -r file:/some/dir/ zk:/ -z localhost:2181'
> start Solr again

If I'm lucky, that'll start Solr up seamlessly. The idea is that the above should save the entire ZK state to a local drive and then restore it. We had a case where the Solr nodes were still up but all the ZK data had been lost and this would have been simpler than what we had to do.

This is something of a side-effect of making the recursive copy just work to/from the root of ZK. There are problems with recursive copy in the first place when it's not from some node below the root without this patch.

This patch also handles the fact that non-leaf nodes in Solr can have data and properly saves/restores that data.

Erick Erickson added a comment - 20/Feb/17 04:36 Well, it works for me. If someone with a Windows setup could try the following I'd be grateful: Set up a SolrCloud instance, just enough to get it running and see the cloud graph. > 'bin/solr zk cp -r zk:/ file:/some/dir -z localhost:2181' > Shut down Solr > Re-initialize ZK (wipe all the data) > 'bin/solr zk cp -r file:/some/dir/ zk:/ -z localhost:2181' > start Solr again If I'm lucky, that'll start Solr up seamlessly. The idea is that the above should save the entire ZK state to a local drive and then restore it. We had a case where the Solr nodes were still up but all the ZK data had been lost and this would have been simpler than what we had to do. This is something of a side-effect of making the recursive copy just work to/from the root of ZK. There are problems with recursive copy in the first place when it's not from some node below the root without this patch. This patch also handles the fact that non-leaf nodes in Solr can have data and properly saves/restores that data.

Erick Erickson added a comment - 14/Mar/17 16:26

Still hoping someone will give this a whirl on Windows before committing... I'd like to get this in Solr 6.5 if/when.

Otherwise I'll commit it this week sometime.

Erick Erickson added a comment - 14/Mar/17 16:26 Still hoping someone will give this a whirl on Windows before committing... I'd like to get this in Solr 6.5 if/when. Otherwise I'll commit it this week sometime.

Jan Høydahl added a comment - 15/Mar/17 13:59

So I took it for a spin. I started Solr master with patch applied i cloud mode, then created a collection.
Then I copied all content of internal zk:/ (9983) to my disk /tmp/zoo. That worked.
Then I stopped Solr, wiped zoo_data
I then started a local ZK and copied everything onto zk:/solr (another root)
Then started Solr with bin/solr start -z localhost:2181/solr/zoo. BINGO

So your patch seems to work. But I could not figure out how to copy all files from my /tmp/zoo folder into zk root. Specifying solr zk cp /tmp/zoo/* zk:/ did not work:

Could not complete the zk operation for reason: Path /tmp/zoo/* does not exist

Jan Høydahl added a comment - 15/Mar/17 13:59 So I took it for a spin. I started Solr master with patch applied i cloud mode, then created a collection. Then I copied all content of internal zk:/ (9983) to my disk /tmp/zoo. That worked. Then I stopped Solr, wiped zoo_data I then started a local ZK and copied everything onto zk:/solr (another root) Then started Solr with bin/solr start -z localhost:2181/solr/zoo. BINGO So your patch seems to work. But I could not figure out how to copy all files from my /tmp/zoo folder into zk root. Specifying solr zk cp /tmp/zoo/* zk:/ did not work: Could not complete the zk operation for reason: Path /tmp/zoo/* does not exist

Erick Erickson added a comment - 15/Mar/17 15:16

Jan:

Thanks! try

solr zk cp /tmp/zoo zk:/
or
solr zk cp /tmp/zoo/ zk:/

I put a couple of examples in the help section, did I mess those up?

 to copy to local: 'bin/solr zk cp -r zk:/ /some/dir -z localhost:2181'
to restore to ZK: 'bin/solr zk cp -r /some/dir/ zk:/ -z localhost:2181

I struggled a bit with how much I wanted to recreate the Unix style cp commands, one of those cases where the differences will hurt you. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".

This is *nix though, right?

Still hoping for a Windows user to give it a spin too.

Erick Erickson added a comment - 15/Mar/17 15:16 Jan: Thanks! try solr zk cp /tmp/zoo zk:/ or solr zk cp /tmp/zoo/ zk:/ I put a couple of examples in the help section, did I mess those up? to copy to local: 'bin/solr zk cp -r zk:/ /some/dir -z localhost:2181' to restore to ZK: 'bin/solr zk cp -r /some/dir/ zk:/ -z localhost:2181 I struggled a bit with how much I wanted to recreate the Unix style cp commands, one of those cases where the differences will hurt you. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting". This is *nix though, right? Still hoping for a Windows user to give it a spin too.

Jan Høydahl added a comment - 15/Mar/17 21:43 - edited

Thanks, putting the / at the end of the src path (no asterisk) did the trick.

The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".

To avoid the shell parsing the *, you can quote just that part, such as "/tmp/zoo/*" and it will be passed as-is to the script. But that's when you get the error

ERROR: Path /tmp/zoo/* does not exist

So we have an option to look for a local path ending with /*, and then simply remote the trailing asterisk - which will mimic the intention of the user. We could of course also try to detect whether there are more than two arguments after cp, i.e. something else than src and dst, and in such case print a help message that instead of calling out the 3rd unexpected argument as the problem, instead print a help like

ERROR: Too many arguments for command "cp". Note that the src and dst arguments do not support wildcard characters.

Jan Høydahl added a comment - 15/Mar/17 21:43 - edited Thanks, putting the / at the end of the src path (no asterisk) did the trick. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting". To avoid the shell parsing the * , you can quote just that part, such as "/tmp/zoo/*" and it will be passed as-is to the script. But that's when you get the error ERROR: Path /tmp/zoo/* does not exist So we have an option to look for a local path ending with /* , and then simply remote the trailing asterisk - which will mimic the intention of the user. We could of course also try to detect whether there are more than two arguments after cp , i.e. something else than src and dst, and in such case print a help message that instead of calling out the 3rd unexpected argument as the problem, instead print a help like ERROR: Too many arguments for command "cp". Note that the src and dst arguments do not support wildcard characters.

Erick Erickson added a comment - 16/Mar/17 00:50

Here's a quick patch that allows the local source to have a trailing wildcard. I also reworked the help screen and the error message. The param checking barfs on the first param it doesn't recognize, there's no specific handling for "cp" so I put in a plea to look at the help screen when there a non-quoted trailing wildcard.

I also reworked the help for 'cp' a bit, I hope it's a little clearer. I really didn't like the trailing slash stuff. It's still there but not the only way now.

I also noticed that the windows command file had trailing quotes and a bunch of other stuff so it'd be really helpful if someone who 1> knows windows commands and 2> has a windows box could give it a try, especially the trailing slash and trailing asterisk bits. I don't have any clue whether Windows will require quoting the trailing wildcard case.

Thanks Jan! Your suggestion was a good one.

Erick Erickson added a comment - 16/Mar/17 00:50 Here's a quick patch that allows the local source to have a trailing wildcard. I also reworked the help screen and the error message. The param checking barfs on the first param it doesn't recognize, there's no specific handling for "cp" so I put in a plea to look at the help screen when there a non-quoted trailing wildcard. I also reworked the help for 'cp' a bit, I hope it's a little clearer. I really didn't like the trailing slash stuff. It's still there but not the only way now. I also noticed that the windows command file had trailing quotes and a bunch of other stuff so it'd be really helpful if someone who 1> knows windows commands and 2> has a windows box could give it a try, especially the trailing slash and trailing asterisk bits. I don't have any clue whether Windows will require quoting the trailing wildcard case. Thanks Jan! Your suggestion was a good one.

Jan Høydahl added a comment - 16/Mar/17 09:41

I put in a plea to look at the help screen when there a non-quoted trailing wildcard.

How can you check this? I thought the shell would expand that before handling it over to our script. And I'm not sure it behaves the same way for Windows? But I cannot test Windows right now.

Jan Høydahl added a comment - 16/Mar/17 09:41 I put in a plea to look at the help screen when there a non-quoted trailing wildcard. How can you check this? I thought the shell would expand that before handling it over to our script. And I'm not sure it behaves the same way for Windows? But I cannot test Windows right now.

Erick Erickson added a comment - 16/Mar/17 14:51

bq: How can you check this?

Basically I don't. The query parsing in the script just fails when it encounters some of the file names since it can't recognize the parameter so I added the bit about wildcards to the generic failure message.

I've got a line on some other folks who can test Windows, thanks for your efforts here!

Erick

Erick Erickson added a comment - 16/Mar/17 14:51 bq: How can you check this? Basically I don't. The query parsing in the script just fails when it encounters some of the file names since it can't recognize the parameter so I added the bit about wildcards to the generic failure message. I've got a line on some other folks who can test Windows, thanks for your efforts here! Erick

Erick Erickson added a comment - 27/Mar/17 19:16

Final patch sans Windows testing. We can open new JIRAs if there are Windows issues.

Erick Erickson added a comment - 27/Mar/17 19:16 Final patch sans Windows testing. We can open new JIRAs if there are Windows issues.

Erick Erickson added a comment - 27/Mar/17 19:28

Rats, typo in commit message.

Trunk commit: 0b3ca1bb61cb6768ef99a4ee7f4ac05a71d19f56
6x commit: c22c8bdebb9ab2a5cdb9951ccdec98a0f46f705e

Erick Erickson added a comment - 27/Mar/17 19:28 Rats, typo in commit message. Trunk commit: 0b3ca1bb61cb6768ef99a4ee7f4ac05a71d19f56 6x commit: c22c8bdebb9ab2a5cdb9951ccdec98a0f46f705e

People

Assignee:: Erick Erickson

Reporter:: Erick Erickson

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 09/Feb/17 03:15

Updated:: 08/Jun/19 15:31

Resolved:: 27/Mar/17 19:28