Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10108

bin/solr script recursive copy broken

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 6.6, 7.0
    • None
    • None

    Description

      cp /r zk:/ fails with "cannot create //whatever".

      Attachments

        1. SOLR-10108.patch
          22 kB
          Erick Erickson
        2. SOLR-10108.patch
          22 kB
          Erick Erickson
        3. SOLR-10108.patch
          24 kB
          Erick Erickson

        Issue Links

          Activity

            janhoy Jan Høydahl added a comment -

            What version? Seems to work here on master

            $ bin/solr start -c
            $ bin/solr zk -z localhost:9983 cp -r bin zk:/
            
            Connecting to ZooKeeper at localhost:9983 ...
            Copying from 'bin' to 'zk:/'. ZooKeeper at localhost:9983
            $ bin/solr zk -z localhost:9983 ls -r /bin
            
            Connecting to ZooKeeper at localhost:9983 ...
            Getting listing for Zookeeper node /bin from ZooKeeper at localhost:9983 recurse: true
            /bin
                post
                solr.in.cmd
                solr
                solr-8983.pid
                init.d
                       solr
                solr.in.sh
                install_solr_service.sh
                solr.cmd
                oom_solr.sh
            
            janhoy Jan Høydahl added a comment - What version? Seems to work here on master $ bin/solr start -c $ bin/solr zk -z localhost:9983 cp -r bin zk:/ Connecting to ZooKeeper at localhost:9983 ... Copying from 'bin' to 'zk:/' . ZooKeeper at localhost:9983 $ bin/solr zk -z localhost:9983 ls -r /bin Connecting to ZooKeeper at localhost:9983 ... Getting listing for Zookeeper node /bin from ZooKeeper at localhost:9983 recurse: true /bin post solr.in.cmd solr solr-8983.pid init.d solr solr.in.sh install_solr_service.sh solr.cmd oom_solr.sh
            erickerickson Erick Erickson added a comment -

            Jan:

            Thanks for looking. I'm trying to go the other way (which I should have specified, just tried on 6x since I have it handy):

            bin/solr zk cp -r zk:/ ~/eoezk -z localhost:2181

            Fails with: ERROR: Invalid path string "//configs" caused by empty node name specified 1@c07.de

            where
            bin/solr zk cp -r zk:/whatever ~/eoezk -z localhost:2181
            works.

            The context here was a client who'd accidentally removed the zoo_data directory on all ZKs but still had them running. We hit on the bright idea "hey, we can just dump all of the ZK data since the data is still available until we restart the ZK nodes" but ran into this when trying to copy stuff down.

            So I worked out the issues with the path and got the two-way copy to work, but also noticed another issue. Since ZK nodes can have data whether leaf nodes or not, the current process is lossy since non-leaf nodes don't get their data restored.

            This makes it impossible to backup the collection node and restore it since the collection can have a configset name as data. My take is that copying back and forth should restore intermediate node's data, do you (and others) concur?

            My first-attempt PoC is to create a very special file name something like node_zookeeper_solr.data to put any information associated with non-leaf nodes when the data is not empty as a PoC. That feels like a hack though as there's the possibility of collisions. Hmmm, maybe

            {generated_guid}

            .znode.solr.data? Still possibly a collision if someone, somehow managed to have a znode with a GUID followed by .znode.solr.data I suppose, but that's seems unlikely enough that I'm not willing to worry about it. How about "erick.erickson.was.here.data"? Maybe not.

            WDYT

            erickerickson Erick Erickson added a comment - Jan: Thanks for looking. I'm trying to go the other way (which I should have specified, just tried on 6x since I have it handy): bin/solr zk cp -r zk:/ ~/eoezk -z localhost:2181 Fails with: ERROR: Invalid path string "//configs" caused by empty node name specified 1@c07.de where bin/solr zk cp -r zk:/whatever ~/eoezk -z localhost:2181 works. The context here was a client who'd accidentally removed the zoo_data directory on all ZKs but still had them running. We hit on the bright idea "hey, we can just dump all of the ZK data since the data is still available until we restart the ZK nodes" but ran into this when trying to copy stuff down. So I worked out the issues with the path and got the two-way copy to work, but also noticed another issue. Since ZK nodes can have data whether leaf nodes or not, the current process is lossy since non-leaf nodes don't get their data restored. This makes it impossible to backup the collection node and restore it since the collection can have a configset name as data. My take is that copying back and forth should restore intermediate node's data, do you (and others) concur? My first-attempt PoC is to create a very special file name something like node_zookeeper_solr.data to put any information associated with non-leaf nodes when the data is not empty as a PoC. That feels like a hack though as there's the possibility of collisions. Hmmm, maybe {generated_guid} .znode.solr.data? Still possibly a collision if someone, somehow managed to have a znode with a GUID followed by .znode.solr.data I suppose, but that's seems unlikely enough that I'm not willing to worry about it. How about "erick.erickson.was.here.data"? Maybe not. WDYT
            erickerickson Erick Erickson added a comment -

            Well, it works for me. If someone with a Windows setup could try the following I'd be grateful:

            Set up a SolrCloud instance, just enough to get it running and see the cloud graph.

            > 'bin/solr zk cp -r zk:/ file:/some/dir -z localhost:2181'
            > Shut down Solr
            > Re-initialize ZK (wipe all the data)
            > 'bin/solr zk cp -r file:/some/dir/ zk:/ -z localhost:2181'
            > start Solr again

            If I'm lucky, that'll start Solr up seamlessly. The idea is that the above should save the entire ZK state to a local drive and then restore it. We had a case where the Solr nodes were still up but all the ZK data had been lost and this would have been simpler than what we had to do.

            This is something of a side-effect of making the recursive copy just work to/from the root of ZK. There are problems with recursive copy in the first place when it's not from some node below the root without this patch.

            This patch also handles the fact that non-leaf nodes in Solr can have data and properly saves/restores that data.

            erickerickson Erick Erickson added a comment - Well, it works for me. If someone with a Windows setup could try the following I'd be grateful: Set up a SolrCloud instance, just enough to get it running and see the cloud graph. > 'bin/solr zk cp -r zk:/ file:/some/dir -z localhost:2181' > Shut down Solr > Re-initialize ZK (wipe all the data) > 'bin/solr zk cp -r file:/some/dir/ zk:/ -z localhost:2181' > start Solr again If I'm lucky, that'll start Solr up seamlessly. The idea is that the above should save the entire ZK state to a local drive and then restore it. We had a case where the Solr nodes were still up but all the ZK data had been lost and this would have been simpler than what we had to do. This is something of a side-effect of making the recursive copy just work to/from the root of ZK. There are problems with recursive copy in the first place when it's not from some node below the root without this patch. This patch also handles the fact that non-leaf nodes in Solr can have data and properly saves/restores that data.
            erickerickson Erick Erickson added a comment -

            Still hoping someone will give this a whirl on Windows before committing... I'd like to get this in Solr 6.5 if/when.

            Otherwise I'll commit it this week sometime.

            erickerickson Erick Erickson added a comment - Still hoping someone will give this a whirl on Windows before committing... I'd like to get this in Solr 6.5 if/when. Otherwise I'll commit it this week sometime.
            janhoy Jan Høydahl added a comment -

            So I took it for a spin. I started Solr master with patch applied i cloud mode, then created a collection.
            Then I copied all content of internal zk:/ (9983) to my disk /tmp/zoo. That worked.
            Then I stopped Solr, wiped zoo_data
            I then started a local ZK and copied everything onto zk:/solr (another root)
            Then started Solr with bin/solr start -z localhost:2181/solr/zoo. BINGO

            So your patch seems to work. But I could not figure out how to copy all files from my /tmp/zoo folder into zk root. Specifying solr zk cp /tmp/zoo/* zk:/ did not work:

            Could not complete the zk operation for reason: Path /tmp/zoo/* does not exist

            janhoy Jan Høydahl added a comment - So I took it for a spin. I started Solr master with patch applied i cloud mode, then created a collection. Then I copied all content of internal zk:/ (9983) to my disk /tmp/zoo. That worked. Then I stopped Solr, wiped zoo_data I then started a local ZK and copied everything onto zk:/solr (another root) Then started Solr with bin/solr start -z localhost:2181/solr/zoo. BINGO So your patch seems to work. But I could not figure out how to copy all files from my /tmp/zoo folder into zk root. Specifying solr zk cp /tmp/zoo/* zk:/ did not work: Could not complete the zk operation for reason: Path /tmp/zoo/* does not exist
            erickerickson Erick Erickson added a comment -

            Jan:

            Thanks! try

            solr zk cp /tmp/zoo zk:/
            or
            solr zk cp /tmp/zoo/ zk:/
            

            I put a couple of examples in the help section, did I mess those up?

             to copy to local: 'bin/solr zk cp -r zk:/ /some/dir -z localhost:2181'
            to restore to ZK: 'bin/solr zk cp -r /some/dir/ zk:/ -z localhost:2181
            

            I struggled a bit with how much I wanted to recreate the Unix style cp commands, one of those cases where the differences will hurt you. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".

            This is *nix though, right?

            Still hoping for a Windows user to give it a spin too.

            erickerickson Erick Erickson added a comment - Jan: Thanks! try solr zk cp /tmp/zoo zk:/ or solr zk cp /tmp/zoo/ zk:/ I put a couple of examples in the help section, did I mess those up? to copy to local: 'bin/solr zk cp -r zk:/ /some/dir -z localhost:2181' to restore to ZK: 'bin/solr zk cp -r /some/dir/ zk:/ -z localhost:2181 I struggled a bit with how much I wanted to recreate the Unix style cp commands, one of those cases where the differences will hurt you. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting". This is *nix though, right? Still hoping for a Windows user to give it a spin too.
            janhoy Jan Høydahl added a comment - - edited

            Thanks, putting the / at the end of the src path (no asterisk) did the trick.

            The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting".

            To avoid the shell parsing the *, you can quote just that part, such as "/tmp/zoo/*" and it will be passed as-is to the script. But that's when you get the error

            ERROR: Path /tmp/zoo/* does not exist
            

            So we have an option to look for a local path ending with /*, and then simply remote the trailing asterisk - which will mimic the intention of the user. We could of course also try to detect whether there are more than two arguments after cp, i.e. something else than src and dst, and in such case print a help message that instead of calling out the 3rd unexpected argument as the problem, instead print a help like

            ERROR: Too many arguments for command "cp". Note that the src and dst arguments do not support wildcard characters.
            
            janhoy Jan Høydahl added a comment - - edited Thanks, putting the / at the end of the src path (no asterisk) did the trick. The problem I remember having was that the shell gets in there and expands the '/tmp/*' form and gives me a list of the children of tmp that makes parsing the parameters "interesting". To avoid the shell parsing the * , you can quote just that part, such as "/tmp/zoo/*" and it will be passed as-is to the script. But that's when you get the error ERROR: Path /tmp/zoo/* does not exist So we have an option to look for a local path ending with /* , and then simply remote the trailing asterisk - which will mimic the intention of the user. We could of course also try to detect whether there are more than two arguments after cp , i.e. something else than src and dst, and in such case print a help message that instead of calling out the 3rd unexpected argument as the problem, instead print a help like ERROR: Too many arguments for command "cp". Note that the src and dst arguments do not support wildcard characters.
            erickerickson Erick Erickson added a comment -

            Here's a quick patch that allows the local source to have a trailing wildcard. I also reworked the help screen and the error message. The param checking barfs on the first param it doesn't recognize, there's no specific handling for "cp" so I put in a plea to look at the help screen when there a non-quoted trailing wildcard.

            I also reworked the help for 'cp' a bit, I hope it's a little clearer. I really didn't like the trailing slash stuff. It's still there but not the only way now.

            I also noticed that the windows command file had trailing quotes and a bunch of other stuff so it'd be really helpful if someone who 1> knows windows commands and 2> has a windows box could give it a try, especially the trailing slash and trailing asterisk bits. I don't have any clue whether Windows will require quoting the trailing wildcard case.

            Thanks Jan! Your suggestion was a good one.

            erickerickson Erick Erickson added a comment - Here's a quick patch that allows the local source to have a trailing wildcard. I also reworked the help screen and the error message. The param checking barfs on the first param it doesn't recognize, there's no specific handling for "cp" so I put in a plea to look at the help screen when there a non-quoted trailing wildcard. I also reworked the help for 'cp' a bit, I hope it's a little clearer. I really didn't like the trailing slash stuff. It's still there but not the only way now. I also noticed that the windows command file had trailing quotes and a bunch of other stuff so it'd be really helpful if someone who 1> knows windows commands and 2> has a windows box could give it a try, especially the trailing slash and trailing asterisk bits. I don't have any clue whether Windows will require quoting the trailing wildcard case. Thanks Jan! Your suggestion was a good one.
            janhoy Jan Høydahl added a comment -

            I put in a plea to look at the help screen when there a non-quoted trailing wildcard.

            How can you check this? I thought the shell would expand that before handling it over to our script. And I'm not sure it behaves the same way for Windows? But I cannot test Windows right now.

            janhoy Jan Høydahl added a comment - I put in a plea to look at the help screen when there a non-quoted trailing wildcard. How can you check this? I thought the shell would expand that before handling it over to our script. And I'm not sure it behaves the same way for Windows? But I cannot test Windows right now.
            erickerickson Erick Erickson added a comment -

            bq: How can you check this?

            Basically I don't. The query parsing in the script just fails when it encounters some of the file names since it can't recognize the parameter so I added the bit about wildcards to the generic failure message.

            I've got a line on some other folks who can test Windows, thanks for your efforts here!

            Erick

            erickerickson Erick Erickson added a comment - bq: How can you check this? Basically I don't. The query parsing in the script just fails when it encounters some of the file names since it can't recognize the parameter so I added the bit about wildcards to the generic failure message. I've got a line on some other folks who can test Windows, thanks for your efforts here! Erick
            erickerickson Erick Erickson added a comment -

            Final patch sans Windows testing. We can open new JIRAs if there are Windows issues.

            erickerickson Erick Erickson added a comment - Final patch sans Windows testing. We can open new JIRAs if there are Windows issues.
            erickerickson Erick Erickson added a comment -

            Rats, typo in commit message.

            Trunk commit: 0b3ca1bb61cb6768ef99a4ee7f4ac05a71d19f56
            6x commit: c22c8bdebb9ab2a5cdb9951ccdec98a0f46f705e

            erickerickson Erick Erickson added a comment - Rats, typo in commit message. Trunk commit: 0b3ca1bb61cb6768ef99a4ee7f4ac05a71d19f56 6x commit: c22c8bdebb9ab2a5cdb9951ccdec98a0f46f705e

            People

              erickerickson Erick Erickson
              erickerickson Erick Erickson
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: