Solr / SOLR-7294

Migrate API fails with: Invalid status request: notfoundretried 6times

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.10.4, 5.0
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Steps to reproduce:

      1. Create a two node cluster
      2. Create a collection called "source" with 1 shard, 1 replica
      3. Add 1000 docs with prefix a!
      4. Add 100 docs with prefix b! and c! each
      5. Create a new target collection with 1 shard, 1 replica and ensure that it is created on a different node than "source"
      6. Issue a migrate API call with an async parameter:
        http://localhost:8983/solr/admin/collections?action=migrate&split.key=a!&collection=gettingstarted&target.collection=target&wt=json&async=acid
        

      The above fails with:

      ERROR - 2015-03-23 22:50:11.349; org.apache.solr.common.SolrException; Collection: gettingstarted operation: migrate failed:org.apache.solr.common.SolrException: Invalid status request: notfoundretried 6times
              at org.apache.solr.cloud.OverseerCollectionProcessor.waitForCoreAdminAsyncCallToComplete(OverseerCollectionProcessor.java:2807)
              at org.apache.solr.cloud.OverseerCollectionProcessor.waitForAsyncCallsToComplete(OverseerCollectionProcessor.java:2753)
              at org.apache.solr.cloud.OverseerCollectionProcessor.completeAsyncRequest(OverseerCollectionProcessor.java:2229)
              at org.apache.solr.cloud.OverseerCollectionProcessor.migrateKey(OverseerCollectionProcessor.java:2200)
              at org.apache.solr.cloud.OverseerCollectionProcessor.migrate(OverseerCollectionProcessor.java:1984)
              at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:637)
              at org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2864)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
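The request from step 6, and a follow-up status check for the same async id, can be sketched as plain URL construction. This is only an illustration: `CollectionsApiUrls` and its methods are hypothetical helpers, not part of Solr or SolrJ, and the status lookup assumes the Collections API `REQUESTSTATUS` action with a `requestid` parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class CollectionsApiUrls {
    static final String BASE = "http://localhost:8983/solr/admin/collections";

    // Joins parameters into a query string, URL-encoding each key and value.
    static String buildUrl(Map<String, String> params) {
        StringBuilder sb = new StringBuilder(BASE);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    // The migrate call from step 6, using the collection names from that request.
    static String migrateUrl(String asyncId) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("action", "migrate");
        p.put("split.key", "a!");
        p.put("collection", "gettingstarted");
        p.put("target.collection", "target");
        p.put("wt", "json");
        p.put("async", asyncId);
        return buildUrl(p);
    }

    // Status lookup for the same async id (assumes the REQUESTSTATUS action).
    static String requestStatusUrl(String asyncId) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("action", "REQUESTSTATUS");
        p.put("requestid", asyncId);
        p.put("wt", "json");
        return buildUrl(p);
    }

    public static void main(String[] args) {
        System.out.println(migrateUrl("acid"));
        System.out.println(requestStatusUrl("acid"));
    }
}
```

Because the migrate call is asynchronous, the failure above shows up in the Overseer log and in the status of the async request rather than in the immediate HTTP response.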
      

      Thanks to Jessica Cheng Mallet for finding this bug.

      1. SOLR-7294.patch
        13 kB
        Shalin Shekhar Mangar
      2. SOLR-7294.patch
        7 kB
        Shalin Shekhar Mangar
      3. source-leader.log
        409 kB
        Shalin Shekhar Mangar
      4. target.log
        87 kB
        Shalin Shekhar Mangar

        Activity

        Shalin Shekhar Mangar added a comment -

        Logs attached from the source leader and the target node.

        Shalin Shekhar Mangar added a comment -

        The problem is in the OCP.migrateKey method:

        log.info("Requesting merge of temp source collection replica to target leader");
        params = new ModifiableSolrParams();
        params.set(CoreAdminParams.ACTION, CoreAdminAction.MERGEINDEXES.toString());
        params.set(CoreAdminParams.CORE, targetLeader.getStr("core"));
        params.set(CoreAdminParams.SRC_CORE, tempCollectionReplica2);

        // BUG: the async request is registered against the source leader's node...
        setupAsyncRequest(asyncId, requestMap, params, sourceLeader.getNodeName());

        // ...but the request itself is sent to the target leader's node.
        sendShardRequest(targetLeader.getNodeName(), params, shardHandler);
        collectShardResponses(results, true,
            "MIGRATE failed to merge " + tempCollectionReplica2 +
                " to " + targetLeader.getStr("core") + " on node: " + targetLeader.getNodeName(),
            shardHandler);

        completeAsyncRequest(asyncId, requestMap, results);
        

        Notice that setupAsyncRequest is called with sourceLeader.getNodeName(), but the actual request is sent to targetLeader.getNodeName(). Fixing this part is easy enough.

        I tried to see why our existing AsyncMigrateRouteKey test doesn't tickle this problem, and I was surprised to find that the test asks the wrong node but always gets the right status. Then I realized why: all the nodes in our tests are loaded by the same classloader, and since the core admin handler keeps the requests in a static map, any node can report the status of an async core admin API call. The request map in CoreAdminHandler doesn't need to be static. Once I changed the request map to an instance variable, the existing test reproduced the problem easily.
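The masking effect of the static map can be reproduced in isolation. The following is a minimal sketch, assuming two handler objects that stand in for two nodes running in one JVM. `StaticMapHandler` and `InstanceMapHandler` are hypothetical classes written for this illustration and are not Solr's CoreAdminHandler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-node handler that tracks async requests
// in a static map. Every instance in the same classloader shares the map.
class StaticMapHandler {
    private static final Map<String, String> REQUESTS = new ConcurrentHashMap<>();

    void submit(String requestId) { REQUESTS.put(requestId, "running"); }

    String status(String requestId) { return REQUESTS.getOrDefault(requestId, "notfound"); }
}

// Same handler, but each instance (each "node") owns its request map.
class InstanceMapHandler {
    private final Map<String, String> requests = new ConcurrentHashMap<>();

    void submit(String requestId) { requests.put(requestId, "running"); }

    String status(String requestId) { return requests.getOrDefault(requestId, "notfound"); }
}

class SharedMapDemo {
    public static void main(String[] args) {
        // The request is submitted to the target node but status is asked of the source node.
        StaticMapHandler staticSource = new StaticMapHandler();
        StaticMapHandler staticTarget = new StaticMapHandler();
        staticTarget.submit("demo-1");
        System.out.println("static map, wrong node:   " + staticSource.status("demo-1"));

        InstanceMapHandler source = new InstanceMapHandler();
        InstanceMapHandler target = new InstanceMapHandler();
        target.submit("demo-1");
        System.out.println("instance map, wrong node: " + source.status("demo-1"));
        System.out.println("instance map, right node: " + target.status("demo-1"));
    }
}
```

With the static map, the wrong node still answers "running", so a test running all nodes in one JVM cannot see the bug. With per-instance maps, the wrong node answers "notfound", which matches the failure seen on a real two-node cluster.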

        We should refactor the code in OCP such that these situations become impossible. I'll put up a patch.

        I'll also create an issue to enforce a different class loader for each jetty.

        Shalin Shekhar Mangar added a comment -

        Changes:

        1. Changed the formatting of the exception message
        2. Recorded the right (target) node name in the request map during migrate
        3. Removed the static request map from CoreAdminHandler and used an instance variable instead
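For the first change, the original message ("notfoundretried 6times") reads like string concatenation without separators. Below is a rough sketch of a status poller that gives up after repeated "notfound" answers and reports it with a readable message. The class name, method signature, state strings, and message wording are all assumptions made for illustration; this is not the code from the patch.

```java
import java.util.function.Function;

// Hypothetical poller; statusLookup maps a request id to a state string.
class AsyncStatusPoller {
    static String waitForCompletion(Function<String, String> statusLookup,
                                    String requestId,
                                    int maxNotFoundRetries,
                                    int maxPolls) {
        int notFound = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            String state = statusLookup.apply(requestId);
            if ("completed".equals(state) || "failed".equals(state)) {
                return state;
            }
            if ("notfound".equals(state) && ++notFound >= maxNotFoundRetries) {
                // Separators and quotes keep the state and retry count readable.
                throw new IllegalStateException("Invalid status request for requestId: "
                    + requestId + " - '" + state + "'. Retried " + notFound + " times");
            }
            // A real poller would sleep between attempts; omitted to keep the sketch short.
        }
        throw new IllegalStateException("Gave up waiting for requestId: " + requestId);
    }

    public static void main(String[] args) {
        try {
            waitForCompletion(id -> "notfound", "acid", 6, 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```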
        Shalin Shekhar Mangar added a comment -

        I folded the setupAsyncRequest method into sendShardRequest so that this kind of bug can't happen.

        All tests pass. I think this is ready.
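The idea behind that refactoring can be sketched as follows: when the method that sends the request also records the async id, both steps use the same node name argument and cannot disagree. `ShardRequestSender` and its signature are hypothetical and simplified; this is not the patched OverseerCollectionProcessor code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signatures are illustrative, not Solr's.
class ShardRequestSender {
    private int counter = 0;

    // Nodes that requests were "sent" to, in order; stands in for real network I/O.
    final List<String> sentTo = new ArrayList<>();

    // Sends params to nodeName. When asyncId is non-null, the per-request id is
    // generated and recorded against the same nodeName inside this one call.
    void sendShardRequest(String nodeName, Map<String, String> params,
                          String asyncId, Map<String, String> requestMap) {
        if (asyncId != null) {
            String coreAdminAsyncId = asyncId + "-" + (counter++);
            params.put("async", coreAdminAsyncId);
            requestMap.put(nodeName, coreAdminAsyncId);
        }
        sentTo.add(nodeName);
    }

    public static void main(String[] args) {
        ShardRequestSender sender = new ShardRequestSender();
        Map<String, String> requestMap = new HashMap<>();
        Map<String, String> params = new HashMap<>();
        params.put("action", "MERGEINDEXES");
        sender.sendShardRequest("target_node", params, "acid", requestMap);
        System.out.println("sent to: " + sender.sentTo + ", tracked: " + requestMap);
    }
}
```

The caller passes the node name once, so there is no second call site where a different node (such as the source leader) could be supplied by mistake.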

        ASF subversion and git services added a comment -

        Commit 1668956 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1668956 ]

        SOLR-7294: Migrate API fails with 'Invalid status request: notfoundretried 6times' message

        ASF subversion and git services added a comment -

        Commit 1668957 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1668957 ]

        SOLR-7294: Migrate API fails with 'Invalid status request: notfoundretried 6times' message

        Timothy Potter added a comment -

        Bulk close after 5.1 release


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Shalin Shekhar Mangar
          • Votes:
            0
            Watchers:
            3
