Solr / SOLR-4744

Version conflict error during shard split test

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.3.1, 4.4
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      ShardSplitTest sometimes fails with the following error:

      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state invoked for collection: collection1
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1 to inactive
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_0 to active
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.861; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Update shard state shard1_1 to active
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.873; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={wt=javabin&version=2} {add=[169 (1432319507166134272)]} 0 2
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.877; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.884; org.apache.solr.update.processor.LogUpdateProcessor; [collection1_shard1_1_replica1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {} 0 1
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.885; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp= path=/update params={distrib.from=http://127.0.0.1:41028/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {add=[169 (1432319507173474304)]} 0 2
      [junit4:junit4]   1> ERROR - 2013-04-14 19:05:26.885; org.apache.solr.common.SolrException; shard update error StdNode: http://127.0.0.1:41028/collection1_shard1_1_replica1/:org.apache.solr.common.SolrException: version conflict for 169 expected=1432319507173474304 actual=-1
      [junit4:junit4]   1> 	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:404)
      [junit4:junit4]   1> 	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
      [junit4:junit4]   1> 	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
      [junit4:junit4]   1> 	at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
      [junit4:junit4]   1> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      [junit4:junit4]   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      [junit4:junit4]   1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      [junit4:junit4]   1> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      [junit4:junit4]   1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      [junit4:junit4]   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
      [junit4:junit4]   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      [junit4:junit4]   1> 	at java.lang.Thread.run(Thread.java:679)
      [junit4:junit4]   1> 
      [junit4:junit4]   1> INFO  - 2013-04-14 19:05:26.886; org.apache.solr.update.processor.DistributedUpdateProcessor; try and ask http://127.0.0.1:41028 to recover
      

      The failure is hard to reproduce and very timing sensitive. These kinds of failures have always been seen right after the "updateshardstate" action.

      Attachments

      1. SOLR-4744__no_more_NPE.patch
        1 kB
        Hoss Man
      2. SOLR-4744.patch
        19 kB
        Shalin Shekhar Mangar
      3. SOLR-4744.patch
        14 kB
        Shalin Shekhar Mangar


          Activity

          Shalin Shekhar Mangar added a comment -

          Consider the following scenario

          1. The overseer collection processor asks overseer to update the state of parent to INACTIVE and the sub shards to ACTIVE
          2. The parent shard leader receives an update request
          3. The parent shard leader thinks that it is still the leader of an ACTIVE shard and therefore tries to send the request to the sub shard leaders (FROMLEADER update containing "from.shard.parent" param). This is done asynchronously so the client has already been given a success status.
          4. The sub shard leader receives such a request but its cluster state is already up to date, so it rejects the update, saying that it is already a leader and no longer in construction state.
          5. The parent shard leader asks the sub shard leader to recover, which is basically a no-op for sub shard leaders.
          6. The sub shard misses such a document update (a minimal sketch of this race follows the list).
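
          As a rough illustration of this sequence, here is a standalone sketch (none of these types are Solr's; the names are invented for the example) of a parent leader whose cached cluster state is stale forwarding a document to a sub shard leader that has already switched to ACTIVE:

          // Standalone model of the stale-cluster-state race during shard split.
          // These classes are illustrative only; they are not Solr's.
          public class ShardSplitRaceSketch {

            enum ShardState { CONSTRUCTION, ACTIVE, INACTIVE }

            static class SubShardLeader {
              volatile ShardState state = ShardState.CONSTRUCTION;

              // Models the defensive check that rejects parent-forwarded updates
              // once the sub shard no longer considers itself under construction.
              void receiveFromParentLeader(String docId) {
                if (state != ShardState.CONSTRUCTION) {
                  throw new IllegalStateException(
                      "coming from parent shard leader but we are not in construction state: doc " + docId);
                }
                System.out.println("buffered " + docId + " from parent leader");
              }
            }

            static class ParentShardLeader {
              // Stale view: the parent still believes it leads an ACTIVE shard.
              ShardState cachedStateOfSelf = ShardState.ACTIVE;

              void indexAndForward(String docId, SubShardLeader subLeader) {
                System.out.println("parent indexed " + docId + " locally");
                // In Solr the forward is asynchronous, so the client has already been
                // told "success" by the time this rejection shows up in the logs.
                try {
                  subLeader.receiveFromParentLeader(docId);
                } catch (IllegalStateException e) {
                  System.out.println("forward rejected, sub shard misses the doc: " + e.getMessage());
                }
              }
            }

            public static void main(String[] args) {
              SubShardLeader sub = new SubShardLeader();
              ParentShardLeader parent = new ParentShardLeader();

              // The overseer flips the sub shard to ACTIVE, but the parent's cached
              // state is not refreshed yet (steps 1-3 above).
              sub.state = ShardState.ACTIVE;

              // Steps 4-6: the update succeeds locally, the forwarded copy is rejected,
              // and the sub shard silently misses document 169.
              parent.indexAndForward("169", sub);
            }
          }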

          SOLR-4795 exposed the underlying problem clearly. The exceptions in the logs on Jenkins are now:

          [junit4:junit4]   1> INFO  - 2013-05-10 17:12:00.128; org.apache.solr.update.processor.LogUpdateProcessor; [collection1_shard1_1_replica1] webapp=/sx path=/update params={distrib.from=http://127.0.0.1:47193/sx/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {} 0 1
          [junit4:junit4]   1> INFO  - 2013-05-10 17:12:00.128; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/sx path=/update params={distrib.from=http://127.0.0.1:47193/sx/collection1/&update.distrib=FROMLEADER&wt=javabin&distrib.from.parent=shard1&version=2} {add=[296 (1434667890899943424)]} 0 1
          [junit4:junit4]   1> ERROR - 2013-05-10 17:12:00.129; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Request says it is coming from parent shard leader but we are not in construction state
          [junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:327)
          [junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:232)
          [junit4:junit4]   1> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:394)
          [junit4:junit4]   1> 	at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
          [junit4:junit4]   1> 	at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
          [junit4:junit4]   1> 	at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
          [junit4:junit4]   1> 	at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
          [junit4:junit4]   1> 	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
          [junit4:junit4]   1> 	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
          [junit4:junit4]   1> 	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1832)
          [junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
          [junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
          [junit4:junit4]   1> 	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
          [junit4:junit4]   1> 	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
          

          These happen right after the state is switched, until both the parent leader and the sub shard leader have the latest cluster state.

          The possible fixes are:

          1. Create a new recovery strategy for sub shard replication
          2. Replicate to sub shard leader synchronously (before local update)
          3. Switch the parent shard to INACTIVE first, wait for it to receive the updated cluster state, and then switch the sub shards to ACTIVE (a rough sketch of this ordering follows the list). Clients would receive failures on updates for a short time, but such failures should already be handled by clients (because of host failures), so we should be okay. Sub shard failures must be handled so that we always end up with the shard range being available somewhere.
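
          For option 3, a minimal sketch of the two-phase switch, assuming hypothetical helpers (switchState, waitForStatePropagation) rather than the Overseer's real API:

          // Hypothetical two-phase state switch (fix option 3). The helpers are
          // placeholders that only make the ordering explicit; this is not Solr code.
          import java.util.List;
          import java.util.concurrent.TimeUnit;

          public class TwoPhaseShardSwitchSketch {

            enum ShardState { ACTIVE, INACTIVE }

            static void switchState(String shard, ShardState state) {
              System.out.println("publishing " + shard + " -> " + state);
            }

            // Stand-in for "wait until every live node has seen the new clusterstate.json".
            static void waitForStatePropagation() throws InterruptedException {
              TimeUnit.MILLISECONDS.sleep(100);
            }

            public static void main(String[] args) throws InterruptedException {
              String parent = "shard1";
              List<String> subShards = List.of("shard1_0", "shard1_1");

              // Phase 1: stop the parent from accepting updates. Clients see failures
              // during this window, but they already have to handle leader loss, so
              // they can simply retry.
              switchState(parent, ShardState.INACTIVE);
              waitForStatePropagation();

              // Phase 2: only now make the sub shards the live targets for their ranges.
              for (String sub : subShards) {
                switchState(sub, ShardState.ACTIVE);
              }
            }
          }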

          Thoughts? Yonik Seeley, Mark Miller, Anshum Gupta

          Yonik Seeley added a comment - edited

          Nice job tracking that down.

          Replicate to sub shard leader synchronously (before local update)

          This seems like the right fix (I had thought it was this way already). This should include returning failure to the client of course.

          Shalin Shekhar Mangar added a comment -

          Changes:

          1. New syncAdd and syncDelete methods in SolrCmdDistributor which add/delete synchronously and propagate exceptions
          2. DistributedUpdateProcessor calls cmdDistrib.syncAdd inside versionAdd because that's the only place where we have the version and the full doc and an opportunity to do remote synchronous add before local add
          3. Similarly cmdDistrib.syncDelete is called by versionDelete and doDeleteByQuery
          4. ShardSplitTest tests for delete-by-id

          With these changes, any exception while forwarding updates to sub shard leaders will result in an exception being thrown to the client. The client can then retry the operation.
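
          For illustration, a simplified contrast between the existing fire-and-forget forwarding and a syncAdd-style call that propagates failures (the SubShardClient interface and method shapes below are invented for the example; this is not the real SolrCmdDistributor API):

          import java.io.IOException;
          import java.util.List;
          import java.util.concurrent.CompletableFuture;

          public class SyncAddSketch {

            // Minimal stand-in for a client of a sub shard leader; add() throws if
            // the sub shard rejects the update.
            interface SubShardClient {
              void add(String docId) throws IOException;
            }

            // Fire-and-forget: the rejection can only be logged, because the request
            // that triggered it has already returned success to the client.
            static void asyncAdd(String docId, List<SubShardClient> subLeaders) {
              for (SubShardClient leader : subLeaders) {
                CompletableFuture.runAsync(() -> {
                  try {
                    leader.add(docId);
                  } catch (IOException e) {
                    System.err.println("forward failed (too late to tell the client): " + e.getMessage());
                  }
                });
              }
            }

            // syncAdd-style: forward before acknowledging, and let any failure propagate
            // so the client sees an error and can retry.
            static void syncAdd(String docId, List<SubShardClient> subLeaders) throws IOException {
              for (SubShardClient leader : subLeaders) {
                leader.add(docId);
              }
            }

            public static void main(String[] args) {
              SubShardClient rejecting = docId -> {
                throw new IOException("not in construction state, rejecting " + docId);
              };
              asyncAdd("169", List.of(rejecting));   // client never sees this failure
              try {
                syncAdd("169", List.of(rejecting));  // client gets the error and can retry
              } catch (IOException e) {
                System.out.println("client sees: " + e.getMessage());
              }
            }
          }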

          A code review would be helpful.

          Considering that without this fix, shard splitting can, in some cases, lead to data loss, we should add this to 4.3.1

          Anshum Gupta added a comment -

          Looks fine to me, other than one small change which I don't think is part of your patch but would be good to fix.

          DistributedUpdateProcessor.updateAdd(), line 404:

          if (isLeader) {
            params.set("distrib.from", ZkCoreNodeProps.getCoreUrl(
                zkController.getBaseUrl(), req.getCore().getName()));
          }

          params.set("distrib.from", ZkCoreNodeProps.getCoreUrl(
              zkController.getBaseUrl(), req.getCore().getName()));

          Yonik Seeley added a comment -

          Hmmm, can we get away with fixing this with a much less invasive change?
          What if we just send to sub-shards like we send to other replicas, and return a failure if the sub-shard fails (don't worry about trying to change the logic to send to a sub-shard before adding locally). The shard is going to become inactive anyway, so it shouldn't matter if we accidentally add a document locally that goes on to be rejected by the sub-shard, right?

          Shalin Shekhar Mangar added a comment -

          What if we just send to sub-shards like we send to other replicas, and return a failure if the sub-shard fails (don't worry about trying to change the logic to send to a sub-shard before adding locally). The shard is going to become inactive anyway, so it shouldn't matter if we accidentally add a document locally that goes on to be rejected by the sub-shard, right?

          What happens with partial updates in that case? Suppose an increment operation is requested which succeeds locally but is not propagated to the sub shard. If the client retries, the index will have wrong values.

          Yonik Seeley added a comment -

          What happens with partial updates in that case? Suppose an increment operation is requested which succeeds locally but is not propagated to the sub shard.

          If we're talking about failures due to the sub-shard already being active when it receives an update from the old shard that thinks it's still the leader, then I think we're fine. This isn't a new failure mode, just another way that the old shard can be out of date. For example, once a normal update is received by the new shard, the old shard will be out of date anyway.

          If the client retries, the index will have wrong values.

          If the client retries to the same old shard that is no longer the leader, then the update will fail again because the sub-shard will reject it again? We could perhaps return an error code suggesting that the client is using stale cluster state (i.e. re-read before trying the update again).

          Shalin Shekhar Mangar added a comment -

          If we're talking about failures due to the sub-shard already being active when it receives an update from the old shard who thinks it's still the leader, then I think we're fine.

          Yes, that's true. I was thinking of the general failure scenario but perhaps we can ignore it because both parent and sub shard leaders are on the same JVM?

          Yonik Seeley added a comment -

          Yes, that's true. I was thinking of the general failure scenario but perhaps we can ignore it because both parent and sub shard leaders are on the same JVM?

          Yeah, I think that's best for now. If it actually becomes an issue (which should be really rare), we could just cancel the split and maybe retry it from the start.

          Shalin Shekhar Mangar added a comment -

          Okay, I'll put up a patch.

          Shalin Shekhar Mangar added a comment -

          Changes:

          1. Add and delete requests are processed after local operations but before distributed operations

          This patch makes no changes to the versionAdd and versionDelete methods and is much less invasive.
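
          A sketch of that ordering with made-up method names (this is not the actual DistributedUpdateProcessor code): the local operation runs first, the forward to the sub shard leader runs next and propagates any rejection, and only then does the normal replica distribution proceed.

          import java.io.IOException;

          public class SplitForwardOrderingSketch {

            static void addLocally(String docId) {
              System.out.println("indexed " + docId + " on the parent shard leader");
            }

            static void forwardToSubShardLeader(String docId, boolean subShardRejects) throws IOException {
              if (subShardRejects) {
                // e.g. the sub shard is already ACTIVE and no longer in construction state
                throw new IOException("sub shard rejected " + docId);
              }
              System.out.println("forwarded " + docId + " to the sub shard leader");
            }

            static void distributeToReplicas(String docId) {
              System.out.println("distributed " + docId + " to replicas");
            }

            // Returns normally only if every step succeeds; a sub shard rejection is
            // surfaced to the client as an error so it can retry with fresh cluster state.
            static void processAdd(String docId, boolean subShardRejects) throws IOException {
              addLocally(docId);                                  // local operation first
              forwardToSubShardLeader(docId, subShardRejects);    // before replica distribution
              distributeToReplicas(docId);
            }

            public static void main(String[] args) {
              try {
                processAdd("296", true);
              } catch (IOException e) {
                System.out.println("client sees the failure: " + e.getMessage());
              }
            }
          }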

          Shalin Shekhar Mangar added a comment -

          Committed.

          trunk r1489138
          branch_4x r1489139
          lucene_solr_4_3 r1489141

          Hoss Man added a comment -

          This change completely broke almost every Solr test that doesn't use SolrCloud...

          org.apache.solr.common.SolrException: java.lang.NullPointerException
                  at __randomizedtesting.SeedInfo.seed([E1CD32BA68ADB059:42A275854FFF05B6]:0)
                  at org.apache.solr.util.TestHarness.update(TestHarness.java:271)
                  at org.apache.solr.util.BaseTestHarness.checkUpdateStatus(BaseTestHarness.java:261)
                  at org.apache.solr.util.BaseTestHarness.validateUpdate(BaseTestHarness.java:231)
                  at org.apache.solr.SolrTestCaseJ4.checkUpdateU(SolrTestCaseJ4.java:481)
                  at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:460)
                  at org.apache.solr.SolrTestCaseJ4.assertU(SolrTestCaseJ4.java:454)
                  at org.apache.solr.SolrTestCaseJ4.clearIndex(SolrTestCaseJ4.java:827)
                  at org.apache.solr.BasicFunctionalityTest.testDefaultFieldValues(BasicFunctionalityTest.java:623)
          ...
          Caused by: java.lang.NullPointerException
                  at
          org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteByQuery(DistributedUpdateProcessor.java:872)
                  at
          org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:774)
                  at org.apache.solr.update.processor.LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:121)
                  at org.apache.solr.handler.loader.XMLLoader.processDelete(XMLLoader.java:346)
                  at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:277)
                  at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
                  at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
                  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
                  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
                  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1843)
                  at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:131)
                  at org.apache.solr.util.TestHarness.update(TestHarness.java:267)
          

          Investigating.

          Hoss Man added a comment -

          Patch that seems to fix the NPE currently happening in 250+ Solr tests without causing the shard splitting test to break (still running the full suite to be certain).
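
          The attached 1 kB patch isn't reproduced in this issue, so the following is only a guess at the general shape of such a fix, assuming the NPE comes from cloud-only state being null in standalone mode; every name in it is hypothetical:

          import java.util.Collections;
          import java.util.List;

          public class NonCloudGuardSketch {

            static class UpdateContext {
              // null when Solr is not running in cloud mode (as in most unit tests)
              final List<String> subShardLeaderUrls;

              UpdateContext(List<String> subShardLeaderUrls) {
                this.subShardLeaderUrls = subShardLeaderUrls;
              }
            }

            static void doDeleteByQuery(UpdateContext ctx, String query) {
              System.out.println("deleting locally by query: " + query);

              // The guard: only attempt sub shard forwarding when the cloud-only state
              // actually exists. Without the null check, dereferencing
              // ctx.subShardLeaderUrls throws NullPointerException in non-cloud tests.
              if (ctx.subShardLeaderUrls != null && !ctx.subShardLeaderUrls.isEmpty()) {
                System.out.println("forwarding delete to " + ctx.subShardLeaderUrls);
              }
            }

            public static void main(String[] args) {
              doDeleteByQuery(new UpdateContext(null), "*:*");                    // standalone mode
              doDeleteByQuery(new UpdateContext(Collections.emptyList()), "*:*"); // cloud, no split in progress
            }
          }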

          Hoss Man added a comment -

          FWIW, the full test run on trunk seems to pass with my patch, so I've started committing and backporting.

          I have no idea if this fix is "correct", but it is certainly better than what we had.

          I'll be leaving this issue open for Shalin to review & correct as needed.

          Hoss Man added a comment -

          Committed revision 1489222.
          Committed revision 1489224.
          Committed revision 1489231.

          I have no idea if this fix is "correct", but it is certainly better than what we had.

          I'll be leaving this issue open for Shalin to review & correct as needed.

          Shalin Shekhar Mangar added a comment -

          Sorry for breaking the build. Lesson learned. I'm on a flight to Mumbai, will take a look once I land in a couple of hours.

          Thanks Hoss for fixing this.

          Yonik Seeley added a comment -

          Your changes look fine, Hoss.

          It's not clear to me why the forward to the sub shard needs to be synchronous in the original committed patch, but I guess that can always be revisited later as an optimization.

          Shalin Shekhar Mangar added a comment -

          It's not clear to me why the forward to subshard needs to be synchronous in the original committed patch, but I guess that can always be revisited later as an optimization.

          It does not need to be that way. The syncAdd and syncDelete methods seemed to be the most straightforward way to achieve the goal.

          Shalin Shekhar Mangar added a comment -

          Bulk close after 4.3.1 release


            People

            • Assignee:
              Shalin Shekhar Mangar
            • Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0
            • Watchers:
              4
