Solr / SOLR-8738

invalid DBQ initially sent to a non-leader node will report success

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5.1, 6.0
    • Component/s: None
    • Labels: None

      Description

      Discovered this while working on SOLR-445.

      If a Delete By Query is sent to a node that is not hosting a leader (i.e. it only hosts replicas, or doesn't host any cores related to the specified collection), then success is reported even if the DBQ is completely malformed and actually failed.

      Attachments

      1. SOLR-8738.patch
        12 kB
        Hoss Man
      2. SOLR-8738.patch
        13 kB
        Hoss Man
      3. SOLR-8738.patch
        12 kB
        Hoss Man

        Issue Links

          Activity

          hossman Hoss Man added a comment -

          The most trivial/obvious way to reproduce is...

          • bin/solr -e cloud
          • pick "3" for the number of nodes
          • accept the default port numbers (8983, 7574, 8984)
          • accept the default collection name (gettingstarted)
          • pick "1" for the number of shards
          • accept the default number of replicas per shard (2)
          • accept the default config set (data_driven_schema_configs)

          (So now you should have a single collection with a single shard and 2 replicas on 2 different nodes, and the remaining node doesn't host any cores related to the collection.)

          Now try running a broken DBQ against all 3 nodes...

          $ curl -H 'Content-Type: application/json' 'http://127.0.1.1:8983/solr/gettingstarted/update' --data-binary '{"delete":{"query" : "foo_i:yak"}}'
          {"responseHeader":{"status":400,"QTime":18},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Invalid Number: yak","code":400}}
          $ curl -H 'Content-Type: application/json' 'http://127.0.1.1:8984/solr/gettingstarted/update' --data-binary '{"delete":{"query" : "foo_i:yak"}}'
          {"responseHeader":{"status":0,"QTime":25}}
          $ curl -H 'Content-Type: application/json' 'http://127.0.1.1:7574/solr/gettingstarted/update' --data-binary '{"delete":{"query" : "foo_i:yak"}}'
          {"responseHeader":{"status":0,"QTime":7}}
          

          ...only the node hosting the leader correctly responds with the error; requests that initially hit nodes that only host replicas, or that host no cores at all, incorrectly indicate that the delete succeeded.
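To make the difference between the responses above concrete, here is a minimal client-side sketch (not part of Solr or SolrJ; the class name and regex-based extraction are hypothetical) that pulls responseHeader.status out of an update response. Applied to the transcript above, the leader's response (status 400) reads as an error, while the buggy replies from the other nodes (status 0) read as success even though the delete never happened.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the numeric "status" from a Solr update
// response's responseHeader and treats anything non-zero as an error.
// A real client would use SolrJ or a proper JSON parser instead of a regex.
class UpdateStatusCheck {
    private static final Pattern STATUS =
        Pattern.compile("\"responseHeader\"\\s*:\\s*\\{\\s*\"status\"\\s*:\\s*(\\d+)");

    static int status(String json) {
        Matcher m = STATUS.matcher(json);
        if (!m.find()) throw new IllegalArgumentException("no responseHeader.status");
        return Integer.parseInt(m.group(1));
    }

    static boolean isError(String json) {
        return status(json) != 0;
    }
}
```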


          Two important notes:

          1. This can also be reproduced using numShards > 1, most easily by running -e cloud, choosing 4 nodes, and accepting the defaults of 2 shards and 2 replicas. Then repeat the same curl commands above over all 4 ports.
            • you should see 2 nodes correctly return failures, and 2 nodes incorrectly claim success
          2. You can also reproduce this using -e cloud -noprompt, but since that defaults to only 2 nodes, each node is guaranteed to host a leader, so you have to be more explicit about the requests.
            • Use the Solr UI to determine the non-leader core_node_names (ex: gettingstarted_shard1_replica1) and which node they are located on, then use those in the URL instead of the simple collection name (otherwise simple collection paths will be auto-routed to a leader on each node)
          hossman Hoss Man added a comment -

          Here's a MiniSolrCloudCluster-based test patch I hacked out of the existing SOLR-445 work, demonstrating the problem.

          Still tracing through to try and figure out why the errors aren't getting propagated.

          hossman Hoss Man added a comment -

          The crux of why the errors are silently ignored seems to be that DUP.doFinish() only logs a WARN for, but does not propagate, any error returned by cmdDistrib.getErrors(), unless the type of Node involved in the request was a RetryNode. The reason given for this being...

          // else
          // for now we don't error - we assume if it was added locally, we
          // succeeded 
          

          The problem, apparently, is that when a DBQ is forwarded to all leaders, StdNode is (currently) used; but no local operation was executed, only the forward to the leaders, so there is no local success/failure.
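A deliberately simplified, self-contained model of the doFinish() behavior described above (the types below are stand-ins, not the real DistributedUpdateProcessor / SolrCmdDistributor classes): errors whose Node is a RetryNode are propagated, everything else is merely logged and swallowed. In this model, an "Invalid Number: yak" error attached to a StdNode (the pre-patch DBQ forwarding) vanishes, while the same error attached to a RetryNode surfaces.

```java
import java.util.List;

// Simplified stand-in types; the real ones live in SolrCmdDistributor.
class Node {}
class StdNode extends Node {}
class RetryNode extends StdNode {}

// One entry of cmdDistrib.getErrors() in this model.
class DistribError {
    final Node node;
    final String msg;
    DistribError(Node node, String msg) { this.node = node; this.msg = msg; }
}

class DoFinishModel {
    // Mimics the error handling described above: return the first error that
    // would be propagated to the client, or null if every error is swallowed.
    static String propagatedError(List<DistribError> errors) {
        for (DistribError e : errors) {
            if (e.node instanceof RetryNode) {
                return e.msg; // error from a forward-to-leader request: propagate
            }
            // else: "for now we don't error - we assume if it was added
            // locally, we succeeded" (only a WARN is logged)
        }
        return null;
    }
}
```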


          The attached patch changes the DBQ propagation logic to use RetryNode. I'm still running full tests, but at a minimum it makes the new TestCloudDeleteByQuery in the patch start passing.

          I don't fully understand the entire ramifications of this change, particularly as it relates to the rest of the code in DUP.doFinish and things like forcing leader recovery, but based on the comments on the StdNode / RetryNode classes and the other uses of StdNode / RetryNode (notably: STD->replica vs RETRY->leader) this seems like the most correct fix in general.

          hossman Hoss Man added a comment -

          Updated patch to refactor test to take advantage of SolrCloudTestCase added by SOLR-8758.

          Still need someone who actually understands this code to review the non-test changes.

          noble.paul Noble Paul added a comment -

          It is easy to understand why and how this fix works.
          But it's kinda hard to understand why DUP is written to look at the class name to propagate the errors.

          +1 to commit

          jira-bot ASF subversion and git services added a comment -

          Commit ff6557cbcb5308d60f17114de1d0ad29003e9668 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter (Unused)
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff6557c ]

          SOLR-8738: Fixed false success response when invalid deleteByQuery requests intially hit non-leader cloud nodes

          jira-bot ASF subversion and git services added a comment -

          Commit 2401c9495319e1b5065b05ef3a36ee586f06b6d4 in lucene-solr's branch refs/heads/jira/SOLR-445 from Chris Hostetter (Unused)
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2401c94 ]

          SOLR-445 Merge branch 'master' into jira/SOLR-445 (pick up SOLR-8738 changes)

          jira-bot ASF subversion and git services added a comment -

          Commit ff6557cbcb5308d60f17114de1d0ad29003e9668 in lucene-solr's branch refs/heads/master from Chris Hostetter (Unused)
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ff6557c ]

          SOLR-8738: Fixed false success response when invalid deleteByQuery requests intially hit non-leader cloud nodes

          anshumg Anshum Gupta added a comment -

          Reopening for 5.5.1

          jira-bot ASF subversion and git services added a comment -

          Commit 9e77319abcf3ff372b86cec4d66bac11f7e038b6 in lucene-solr's branch refs/heads/branch_5x from Chris Hostetter (Unused)
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9e77319 ]

          SOLR-8738: Fixed false success response when invalid deleteByQuery requests intially hit non-leader cloud nodes

          jira-bot ASF subversion and git services added a comment -

          Commit 66d3c2eb0a1e7b28621557c87c8d5b5219a95add in lucene-solr's branch refs/heads/branch_5_5 from Chris Hostetter (Unused)
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=66d3c2e ]

          SOLR-8738: Fixed false success response when invalid deleteByQuery requests intially hit non-leader cloud nodes


            People

            • Assignee:
              hossman Hoss Man
            • Reporter:
              hossman Hoss Man
            • Votes:
              0
            • Watchers:
              3
