SOLR-9207
PeerSync recovery fails if number of updates requested is high

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.1, 6.0
    • Fix Version/s: 6.2, 7.0
    • Component/s: None
    • Labels: None

      Description

      PeerSync recovery fails if we request more than ~99K updates.

      This can happen if solrconfig is updated to retain more tlog entries in order to leverage SOLR-6359 (https://issues.apache.org/jira/browse/SOLR-6359).

      During our testing we found that recovery using PeerSync fails if we ask for more than ~99K updates, with the following error:

       WARN  PeerSync [RecoveryThread] - PeerSync: core=hold_shard1 url=<shardUrl>
      exception talking to <leaderUrl>, failed
      org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got application/xml. 
      <?xml version="1.0" encoding="UTF-8"?>
      <response>
      <lst name="error"><str name="msg">application/x-www-form-urlencoded content length (4761994 bytes) exceeds upload limit of 2048 KB</str><in
      t name="code">400</int></lst>
      </response>
      

      We arrived at ~99K with the following math (a quick sanity check in code follows the list):

      • max_version_number = Long.MAX_VALUE = 9223372036854775807
      • bytes per version number = 20 (the POST request sends each version number as a string)
      • 1 additional byte for the ',' separator
      • max_versions_in_single_request = 2MB/21 = ~99864
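
      That arithmetic as a standalone sketch (illustration only, not Solr code):

        // Long.MAX_VALUE has 19 digits, so a version is up to 20 characters with a
        // '-' sign, plus 1 byte for the ',' separator, against the 2048 KB limit.
        public class VersionPayloadMath {
          public static void main(String[] args) {
            long uploadLimitBytes = 2L * 1024 * 1024;                          // 2048 KB
            int bytesPerVersion = String.valueOf(Long.MIN_VALUE).length() + 1; // 20 chars + ','
            System.out.println(uploadLimitBytes / bytesPerVersion);            // prints 99864
          }
        }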

      I could think of 2 ways to fix it:

      1. Ask for updates in chunks of about 90K inside PeerSync.requestUpdates()

      2. Use application/octet-stream encoding

      Attachments

      1. SOLR-9207.patch
         17 kB
         Shalin Shekhar Mangar
      2. SOLR-9207.patch_updated
         16 kB
         Pushkar Raste
      3. SOLR-9207.patch
         17 kB
         Pushkar Raste

          Activity

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Pushkar. PeerSync doesn't stream so this is not surprising. Which solution have you implemented in the patch? A rough description would go a long way. Also, there are some unrelated changes in TSTLookup which don't belong here. A workaround would be to increase the default upload limit in Jetty.
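
          As a hedged illustration of that workaround: the "upload limit of 2048 KB" in the quoted error looks like Solr's form-data upload limit in solrconfig.xml rather than a raw Jetty setting (this mapping is an assumption), in which case it could be raised along these lines, with the value below purely illustrative:

            <!-- sketch only: raising the form-data upload limit in solrconfig.xml -->
            <requestDispatcher>
              <requestParsers enableRemoteStreaming="false"
                              multipartUploadLimitInKB="2048000"
                              formdataUploadLimitInKB="8192"/>
            </requestDispatcher>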

          praste Pushkar Raste added a comment -

          Here is a high-level description.

          PeerSync currently computes the versions the recovering node is missing and then sends all of those version numbers to a replica to get the corresponding updates. When a node under recovery is missing too many updates, the payload of getUpdates goes above 2MB and Jetty rejects the request. The problem can be solved using one of the following techniques:

          1. Increase the Jetty payload limit. This may solve the problem, but we would still be sending a lot of data over the network, which might not be needed.
          2. Stream versions to the replica while asking for updates.
          3. Request versions in chunks of about 90K versions at a time.
          4. gzip the versions and unzip them on the other side.
          5. Ask for updates using version ranges instead of sending individual versions.

          Approaches 1-3 require sending a lot of data over the wire.
          Approach #3 also requires making multiple calls. Additionally, #3 might not be feasible considering how the current code works, submitting requests to shardHandler and calling handleResponse.
          #4 may work, but looks a little inelegant.

          Hence I settled on approach #5 (suggested by Ramkumar). Here is how it works:

          • Let's say the replica has versions [1, 2, 3, 4, 5, 6] and the leader has versions [1, 2, 3, 4, 5, 6, 10, -11, 12, 13, 15, 18]
          • While recovering using the PeerSync strategy, the replica computes that the range it is missing is 10...18 (see the sketch after this list)
          • The replica now requests updates by specifying the range 10...18 instead of sending all the individual versions (namely 10, -11, 12, 13, 15, 18)
          • I have made using version ranges for PeerSync configurable, by introducing the following configuration section:
              <peerSync>
                <str name="useRangeVersions">${solr.peerSync.useRangeVersions:false}</str>
              </peerSync>

          • Further, I have kept it backwards compatible: a recovering node will use version ranges only if the node it asks for updates can process version ranges
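
          A toy sketch of the range computation described above (illustration only, not the actual PeerSync code; bounds are taken on absolute values because deletes carry negative versions):

            import java.util.ArrayList;
            import java.util.List;
            import java.util.Set;
            import java.util.TreeSet;

            public class VersionRangeSketch {
              public static void main(String[] args) {
                List<Long> replica = List.of(1L, 2L, 3L, 4L, 5L, 6L);
                List<Long> leader  = List.of(1L, 2L, 3L, 4L, 5L, 6L, 10L, -11L, 12L, 13L, 15L, 18L);

                Set<Long> have = new TreeSet<>(replica);
                List<Long> missing = new ArrayList<>();
                long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
                for (long v : leader) {
                  if (!have.contains(v)) {
                    missing.add(v);
                    lo = Math.min(lo, Math.abs(v));
                    hi = Math.max(hi, Math.abs(v));
                  }
                }
                // Old style: every missing version goes on the wire (grows with the gap).
                System.out.println("individual versions: " + missing);    // [10, -11, 12, 13, 15, 18]
                // Range style: a constant-size request, e.g. asking getUpdates for "10...18".
                System.out.println("version range: " + lo + "..." + hi);  // 10...18
              }
            }
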
          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Pushkar! The patch looks good to me except that the testing is not adequate. The one test modified by the patch is RecoveryAfterSoftCommitTest which is designed to never trigger PeerSync at all. I think we should enable useRangeVersions by default and randomly set it to false during tests for good coverage.
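
          A hypothetical sketch of that randomization, as it might look inside a test class extending SolrTestCaseJ4 (the class and method names here are made up; only the system property name comes from the <peerSync> snippet above):

            import org.apache.solr.SolrTestCaseJ4;
            import org.junit.AfterClass;
            import org.junit.BeforeClass;

            public class UseRangeVersionsRandomizationSketch extends SolrTestCaseJ4 {
              @BeforeClass
              public static void randomizeUseRangeVersions() {
                // Picked up by ${solr.peerSync.useRangeVersions:false} in solrconfig.xml.
                System.setProperty("solr.peerSync.useRangeVersions",
                    Boolean.toString(random().nextBoolean()));
              }

              @AfterClass
              public static void clearUseRangeVersions() {
                // Clear it so the randomized value does not leak into other tests.
                System.clearProperty("solr.peerSync.useRangeVersions");
              }
            }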

          praste Pushkar Raste added a comment -

          Thanks a lot Shalin.
          I will make the suggested change for randomized testing.

          praste Pushkar Raste added a comment -

          Shalin Shekhar Mangar - Please check the updated patch.

          shalinmangar Shalin Shekhar Mangar added a comment - - edited

          Changes:

          1. The value for useRangeVersions set in solrconfig.xml wasn't being read at all, because it was written with the element name 'str' but read back as 'useRangeVersions'. I changed the element name in the configuration to useRangeVersions to make it work.
          2. The value for useRangeVersions should be in EditableSolrConfigAttributes.json so that it can be changed via the config API (a hedged example follows this list).
          3. Similarly, useRangeVersions should be returned in SolrConfig.toMap so that its value is returned by the config API.
          4. The system property set in SolrTestCaseJ4 for useRangeVersions should be cleared in the tearDown method.
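
          A hedged example of what that Config API edit might look like once the attribute is editable (the property path "peerSync.useRangeVersions" is an assumption based on the <peerSync> element above, and the collection name is a placeholder):

            curl http://localhost:8983/solr/mycollection/config \
              -H 'Content-type:application/json' \
              -d '{"set-property": {"peerSync.useRangeVersions": false}}'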

          I'll run precommit + tests and commit if there are no surprises.

          jira-bot ASF subversion and git services added a comment -

          Commit 380c5a6b9727beabb8ccce04add7e8e7089aa798 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=380c5a6 ]

          SOLR-9207: PeerSync recovery failes if number of updates requested is high. A new useRangeVersions config option is introduced (defaults to true) to send version ranges instead of individual versions for peer sync.

          jira-bot ASF subversion and git services added a comment -

          Commit a942de68fc34602ad0640a2726fd3dd240352357 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a942de6 ]

          SOLR-9207: PeerSync recovery failes if number of updates requested is high. A new useRangeVersions config option is introduced (defaults to true) to send version ranges instead of individual versions for peer sync.
          (cherry picked from commit 380c5a6)

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Pushkar!

          mikemccand Michael McCandless added a comment -

          Bulk close resolved issues after 6.2.0 release.


            People

            • Assignee: shalinmangar Shalin Shekhar Mangar
            • Reporter: praste Pushkar Raste
            • Votes: 0
            • Watchers: 6
