Solr
  1. Solr
  2. SOLR-5650

When mixing adds and deletes, it appears there is a corner case where peersync can bring back a deleted update.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: SolrCloud
    • Labels:
      None
    1. solr.log.tar.gz
      226 kB
      Mark Miller
    2. SOLR-5650.patch
      8 kB
      Mark Miller
    3. SOLR-5650.patch
      5 kB
      Mark Miller

      Issue Links

        Activity

        Hide
        Mark Miller added a comment -
           [junit4]   2> 67110 T26 C64 P38462 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin&CONTROL=TRUE}
           [junit4]   2> 67111 T26 C64 P38462 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&CONTROL=TRUE} {add=[0-221 (1457809507133423616)]} 0 2
           [junit4]   2> 67114 T67 C65 P58989 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin}
           [junit4]   2> 67115 T67 C65 P58989 oasu.SolrCmdDistributor.submit DEBUG sending update to http://127.0.0.1:58081/collection1/ retry:0 add{_version_=1457809507137617920,id=0-221} params:update.distrib=FROMLEADER&distrib.from=http%3A%2F%2F127.0.0.1%3A58989%2Fcollection1%2F
           [junit4]   2> 67117 T141 C66 P58081 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin&distrib.from=http://127.0.0.1:58989/collection1/&update.distrib=FROMLEADER}
           [junit4]   2> 67119 T141 C66 P58081 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&distrib.from=http://127.0.0.1:58989/collection1/&update.distrib=FROMLEADER} {add=[0-221 (1457809507137617920)]} 0 2
           [junit4]   2> 67119 T67 C65 P58989 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin} {add=[0-221 (1457809507137617920)]} 0 5
           [junit4]   2> 71842 T26 C83 P38462 oasup.LogUpdateProcessor.processDelete DEBUG PRE_UPDATE delete{,id=0-221,commitWithin=-1} {version=2&wt=javabin&CONTROL=TRUE}
           [junit4]   2> 71843 T26 C83 P38462 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&CONTROL=TRUE} {delete=[0-221 (-1457809512096333824)]} 0 1
           [junit4]   2> 71847 T141 C82 P58081 oasup.LogUpdateProcessor.processDelete DEBUG PRE_UPDATE delete{,id=0-221,commitWithin=-1} {version=2&wt=javabin}
           [junit4]   2> 71848 T141 C82 P58081 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin} {delete=[0-221 (-1457809512100528128)]} 0 2
           [junit4]   2> 92831 T241 C393 P53946 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{flags=c,_version_=1457809507137617920,id=0-221} LocalSolrQueryRequest{update.distrib=FROMLEADER&peersync=true}
           [junit4]   2> 98890 T356 C441 P58081 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{flags=c,_version_=1457809507137617920,id=0-221} LocalSolrQueryRequest{update.distrib=FROMLEADER&peersync=true}
           [junit4]   2> 98901 T356 C441 P58081 oasup.LogUpdateProcessor.finish [collection1] {add=[0-209 (1457809502219796480), 0-217 (1457809503008325632), 0-219 (1457809503204409344), 0-221 (1457809507137617920), 0-223 (1457809507365158912), 0-225 (1457809507605282816), 0-410 (1457809535876988928), 0-414 (1457809536131792896), 0-416 (1457809536277544960), 0-418 (1457809536412811264), ... (14 adds)],delete=[0-194 (-1457809502331994112), ft1-107 (-1457809502356111360), ft1-108 (-1457809502633984000), 0-196 (-1457809502648664064), 0-206 (-1457809502995742720), ft1-112 (-1457809503185534976), 0-209 (-1457809507200532480), 0-217 (-1457809507535028224), 0-219 (-1457809507773054976), 0-408 (-1457809535916834816), ... (14 deletes)]} 0 19
           [junit4]   2> ###### Only in http://127.0.0.1:53946/collection1: [{_version_=1457809507137617920, id=0-221}, {_version_=1457809507605282816, id=0-225}, {_version_=1457809507365158912, id=0-223}]
        
        
        Show
        Mark Miller added a comment - [junit4] 2> 67110 T26 C64 P38462 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin&CONTROL=TRUE} [junit4] 2> 67111 T26 C64 P38462 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&CONTROL=TRUE} {add=[0-221 (1457809507133423616)]} 0 2 [junit4] 2> 67114 T67 C65 P58989 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin} [junit4] 2> 67115 T67 C65 P58989 oasu.SolrCmdDistributor.submit DEBUG sending update to http://127.0.0.1:58081/collection1/ retry:0 add{_version_=1457809507137617920,id=0-221} params:update.distrib=FROMLEADER&distrib.from=http%3A%2F%2F127.0.0.1%3A58989%2Fcollection1%2F [junit4] 2> 67117 T141 C66 P58081 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{,id=0-221} {version=2&wt=javabin&distrib.from=http://127.0.0.1:58989/collection1/&update.distrib=FROMLEADER} [junit4] 2> 67119 T141 C66 P58081 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&distrib.from=http://127.0.0.1:58989/collection1/&update.distrib=FROMLEADER} {add=[0-221 (1457809507137617920)]} 0 2 [junit4] 2> 67119 T67 C65 P58989 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin} {add=[0-221 (1457809507137617920)]} 0 5 [junit4] 2> 71842 T26 C83 P38462 oasup.LogUpdateProcessor.processDelete DEBUG PRE_UPDATE delete{,id=0-221,commitWithin=-1} {version=2&wt=javabin&CONTROL=TRUE} [junit4] 2> 71843 T26 C83 P38462 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin&CONTROL=TRUE} {delete=[0-221 (-1457809512096333824)]} 0 1 [junit4] 2> 71847 T141 C82 P58081 oasup.LogUpdateProcessor.processDelete DEBUG PRE_UPDATE delete{,id=0-221,commitWithin=-1} {version=2&wt=javabin} [junit4] 2> 71848 T141 C82 P58081 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={version=2&wt=javabin} {delete=[0-221 (-1457809512100528128)]} 0 2 [junit4] 2> 92831 T241 C393 P53946 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{flags=c,_version_=1457809507137617920,id=0-221} LocalSolrQueryRequest{update.distrib=FROMLEADER&peersync=true} [junit4] 2> 98890 T356 C441 P58081 oasup.LogUpdateProcessor.processAdd DEBUG PRE_UPDATE add{flags=c,_version_=1457809507137617920,id=0-221} LocalSolrQueryRequest{update.distrib=FROMLEADER&peersync=true} [junit4] 2> 98901 T356 C441 P58081 oasup.LogUpdateProcessor.finish [collection1] {add=[0-209 (1457809502219796480), 0-217 (1457809503008325632), 0-219 (1457809503204409344), 0-221 (1457809507137617920), 0-223 (1457809507365158912), 0-225 (1457809507605282816), 0-410 (1457809535876988928), 0-414 (1457809536131792896), 0-416 (1457809536277544960), 0-418 (1457809536412811264), ... (14 adds)],delete=[0-194 (-1457809502331994112), ft1-107 (-1457809502356111360), ft1-108 (-1457809502633984000), 0-196 (-1457809502648664064), 0-206 (-1457809502995742720), ft1-112 (-1457809503185534976), 0-209 (-1457809507200532480), 0-217 (-1457809507535028224), 0-219 (-1457809507773054976), 0-408 (-1457809535916834816), ... (14 deletes)]} 0 19 [junit4] 2> ###### Only in http://127.0.0.1:53946/collection1: [{_version_=1457809507137617920, id=0-221}, {_version_=1457809507605282816, id=0-225}, {_version_=1457809507365158912, id=0-223}]
        Hide
        Mark Miller added a comment -

        Got this fail after running the chaosmonkey tests for about 2 days on a continuous loop. The doc is brought back by peersync, but not too all of the replicas in the shard either.

        Show
        Mark Miller added a comment - Got this fail after running the chaosmonkey tests for about 2 days on a continuous loop. The doc is brought back by peersync, but not too all of the replicas in the shard either.
        Hide
        Mark Miller added a comment -

        One good improvement:

        For the case were the leader is a replacement, only peer sync with those nodes that have last published ACTIVE.

        Show
        Mark Miller added a comment - One good improvement: For the case were the leader is a replacement, only peer sync with those nodes that have last published ACTIVE.
        Hide
        Mark Miller added a comment -

        You will still have the possibility of bringing back documents on startup by syncing with a replica that is out of date by less than 100 updates on cluster startup. The number of recent docs that would come back is fairly small though, and the chances of it happening pretty slim. We may be able to do more around that as well.

        Show
        Mark Miller added a comment - You will still have the possibility of bringing back documents on startup by syncing with a replica that is out of date by less than 100 updates on cluster startup. The number of recent docs that would come back is fairly small though, and the chances of it happening pretty slim. We may be able to do more around that as well.
        Hide
        ASF subversion and git services added a comment -

        Commit 1560221 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1560221 ]

        SOLR-5650: When a replica becomes a leader, only peer sync with other replicas that last published an ACTIVE state.

        Show
        ASF subversion and git services added a comment - Commit 1560221 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1560221 ] SOLR-5650 : When a replica becomes a leader, only peer sync with other replicas that last published an ACTIVE state.
        Hide
        ASF subversion and git services added a comment -

        Commit 1560223 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1560223 ]

        SOLR-5650: When a replica becomes a leader, only peer sync with other replicas that last published an ACTIVE state.

        Show
        ASF subversion and git services added a comment - Commit 1560223 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1560223 ] SOLR-5650 : When a replica becomes a leader, only peer sync with other replicas that last published an ACTIVE state.
        Hide
        Mark Miller added a comment -

        Just saw another fail where the control had a doc the cluster did not. This issue may have helped hide it previously, don't remember seeing it before. Actually looks like a legit fail. Looks like a single leader in a shard accepted an update and was killed. The update is gone, and so the loss makes sense in the current system. You need support for the param that tells how many replicas to the update must get to for success to protect against this.

        Show
        Mark Miller added a comment - Just saw another fail where the control had a doc the cluster did not. This issue may have helped hide it previously, don't remember seeing it before. Actually looks like a legit fail. Looks like a single leader in a shard accepted an update and was killed. The update is gone, and so the loss makes sense in the current system. You need support for the param that tells how many replicas to the update must get to for success to protect against this.
        Hide
        ASF subversion and git services added a comment -

        Commit 1595547 from Ryan Ernst in branch 'dev/branches/lucene5650'
        [ https://svn.apache.org/r1595547 ]

        SOLR-5650: add changes entries

        Show
        ASF subversion and git services added a comment - Commit 1595547 from Ryan Ernst in branch 'dev/branches/lucene5650' [ https://svn.apache.org/r1595547 ] SOLR-5650 : add changes entries

          People

          • Assignee:
            Mark Miller
            Reporter:
            Mark Miller
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development