Solr / SOLR-10114

Reordered delete-by-query can delete or omit child documents

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.5
    • Fix Version/s: 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      It looks like when a block of documents is indexed, child documents get no _version_ field. This means (among other potential issues) that a delete-by-query that is reordered will cause matching child documents to be deleted. DBQ normally prevents deleting newer docs by including a restriction on _version_, which doesn't work for anything lacking that field. Re-ordered delete-by-term of any child docs would also be affected (although that should be a much rarer issue).
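      As a rough illustration of why a missing _version_ defeats the restriction, here is a small in-memory Java model. It assumes the guard behaves like "delete what the DBQ matches, except docs whose _version_ is at least |dbqVersion|", so a doc without the field can never be protected. ReorderedDbqModel, Doc and applyReorderedDbq are hypothetical names; nothing here touches Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ReorderedDbqModel {

    /** Hypothetical stand-in for an indexed doc; version is null when _version_ is absent. */
    static final class Doc {
        final String id;
        final Long version;

        Doc(String id, Long version) {
            this.id = id;
            this.version = version;
        }
    }

    /**
     * Applies a delete-by-query that reached the replica late. Assumption: the guard is an
     * exclusion ("do not delete docs with _version_ >= |dbqVersion|"), so a doc with no
     * _version_ is never excluded and is deleted whenever the query matches it.
     */
    static List<Doc> applyReorderedDbq(List<Doc> index, Predicate<Doc> query, long dbqVersion) {
        long threshold = Math.abs(dbqVersion); // delete versions may be stored as negatives
        List<Doc> survivors = new ArrayList<>();
        for (Doc d : index) {
            boolean protectedByVersion = d.version != null && d.version >= threshold;
            if (!query.test(d) || protectedByVersion) {
                survivors.add(d);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // Block indexed at version 200, but the child has no _version_ (the pre-fix behavior).
        List<Doc> index = Arrays.asList(new Doc("parent", 200L), new Doc("child", null));
        // A match-all DBQ with the older version 150 arrives after the block.
        for (Doc survivor : applyReorderedDbq(index, doc -> true, 150L)) {
            System.out.println(survivor.id); // prints only "parent"
        }
    }
}
```

      The parent survives because its version beats the DBQ's; the child is removed even though it was written together with that parent.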

      The leading candidate for a fix is to use the exact same _version_ for the parent and all of its child docs.
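      A minimal sketch of that idea, assuming a block is flattened into a children-first list before indexing (the parent-last order Lucene block indexing expects) and that every doc in the block gets the top-level parent's id as _root_. InputDoc, flatten and collect are hypothetical names, not Solr's AddUpdateCommand API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BlockVersioning {

    /** Hypothetical input document: a field map plus nested child documents. */
    static final class InputDoc {
        final Map<String, Object> fields = new LinkedHashMap<>();
        final List<InputDoc> children = new ArrayList<>();

        InputDoc set(String name, Object value) {
            fields.put(name, value);
            return this;
        }

        InputDoc child(InputDoc c) {
            children.add(c);
            return this;
        }
    }

    /** Flattens a block children-first and stamps every doc with the parent's _root_ and _version_. */
    static List<Map<String, Object>> flatten(InputDoc parent) {
        Object root = parent.fields.get("id");
        Object version = parent.fields.get("_version_"); // may be null outside SolrCloud
        List<Map<String, Object>> out = new ArrayList<>();
        collect(parent, root, version, out);
        return out;
    }

    private static void collect(InputDoc doc, Object root, Object version, List<Map<String, Object>> out) {
        for (InputDoc c : doc.children) {
            collect(c, root, version, out); // descendants go before their own parent
        }
        Map<String, Object> flat = new LinkedHashMap<>(doc.fields);
        flat.put("_root_", root);
        if (version != null) {
            flat.put("_version_", version); // one version for the whole block
        }
        out.add(flat);
    }

    public static void main(String[] args) {
        InputDoc block = new InputDoc().set("id", "p1").set("_version_", 200L)
                .child(new InputDoc().set("id", "c1"))
                .child(new InputDoc().set("id", "c2"));
        for (Map<String, Object> d : flatten(block)) {
            System.out.println(d);
        }
    }
}
```

      Copying the version only when the parent has one keeps the sketch usable when _version_ is not configured.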

      1. SOLR-10114.patch
        12 kB
        Mano Kovacs
      2. SOLR-10114.patch
        12 kB
        Mano Kovacs
      3. SOLR-10114-2.patch
        12 kB
        Mano Kovacs
      4. SOLR-10114-2.patch
        13 kB
        Mano Kovacs
      5. SOLR-10114-3.patch
        8 kB
        Mano Kovacs
      6. SOLR-10114-test-cleanup.patch
        1 kB
        Mano Kovacs
      7. SOLR-10114-validation.patch
        8 kB
        Mano Kovacs

        Activity

        manokovacs Mano Kovacs added a comment -

        Yonik Seeley, confirmed by running parallel inserts/updates and deletes of documents with child documents. The replicas eventually get out of sync.

        yseeley@gmail.com Yonik Seeley added a comment -

        Great!
        A simple non-concurrent way to reproduce this would be to fake reordering (by faking updates from a leader).
        The lowest-level examples of this are in TestRecovery.testLogReplayWithReorderedDBQ, or at slightly higher levels in PeerSyncTest and others (look for users of DistribPhase.FROMLEADER for more examples).

        manokovacs Mano Kovacs added a comment -

        Yonik Seeley, thanks for the hints; they make it much easier to test. I am preparing the tests first and will then make them pass with a fix. I was wondering: if the fix is to store the version with the child docs, it requires reindexing to resolve the issue. I was thinking of adding another iteration to fetch parent version for childdocs without version. It might have a significant performance impact on DBQ, though.

        yseeley@gmail.com Yonik Seeley added a comment -

        if the fix is to store the version with the child docs, it requires reindexing to resolve the issue.

        Right, this won't fix old indexes.

        I was thinking of adding another iteration to fetch parent version for childdocs without version.

        That seems difficult, unless we just assume that any doc w/o a version is a child doc.
        Also, another thing to watch out for is that the version field is technically not mandatory outside SolrCloud. The presence of a _root_ field could be used to further determine whether a doc is a child doc, but that may be expensive too.
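        A sketch of the decision logic being discussed, for indexes written before the fix. It assumes every doc of a block carries _root_ set to the top-level parent's id, so only children have a _root_ that differs from their own id; because a missing _version_ alone is ambiguous outside SolrCloud, both conditions are required. ChildDocHeuristic and isLikelyLegacyChild are hypothetical, and the sketch ignores the real cost concern: reading _root_ for every candidate doc during a DBQ.

```java
import java.util.HashMap;
import java.util.Map;

final class ChildDocHeuristic {

    /**
     * Guesses whether a doc in a pre-fix index is a child doc: it has no _version_ AND its
     * _root_ points at some other doc (assumption: parents carry _root_ equal to their own id).
     */
    static boolean isLikelyLegacyChild(Map<String, Object> doc) {
        boolean hasVersion = doc.get("_version_") != null;
        Object root = doc.get("_root_");
        boolean rootDiffersFromId = root != null && !root.equals(doc.get("id"));
        return !hasVersion && rootDiffersFromId;
    }

    public static void main(String[] args) {
        Map<String, Object> child = new HashMap<>();
        child.put("id", "c1");
        child.put("_root_", "p1");
        Map<String, Object> standalone = new HashMap<>();
        standalone.put("id", "d1"); // no _version_ and no _root_, e.g. a non-SolrCloud doc
        System.out.println(isLikelyLegacyChild(child));      // true
        System.out.println(isLikelyLegacyChild(standalone)); // false
    }
}
```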

        mdrob Mike Drob added a comment -

        I think it makes sense to split the fix into two parts: one patch to take care of future indices and a separate fix for existing indices, especially if one half is much easier and can be done significantly faster.

        manokovacs Mano Kovacs added a comment -

        Makes sense, thank you. I'll go with the fix for new indexes now.

        manokovacs Mano Kovacs added a comment - - edited

        It seems that child docs are not being inserted at all if there is any reordered DBQ. On reordering, neither the insertion of child documents nor the deletion of the previous children during an update is executed. I created new test cases for those too. Is it OK if I add the fixes all together and we extend the title of this JIRA? Suggestion: "Reordered delete-by-query causes inconsistency in child documents".

        manokovacs Mano Kovacs added a comment -

        Adding SOLR-10114-validation.patch with 4 new tests, 3 of which currently fail. This is not the actual fix, just a reproduction of the incorrect behaviors. Uploading the fix shortly.

        manokovacs Mano Kovacs added a comment -

        Adding a patch with:

        • a fix that adds the version to child docs, if there is one
        • a fix that uses the same insert-or-update logic when handling reordered DBQs
        • recovery and PeerSync tests.
        mdrob Mike Drob added a comment -

        Use existing ThrowingRunnable instead of new RunnableWithException

        We might possibly want to hide this new functionality behind a version check? Does the patch apply relatively easily to 6.5 as well?

        Can you help me understand the full scope of the problem here - child docs are only in danger of spurious delete until the next commit point, right? So if they make it to disk, even though they don't have versions, they are still safe from disappearing in the future.

        manokovacs Mano Kovacs added a comment -

        Thank you, Mike Drob; I did not know about ThrowingRunnable.

        We might possibly want to hide this new functionality behind a version check? Does the patch apply relatively easily to 6.5 as well?

        The patch relies on some changes from SOLR-5944, which AFAIK will be backported too; however, I can create a 6.x patch as well.

        Can you help me understand the full scope of the problem here - child docs are only in danger of spurious delete until the next commit point, right?

        A reordered DBQ can happen when an update with an earlier version reaches a replica after a DBQ with a later version, or vice versa. Solr handles the two cases as follows:

        • If a DBQ arrives with a lower version than the latest updates, it gets an additional version filter that protects documents that were added earlier but carry a higher version.
          • If the DBQ is not restricted by ID (or something similarly limiting) but is, for example, a range or match-all query, it deletes child docs that were added with a higher-versioned parent doc. This is what the JIRA was originally about, and testLogReplayWithReorderedDBQByAsterixAndChildDocs tests this case.
        • If an update arrives with a lower version than the latest DBQs, DirectUpdateHandler2 takes an add-and-delete path, where the DBQs that arrived earlier with higher versions are replayed after the update.
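        The two cases can be modeled in a few lines to show that, once every doc carries a version, both arrival orders converge on the same state. This is a simplified sketch, not DirectUpdateHandler2: it assumes each id is added only once, uses positive versions for deletes, and reduces a query to a predicate on the id. ReplicaReorderModel and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ReplicaReorderModel {

    /** A received delete-by-query: its version and which ids it matches. */
    static final class Dbq {
        final long version;
        final Predicate<String> matches;

        Dbq(long version, Predicate<String> matches) {
            this.version = version;
            this.matches = matches;
        }
    }

    final Map<String, Long> docs = new HashMap<>(); // id -> _version_
    final List<Dbq> seenDbqs = new ArrayList<>();

    /** Case 1: the DBQ may be older than docs already indexed; the version guard keeps those. */
    void deleteByQuery(long version, Predicate<String> matches) {
        docs.entrySet().removeIf(e -> matches.test(e.getKey()) && e.getValue() < version);
        seenDbqs.add(new Dbq(version, matches));
    }

    /** Case 2: the add may be older than DBQs already seen; add it, then replay the newer DBQs. */
    void add(String id, long version) {
        docs.put(id, version);
        for (Dbq q : seenDbqs) {
            if (q.version > version && q.matches.test(id)) {
                docs.remove(id);
            }
        }
    }

    public static void main(String[] args) {
        // Leader order: add A (v1), match-all DBQ (v3), add B (v5).
        ReplicaReorderModel lateDbq = new ReplicaReorderModel();
        lateDbq.add("A", 1);
        lateDbq.add("B", 5);
        lateDbq.deleteByQuery(3, id -> true);

        ReplicaReorderModel lateAdd = new ReplicaReorderModel();
        lateAdd.deleteByQuery(3, id -> true);
        lateAdd.add("B", 5);
        lateAdd.add("A", 1);

        System.out.println(lateDbq.docs + " " + lateAdd.docs); // {B=5} {B=5}
    }
}
```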

        Now, doNormalUpdate(cmd) checks whether the document is a block document (has children) and behaves differently in two main ways:

        • It calls updateDocuments (plural), which accepts an Iterable and inserts every child document.
        • It builds the idTerm from the _root_ field instead of the id field, so before adding the document, Lucene deletes the old parent AND its child documents.

        On the other hand, addAndDelete() did not do any differentiation for block docs, so child docs were ignored during inserts and overwrites.
        In short, any reordered DBQ caused:

        • Loss of child docs when a new document was inserted (testLogReplayWithReorderedDBQInsertingChildnodes)
        • Child docs left untouched on update. This caused inconsistent numDocs across replicas when the update contained a different number of child docs. (testLogReplayWithReorderedDBQUpdateWithDifferentChildCount)

        In other words, child docs were dropped from replication whenever there was a reordered DBQ.
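        The numDocs divergence can be reproduced with a toy index. In this hypothetical model, updateBlock mimics the block-aware behavior described for doNormalUpdate (delete by _root_, re-add children and then the parent), while updateParentOnly mimics the pre-fix add-and-delete path (keyed by id, children ignored). Neither method is Solr code; they only mirror the two bullet lists above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlockUpdateModel {

    /** Hypothetical flattened doc with just the fields the sketch needs. */
    static final class Doc {
        final String id;
        final String root;

        Doc(String id, String root) {
            this.id = id;
            this.root = root;
        }
    }

    final List<Doc> index = new ArrayList<>();

    /** Block-aware path: delete by _root_ (old parent AND old children), then re-add the block. */
    void updateBlock(String parentId, List<String> childIds) {
        index.removeIf(d -> parentId.equals(d.root));
        for (String c : childIds) {
            index.add(new Doc(c, parentId));
        }
        index.add(new Doc(parentId, parentId)); // parent last
    }

    /** Pre-fix add-and-delete path: delete by id only, and never index the new children. */
    void updateParentOnly(String parentId, List<String> childIds) {
        index.removeIf(d -> parentId.equals(d.id)); // old children survive
        index.add(new Doc(parentId, parentId));     // childIds is ignored
    }

    public static void main(String[] args) {
        BlockUpdateModel leader = new BlockUpdateModel();
        BlockUpdateModel replica = new BlockUpdateModel();
        leader.updateBlock("P", Arrays.asList("c1", "c2"));
        replica.updateBlock("P", Arrays.asList("c1", "c2"));
        // The update that shrinks the block hits a reordered DBQ on the replica only.
        leader.updateBlock("P", Arrays.asList("c1"));
        replica.updateParentOnly("P", Arrays.asList("c1"));
        System.out.println(leader.index.size() + " vs " + replica.index.size()); // 2 vs 3
    }
}
```

      In this model the fix is to route the add-and-delete path through updateBlock as well, which corresponds to the patch item about reusing the same insert-or-update logic for reordered DBQs.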

        So if they make it to disk, even though they don't have versions, they are still safe from disappearing in the future.

        AFAIK, the reordering cannot happen on the leader, so it does not affect the leader's version, only replicas. I assume any PeerSync would fail due to the fingerprint check, and the replica would eventually replicate the correct index. Yonik Seeley, could you please verify my assumption?

        manokovacs Mano Kovacs added a comment -

        Using ThrowingRunnable in tests.

        yseeley@gmail.com Yonik Seeley added a comment -

        On the other hand, addAndDelete() did not do any differentiation for block docs

        Nice catch! Looks like that oversight has been there since the original block-join patch.

        AFAIK, the reordering cannot happen on the leader, so it does not affect the leader's version, only replicas. I assume any PeerSync would fail due to the fingerprint check, and the replica would eventually replicate the correct index.

        Yeah, sounds right.

        jira-bot ASF subversion and git services added a comment -

        Commit 99188ae00c0c46d9af47b9773d492de40de4aa83 in lucene-solr's branch refs/heads/master from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=99188ae ]

        SOLR-10114: add version field to child documents, fix reordered-dbq to not drop child docs

        yseeley@gmail.com Yonik Seeley added a comment -

        Committed. Thanks!

        manokovacs Mano Kovacs added a comment -

        I failed to upload the latest patch yesterday; attaching the cleanup patch.

        manokovacs Mano Kovacs added a comment -

        Attaching a second patch with more tests; the first patch was missing a commit. It includes additional tests in PeerSyncTest.java and also makes every branch of the test run instead of being chosen randomly.

        This includes the test-cleanup patch too.

        jira-bot ASF subversion and git services added a comment -

        Commit d49edabf8992c2b2f9e2583e289cc58a4e71fd31 in lucene-solr's branch refs/heads/master from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d49edab ]

        SOLR-10114: test cleanup

        steve_rowe Steve Rowe added a comment - - edited

        git bisect points the finger at the 99188ae00c0c46d9af47b9773d492de40de4aa83 commit under this issue for reproducing TestRecovery failures - all three succeed just before this commit and fail after it. I had to remove the -Dtests.method=testCorruptLog cmdline param to get these to reproduce, so maybe there's some method order dependence here?:

        https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/18975/:

           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRecovery -Dtests.method=testCorruptLog -Dtests.seed=87E0BD7E2E527DCE -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=America/North_Dakota/Center -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   1.29s J1 | TestRecovery.testCorruptLog <<<
           [junit4]    > Throwable #1: java.lang.RuntimeException: mismatch: '3'!='0' @ response/numFound
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([87E0BD7E2E527DCE:753D09ADD46A2C12]:0)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:1006)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:953)
           [junit4]    > 	at org.apache.solr.search.TestRecovery.testCorruptLog(TestRecovery.java:1274)
           [junit4]    > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           [junit4]    > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           [junit4]    > 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           [junit4]    > 	at java.base/java.lang.reflect.Method.invoke(Method.java:543)
        [...]
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {val_i=Lucene50(blocksize=128), _root_=PostingsFormat(name=Direct), id=Lucene50(blocksize=128)}, docValues:{_version_=DocValuesFormat(name=Lucene70), val_i_dvo=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=1693, maxMBSortInHeap=6.736398983719205, sim=RandomSimilarity(queryNorm=true): {}, locale=sah-RU, timezone=America/North_Dakota/Center
           [junit4]   2> NOTE: Linux 4.4.0-53-generic i386/Oracle Corporation 9-ea (32-bit)/cpus=12,threads=1,free=183244888,total=536870912
        

        https://jenkins.thetaphi.de/job/Lucene-Solr-master-Solaris/1136/:

           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRecovery -Dtests.method=testCorruptLog -Dtests.seed=79A0B057C8C8D5DB -Dtests.slow=true -Dtests.locale=ar-OM -Dtests.timezone=Asia/Krasnoyarsk -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   0.45s J1 | TestRecovery.testCorruptLog <<<
           [junit4]    > Throwable #1: java.lang.RuntimeException: mismatch: '3'!='0' @ response/numFound
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([79A0B057C8C8D5DB:8B7D048432F08407]:0)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:1006)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:953)
           [junit4]    > 	at org.apache.solr.search.TestRecovery.testCorruptLog(TestRecovery.java:1274)
        [...]
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {val_i=BlockTreeOrds(blocksize=128), _root_=PostingsFormat(name=LuceneFixedGap), id=BlockTreeOrds(blocksize=128)}, docValues:{_version_=DocValuesFormat(name=Memory), val_i_dvo=DocValuesFormat(name=Direct)}, maxPointsInLeafNode=586, maxMBSortInHeap=6.1661193810062755, sim=RandomSimilarity(queryNorm=false): {}, locale=ar-OM, timezone=Asia/Krasnoyarsk
           [junit4]   2> NOTE: SunOS 5.11 amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=3,threads=1,free=163066016,total=536870912
        

        https://jenkins.thetaphi.de/job/Lucene-Solr-master-MacOSX/3836/:

           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRecovery -Dtests.method=testCorruptLog -Dtests.seed=A33AFB76746001BE -Dtests.slow=true -Dtests.locale=ar -Dtests.timezone=Universal -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
           [junit4] ERROR   0.66s J1 | TestRecovery.testCorruptLog <<<
           [junit4]    > Throwable #1: java.lang.RuntimeException: mismatch: '3'!='0' @ response/numFound
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([A33AFB76746001BE:51E74FA58E585062]:0)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:1006)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.assertJQ(SolrTestCaseJ4.java:953)
           [junit4]    > 	at org.apache.solr.search.TestRecovery.testCorruptLog(TestRecovery.java:1274)
        [...]
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {val_i=BlockTreeOrds(blocksize=128), _root_=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128))), id=BlockTreeOrds(blocksize=128)}, docValues:{_version_=DocValuesFormat(name=Direct), val_i_dvo=DocValuesFormat(name=Memory)}, maxPointsInLeafNode=1861, maxMBSortInHeap=5.220105645523462, sim=RandomSimilarity(queryNorm=false): {}, locale=ar, timezone=Universal
           [junit4]   2> NOTE: Mac OS X 10.11.6 x86_64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=3,threads=1,free=171695816,total=536870912
        
        steve_rowe Steve Rowe added a comment -

        FYI all three repro lines above still reproduce after commit d49edabf8992c2b2f9e2583e289cc58a4e71fd31.

        manokovacs Mano Kovacs added a comment -

        Steve Rowe, thanks for pointing it out. I just noticed that locally too. It's some interdependency between the tests that I am trying to work out.

        manokovacs Mano Kovacs added a comment - - edited

        Yeah, it's a conflict between tests.

        26292 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.u.DirectUpdateHandler2 Reordered DBQs detected.  Update=add{_version_=104,id=G4} DBQs=[DBQ{version=1017,q=id:*}]
        26331 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.s.SolrIndexSearcher Opening [Searcher@26c39adc[collection1] realtime]
        26331 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null path=null params={update.distrib=FROMLEADER&wt=json&indent=true}{add=[G4 (104)]} 0 40
        26332 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.u.DirectUpdateHandler2 Reordered DBQs detected.  Update=add{_version_=105,id=G5} DBQs=[DBQ{version=1017,q=id:*}]
        26349 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.s.SolrIndexSearcher Opening [Searcher@17fd92d4[collection1] realtime]
        26349 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.u.p.LogUpdateProcessorFactory [collection1]  webapp=null path=null params={update.distrib=FROMLEADER&wt=json&indent=true}{add=[G5 (105)]} 0 17
        26350 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.u.DirectUpdateHandler2 Reordered DBQs detected.  Update=add{_version_=106,id=G6} DBQs=[DBQ{version=1017,q=id:*}]
        26374 INFO  (TEST-TestRecovery.testCorruptLog-seed#[BC979C5AA13AC6F7]) [    ] o.a.s.s.SolrIndexSearcher Opening [Searcher@6453c66a[collection1] realtime]
        

        The id:* DBQ comes from the newly added tests. SOLR-9941 was supposed to resolve this, but it does not; I am trying to work it out, but any ideas are welcome.

        yseeley@gmail.com Yonik Seeley added a comment -

        Test methods of a test class are run in a randomized order. Transaction log tests are tricky since tests can see what previous tests left behind in the transaction logs. This is why I used different ID spaces for some of these tests (id:A1, A2, ... for one test, id:B1, B2, etc. for another test).

        Perhaps try making any delete-by-queries test-specific?
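        A minimal sketch of the ID-namespace idea, using a hypothetical helper class rather than the real TestRecovery code: each test owns a prefix, and its cleanup only matches that prefix (in query terms, roughly id:B* instead of id:*), so documents another test left in the transaction log are untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical): documents left behind in a shared transaction log by
// earlier test methods, and a delete scoped to one test's ID prefix.
public class NamespacedTestIds {

    // Builds IDs such as A1, A2, ... for the test that owns prefix "A".
    static List<String> ids(String prefix, int count) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            out.add(prefix + i);
        }
        return out;
    }

    // Simulates a delete whose query only matches IDs in one namespace.
    // Assumes single-letter prefixes, so no prefix is a prefix of another.
    static List<String> deleteNamespace(List<String> docs, String prefix) {
        List<String> remaining = new ArrayList<>();
        for (String id : docs) {
            if (!id.startsWith(prefix)) {
                remaining.add(id);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>(ids("A", 3)); // left over from test A
        docs.addAll(ids("B", 3));                         // added by test B
        // Test B cleans up only its own namespace, so test A's documents survive.
        System.out.println(deleteNamespace(docs, "B")); // prints [A1, A2, A3]
    }
}
```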

        manokovacs Mano Kovacs added a comment -

        Yeah, that could be a simple solution, but saying that we cannot test id:* for that reason is not very robust. I am thinking of creating a monotonically increasing version counter, so tests could not insert future deletes for each other. It would be a future-proof solution, I guess.
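        A sketch of the monotonic counter idea, assuming a single static counter shared by all test methods in the class. The class name is hypothetical, and this is not necessarily how SOLR-10151 would implement it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch (hypothetical, not the SOLR-10151 implementation): a single counter shared
// by all test methods in a class, so versions handed out only ever increase.
public class MonotonicTestVersions {

    private static final AtomicLong LAST = new AtomicLong(0);

    // Returns a version strictly greater than every version returned before,
    // regardless of which test method asks or in what order the methods run.
    static long next() {
        return LAST.incrementAndGet();
    }

    public static void main(String[] args) {
        long dbqFromEarlierTest = next(); // e.g. a DBQ logged by a previous test
        long addFromLaterTest = next();   // an add issued by the next test
        // The later add is always newer, so the earlier DBQ is never treated as
        // a "future" delete that must be re-applied to it.
        System.out.println(addFromLaterTest > dbqFromEarlierTest); // prints true
    }
}
```

        With hard-coded versions, one test can log a DBQ at 1017 that later looks "newer" than another test's add at 104; a shared counter rules that ordering out.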

        manokovacs Mano Kovacs added a comment -

        Adding test fix.
        I went with namespacing, as a monotonic counter would be a bigger change (I opened a JIRA for that: SOLR-10151).
        I changed the id:* delete to a delete by id and by root, since the purpose of the test was to validate that the version filter on replayed DBQs protects child docs as well.

        Sorry for the inconvenience.
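        The toy model below, with hypothetical names, illustrates what that test guards against: the _version_ restriction on a replayed DBQ cannot protect a child document that has no _version_. It assumes the restriction is expressed as "skip documents whose version is at or above the DBQ's version", so a document with no version is never skipped and is deleted whenever it matches the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model (hypothetical): re-applying a reordered delete-by-query with a
// "do not delete newer documents" restriction on _version_.
public class VersionFilteredDbq {

    // version is null for a document that has no _version_ field.
    record Doc(String id, Long version) {}

    // Deletes docs matching the query unless their version is >= the DBQ's version.
    // Assumption: a doc with no version is never treated as newer, so it can be deleted.
    static List<Doc> applyDbq(List<Doc> docs, Predicate<Doc> query, long dbqVersion) {
        List<Doc> kept = new ArrayList<>();
        for (Doc d : docs) {
            boolean newer = d.version() != null && d.version() >= dbqVersion;
            boolean deleted = query.test(d) && !newer;
            if (!deleted) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long dbqVersion = 1017;              // the replayed DBQ
        Predicate<Doc> matchAll = d -> true; // stands in for q=id:*

        // Before the fix: parent P1 was added after the DBQ (version 2000), but its
        // child C1 has no version. G4 is the reordered add at version 104.
        List<Doc> before = List.of(new Doc("P1", 2000L), new Doc("C1", null), new Doc("G4", 104L));
        System.out.println(applyDbq(before, matchAll, dbqVersion)); // only P1 survives; C1 is wrongly deleted

        // After the fix: C1 carries its parent's version and is protected as well.
        List<Doc> after = List.of(new Doc("P1", 2000L), new Doc("C1", 2000L), new Doc("G4", 104L));
        System.out.println(applyDbq(after, matchAll, dbqVersion)); // P1 and C1 survive
    }
}
```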

        jira-bot ASF subversion and git services added a comment -

        Commit 7a45f1015e73bc4f793a63be6b7414d53e008e05 in lucene-solr's branch refs/heads/master from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7a45f10 ]

        SOLR-10114: fix flakey TestRecovery

        jira-bot ASF subversion and git services added a comment -

        Commit ad9195d757c298a241ef2488b4b17623a44afdd7 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ad9195d ]

        SOLR-10114: add version field to child documents, fix reordered-dbq to not drop child docs
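        The fix gives child documents the same _version_ as their parent, so the DBQ version restriction treats the whole block alike. A sketch of that idea over a hypothetical in-memory document tree (not Solr's SolrInputDocument API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical types, not Solr's SolrInputDocument): give every document
// in a parent/child block the same _version_ as the parent.
public class BlockVersioning {

    static final class Doc {
        final String id;
        Long version; // null until assigned
        final List<Doc> children = new ArrayList<>();

        Doc(String id) {
            this.id = id;
        }

        Doc addChild(Doc child) {
            children.add(child);
            return this;
        }
    }

    // Recursively assigns the same version to a document and all of its descendants.
    static void stampVersion(Doc doc, long version) {
        doc.version = version;
        for (Doc child : doc.children) {
            stampVersion(child, version);
        }
    }

    public static void main(String[] args) {
        Doc grandchild = new Doc("GC1");
        Doc child = new Doc("C1").addChild(grandchild);
        Doc parent = new Doc("P1").addChild(child);
        stampVersion(parent, 2000L);
        System.out.println(parent.version + " " + child.version + " " + grandchild.version); // prints 2000 2000 2000
    }
}
```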

        jira-bot ASF subversion and git services added a comment -

        Commit 5c76710f08225d6909c96e584888bb6f036b4cfe in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5c76710 ]

        SOLR-10114: test cleanup

        jira-bot ASF subversion and git services added a comment -

        Commit 98133b21961c7c9672bcd85d2a2713e46f3242db in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=98133b2 ]

        SOLR-10114: fix flakey TestRecovery

        jira-bot ASF subversion and git services added a comment -

        Commit ea19bf5101817bae5b7b133a7d9d40ab41aac6ec in lucene-solr's branch refs/heads/master from Ishan Chattopadhyaya
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ea19bf5 ]

        Move solr/CHANGES.txt entries to appropriate sections after backporting SOLR-5944 and SOLR-10114


          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
            Reporter:
            yseeley@gmail.com Yonik Seeley
          • Votes:
            0
            Watchers:
            7
