Solr
  1. Solr
  2. SOLR-5709

Highlighting grouped duplicate docs from different shards with group.limit > 1 throws ArrayIndexOutOfBoundsException

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.3, 4.4, 4.5, 4.6, 6.0
    • Fix Version/s: 4.7, 6.0
    • Component/s: highlighter
    • Labels:
      None

      Description

      In a sharded (non-SolrCloud) deployment, if you index a document with the same unique key value into more than one shard, and then try to highlight grouped docs with more than one doc per group, where the grouped docs contain at least one duplicate doc pair, you get an AIOOBE.

      Here's the stack trace I got from such a situation, with 1 doc indexed into each shard in a 2-shard index, with group.limit=2:

      ERROR null:java.lang.ArrayIndexOutOfBoundsException: 1
      		at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:185)
      		at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)
      		at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      		at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
      		at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:758)
      		at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:412)
      		at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:202)
      		at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
      		at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:136)
      		at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
      		at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
      		at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:229)
      		at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
      		at org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:301)
      		at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1077)
      		at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
      		at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
      		at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
      		at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
      		at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
      		at org.eclipse.jetty.server.Server.handle(Server.java:368)
      		at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
      		at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
      		at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
      		at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
      		at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
      		at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
      		at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628)
      		at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
      		at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
      		at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
      		at java.lang.Thread.run(Thread.java:724)
      

        Activity

        Hide
        Steve Rowe added a comment -

        Simple patch fixing the issue, with tests that fail before the fix and succeed afterward.

        Here's what's wrong with the current implementation:

        1. In TopGroupsShardResponseProcessor.process(), when grouped ShardDoc-s are put into the response's resultIds unique-key => ShardDoc map, each ShardDoc's positionInResponse is set assuming that there will be no duplicate unique-key values. So when a duplicate is added, the resultIds cardinality doesn't change, but the 1-up positionInResponse does, and so is thereafter no longer valid. As a result, when the highlighting component attempts to use ShardDoc-s' positionInResponse to index into an array with the same cardinality as resultIds, an AIOOBE is thrown.
        2. When duplicate docs replace previous docs in the resultIds map, the lowest-sorted duplicate is kept, rather than the highest-sorted.

        The patch fixes both these problems.

        Show
        Steve Rowe added a comment - Simple patch fixing the issue, with tests that fail before the fix and succeed afterward. Here's what's wrong with the current implementation: In TopGroupsShardResponseProcessor.process() , when grouped ShardDoc -s are put into the response's resultIds unique-key => ShardDoc map, each ShardDoc 's positionInResponse is set assuming that there will be no duplicate unique-key values. So when a duplicate is added, the resultIds cardinality doesn't change, but the 1-up positionInResponse does, and so is thereafter no longer valid. As a result, when the highlighting component attempts to use ShardDoc -s' positionInResponse to index into an array with the same cardinality as resultIds , an AIOOBE is thrown. When duplicate docs replace previous docs in the resultIds map, the lowest-sorted duplicate is kept, rather than the highest-sorted. The patch fixes both these problems.
        Hide
        Steve Rowe added a comment -

        I'll commit tomorrow if there are no objections.

        Show
        Steve Rowe added a comment - I'll commit tomorrow if there are no objections.
        Hide
        ASF subversion and git services added a comment -

        Commit 1566743 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1566743 ]

        SOLR-5709: Highlighting grouped duplicate docs from different shards with group.limit > 1 throws ArrayIndexOutOfBoundsException

        Show
        ASF subversion and git services added a comment - Commit 1566743 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1566743 ] SOLR-5709 : Highlighting grouped duplicate docs from different shards with group.limit > 1 throws ArrayIndexOutOfBoundsException
        Hide
        ASF subversion and git services added a comment -

        Commit 1566746 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1566746 ]

        SOLR-5709: Highlighting grouped duplicate docs from different shards with group.limit > 1 throws ArrayIndexOutOfBoundsException (merged trunk r1566743)

        Show
        ASF subversion and git services added a comment - Commit 1566746 from Steve Rowe in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1566746 ] SOLR-5709 : Highlighting grouped duplicate docs from different shards with group.limit > 1 throws ArrayIndexOutOfBoundsException (merged trunk r1566743)
        Hide
        Steve Rowe added a comment -

        Committed to branch_4x and trunk.

        Show
        Steve Rowe added a comment - Committed to branch_4x and trunk.
        Hide
        ASF subversion and git services added a comment -

        Commit 1566914 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1566914 ]

        SOLR-5709: add missing license

        Show
        ASF subversion and git services added a comment - Commit 1566914 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1566914 ] SOLR-5709 : add missing license
        Hide
        ASF subversion and git services added a comment -

        Commit 1566918 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1566918 ]

        SOLR-5709: add missing license (merged trunk r1566914)

        Show
        ASF subversion and git services added a comment - Commit 1566918 from Steve Rowe in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1566918 ] SOLR-5709 : add missing license (merged trunk r1566914)
        Hide
        Rob Tulloh added a comment -

        With this patch, you don't get AIOOB, but hit count and documents returned do not match.

        I wrote a script to test this. I have 10 documents on one shard. I then duplicated 6 of the documents on another shard. Hit count reports 16, but actual documents returned is 10. I think the values should agree.

        Mon Feb 17 12:50:02 2014 numDocs 16 query cost 1.1669998168945312
        Mon Feb 17 12:50:02 2014 numDocs 0 query cost 0.1399998664855957
        Done counted 10

        Show
        Rob Tulloh added a comment - With this patch, you don't get AIOOB, but hit count and documents returned do not match. I wrote a script to test this. I have 10 documents on one shard. I then duplicated 6 of the documents on another shard. Hit count reports 16, but actual documents returned is 10. I think the values should agree. Mon Feb 17 12:50:02 2014 numDocs 16 query cost 1.1669998168945312 Mon Feb 17 12:50:02 2014 numDocs 0 query cost 0.1399998664855957 Done counted 10
        Hide
        Steve Rowe added a comment -

        Hi Rob,

        With this patch, you don't get AIOOB, but hit count and documents returned do not match.

        To which hit count are you referring?

        The grouped document count reflects all matching documents, including duplicated docs. Since the returned grouped docs include duplicates, this seems correct to me: the reported count matches the set cardinality.

        When highlighting and grouping, I don't see a hit count for the highlighted docs – I'm guessing it's the highlighted docs that you're referring to when you say "actual documents returned".

        Can you attach a response illustrating what you're talking about?

        Show
        Steve Rowe added a comment - Hi Rob, With this patch, you don't get AIOOB, but hit count and documents returned do not match. To which hit count are you referring? The grouped document count reflects all matching documents, including duplicated docs. Since the returned grouped docs include duplicates, this seems correct to me: the reported count matches the set cardinality. When highlighting and grouping, I don't see a hit count for the highlighted docs – I'm guessing it's the highlighted docs that you're referring to when you say "actual documents returned". Can you attach a response illustrating what you're talking about?
        Hide
        Rob Tulloh added a comment -

        Thank you for the question. Let me try and explain with an example.

        We use field collapsing to return all the documents associated with a single ID. We parse the results and apply highlighting for end users to see how their search terms were matched in the returned text. In the case of the test I am running, there are 10 unique IDs to be found. The fact that 6 documents are duplicated should not impact the unique number of groups returned. In fact, that is proven because we count 10 results when we iterate the results. What I would also expect is that the hit count (ngroups) would reflect this. Here is a query result to demonstrate the issue. Note that the group field is group.field=storageid

        [root@aggregator-1 solr]# wget -O- -q 'http://localhost:8983/solr/select?params={hl.requireFieldMatch=true&group.ngroups=true&group.limit=1000&isPartial=0&hl.simple.pre=<b>&hl.fl=*&wt=xml&hl=true&rows=1&EmsQueryId=INTERNAL&f.mailsubject2.qf=mailsubject&shards=archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr&start=0&q=customerid:352&f.body2.qf=body&group.field=storageid&hl.simple.post=</b>&group=true&qt=/search-any&EmsQueryTs=1392658773339}'
        

        And the output. Note the value of matches and ngroups in the output.

        <?xml version="1.0" encoding="UTF-8"?>
        <response>
        <lst name="responseHeader"><int name="status">0</int><int name="QTime">39</int><lst name="params"><str name="group.ngroups">true</str><str name="group.limit">1000</str><str name="isPartial">0</str><str name="hl.simple.pre">&lt;b&gt;</str><str name="params">{hl.requireFieldMatch=true</str><str name="hl.fl">*</str><str name="wt">xml</str><str name="hl">true</str><str name="rows">1</str><str name="EmsQueryId">INTERNAL</str><str name="f.mailsubject2.qf">mailsubject</str><str name="shards">archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr</str><str name="start">0</str><str name="q">customerid:352</str><str name="f.body2.qf">body</str><str name="group.field">storageid</str><str name="hl.simple.post">&lt;/b&gt;</str><str name="group">true</str><str name="qt">/search-any</str><str name="EmsQueryTs">1392658773339}</str></lst></lst><lst name="grouped"><lst name="storageid"><int name="matches">16</int><int name="ngroups">16</int><arr name="groups"><lst><long name="groupValue">43937</long><result name="doclist" numFound="1" start="0" maxScore="7.0955024"><doc><str name="contentid">43937</str><int name="senderid">12759</int><arr name="recipientids"><int>12741</int></arr><long name="storageid">43937</long><date name="receiveddate">2000-12-12T11:07:00Z</date><str name="mailfrom">donna.martens@enron.com</str><str name="envsender">donna.martens@enron.com</str><str name="mailto">sharon.solon@enron.com </str><int name="partitionid">1</int><str name="indexlevel">0</str><str name="mailcc">david.roensch@enron.com rich.jolly@enron.com maria.pavlou@enron.com dorothy.mccoppin@enron.com kevin.hyatt@enron.com mary.darveaux@enron.com lorraine.lindberg@enron.com steven.harris@enron.com drew.fossum@enron.com earl.chanley@enron.com ronald.matthews@enron.com arnold.eisenstein@enron.com jerry.martin@enron.com patrick.brennan@enron.com dave.waymire@enron.com keith.petersen@enron.com </str><int name="importance">1</int><date name="emaildate">2000-12-12T11:07:00Z</date><int name="customerid">352</int><int name="igen1">0</int><int name="totalsize">3780</int><bool name="isattachment">false</bool><str name="mime">text/plain</str><int name="clusterlocationid">102</int><int name="islandid">101</int><int name="size">2240</int><str name="language">en</str><str name="mailsubject_en">Re: Gallup Expansion</str><str name="mailsubject2_en">Re: Gallup Expansion</str><long name="_version_">1460308152307154944</long><date name="processingtime">2014-02-17T17:32:58.887Z</date></doc></result></lst></arr></lst></lst><lst name="highlighting"><lst name="43937"/></lst>
        </response>
        

        There are exactly 10 unique results associated with that field. I can understand matches being 16 (the number of documents matching the query), but I would expect ngroups to be 10 for the number of unique groups being returned. Our code reads ngroups and returns this as the hit count for the query so that we report to the caller the number of unique hits observed.

        I hope I have made it clear. Please let me know if I can answer any more questions.

        Show
        Rob Tulloh added a comment - Thank you for the question. Let me try and explain with an example. We use field collapsing to return all the documents associated with a single ID. We parse the results and apply highlighting for end users to see how their search terms were matched in the returned text. In the case of the test I am running, there are 10 unique IDs to be found. The fact that 6 documents are duplicated should not impact the unique number of groups returned. In fact, that is proven because we count 10 results when we iterate the results. What I would also expect is that the hit count (ngroups) would reflect this. Here is a query result to demonstrate the issue. Note that the group field is group.field=storageid [root@aggregator-1 solr]# wget -O- -q 'http://localhost:8983/solr/select?params={hl.requireFieldMatch=true&group.ngroups=true&group.limit=1000&isPartial=0&hl.simple.pre=<b>&hl.fl=*&wt=xml&hl=true&rows=1&EmsQueryId=INTERNAL&f.mailsubject2.qf=mailsubject&shards=archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr&start=0&q=customerid:352&f.body2.qf=body&group.field=storageid&hl.simple.post=</b>&group=true&qt=/search-any&EmsQueryTs=1392658773339}' And the output. Note the value of matches and ngroups in the output. <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">39</int><lst name="params"><str name="group.ngroups">true</str><str name="group.limit">1000</str><str name="isPartial">0</str><str name="hl.simple.pre">&lt;b&gt;</str><str name="params">{hl.requireFieldMatch=true</str><str name="hl.fl">*</str><str name="wt">xml</str><str name="hl">true</str><str name="rows">1</str><str name="EmsQueryId">INTERNAL</str><str name="f.mailsubject2.qf">mailsubject</str><str name="shards">archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr</str><str name="start">0</str><str name="q">customerid:352</str><str name="f.body2.qf">body</str><str name="group.field">storageid</str><str name="hl.simple.post">&lt;/b&gt;</str><str name="group">true</str><str name="qt">/search-any</str><str name="EmsQueryTs">1392658773339}</str></lst></lst><lst name="grouped"><lst name="storageid"><int name="matches">16</int><int name="ngroups">16</int><arr name="groups"><lst><long name="groupValue">43937</long><result name="doclist" numFound="1" start="0" maxScore="7.0955024"><doc><str name="contentid">43937</str><int name="senderid">12759</int><arr name="recipientids"><int>12741</int></arr><long name="storageid">43937</long><date name="receiveddate">2000-12-12T11:07:00Z</date><str name="mailfrom">donna.martens@enron.com</str><str name="envsender">donna.martens@enron.com</str><str name="mailto">sharon.solon@enron.com </str><int name="partitionid">1</int><str name="indexlevel">0</str><str name="mailcc">david.roensch@enron.com rich.jolly@enron.com maria.pavlou@enron.com dorothy.mccoppin@enron.com kevin.hyatt@enron.com mary.darveaux@enron.com lorraine.lindberg@enron.com steven.harris@enron.com drew.fossum@enron.com earl.chanley@enron.com ronald.matthews@enron.com arnold.eisenstein@enron.com jerry.martin@enron.com patrick.brennan@enron.com dave.waymire@enron.com keith.petersen@enron.com </str><int name="importance">1</int><date name="emaildate">2000-12-12T11:07:00Z</date><int name="customerid">352</int><int name="igen1">0</int><int name="totalsize">3780</int><bool name="isattachment">false</bool><str name="mime">text/plain</str><int name="clusterlocationid">102</int><int name="islandid">101</int><int name="size">2240</int><str name="language">en</str><str name="mailsubject_en">Re: Gallup Expansion</str><str name="mailsubject2_en">Re: Gallup Expansion</str><long name="_version_">1460308152307154944</long><date name="processingtime">2014-02-17T17:32:58.887Z</date></doc></result></lst></arr></lst></lst><lst name="highlighting"><lst name="43937"/></lst> </response> There are exactly 10 unique results associated with that field. I can understand matches being 16 (the number of documents matching the query), but I would expect ngroups to be 10 for the number of unique groups being returned. Our code reads ngroups and returns this as the hit count for the query so that we report to the caller the number of unique hits observed. I hope I have made it clear. Please let me know if I can answer any more questions.
        Hide
        Rob Tulloh added a comment -

        Here is an example where we get the result we expect (no duplicates involved):

        In this case 3 documents map to the same storageid parent. Notice that in this result, matches is 3 (number of documents) and ngroups is 1 (the number of unique results):

        <lst name="responseHeader"><int name="status">0</int><int name="QTime">194</int><lst name="params"><str name="group.ngroups">true</str><str name="group.limit">1000</str><str name="isPartial">0</str><str name="hl.simple.pre">&lt;b&gt;</str><str name="params">{hl.requireFieldMatch=true</str><str name="hl.fl">*</str><str name="wt">xml</str><str name="hl">true</str><str name="rows">1</str><str name="EmsQueryId">INTERNAL</str><str name="f.mailsubject2.qf">mailsubject</str><str name="shards">archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr</str><str name="start">0</str><str name="q">customerid:352</str><str name="f.body2.qf">body</str><str name="group.field">storageid</str><str name="hl.simple.post">&lt;/b&gt;</str><str name="group">true</str><str name="qt">/search-any</str><str name="fq">{!lucene}storageid:{44414 TO 44415]</str><str name="EmsQueryTs">1392658773339}</str></lst></lst><lst name="grouped"><lst name="storageid"><int name="matches">3</int><int name="ngroups">1</int><arr name="groups"><lst><long name="groupValue">44415</long><result name="doclist" numFound="3" start="0" maxScore="2.8401346">
        

        What puzzles me is why the result with duplicates doesn't group the same way. Clearly, this result shows it does work without duplicates involved. Is there an explanation for why this is the case?

        Show
        Rob Tulloh added a comment - Here is an example where we get the result we expect (no duplicates involved): In this case 3 documents map to the same storageid parent. Notice that in this result, matches is 3 (number of documents) and ngroups is 1 (the number of unique results): <lst name="responseHeader"><int name="status">0</int><int name="QTime">194</int><lst name="params"><str name="group.ngroups">true</str><str name="group.limit">1000</str><str name="isPartial">0</str><str name="hl.simple.pre">&lt;b&gt;</str><str name="params">{hl.requireFieldMatch=true</str><str name="hl.fl">*</str><str name="wt">xml</str><str name="hl">true</str><str name="rows">1</str><str name="EmsQueryId">INTERNAL</str><str name="f.mailsubject2.qf">mailsubject</str><str name="shards">archive-8.ems.labmanager.net:8983/solr,archive-6.ems.labmanager.net:8983/solr</str><str name="start">0</str><str name="q">customerid:352</str><str name="f.body2.qf">body</str><str name="group.field">storageid</str><str name="hl.simple.post">&lt;/b&gt;</str><str name="group">true</str><str name="qt">/search-any</str><str name="fq">{!lucene}storageid:{44414 TO 44415]</str><str name="EmsQueryTs">1392658773339}</str></lst></lst><lst name="grouped"><lst name="storageid"><int name="matches">3</int><int name="ngroups">1</int><arr name="groups"><lst><long name="groupValue">44415</long><result name="doclist" numFound="3" start="0" maxScore="2.8401346"> What puzzles me is why the result with duplicates doesn't group the same way. Clearly, this result shows it does work without duplicates involved. Is there an explanation for why this is the case?
        Hide
        Steve Rowe added a comment -

        Rob,

        I see what you mean: ngroups should be the same as the number of groups. This problem still happens when I don't request highlighting, so it appears to be a problem with grouping only.

        This may be related to a known issue - from the group.ngroups description on http://wiki.apache.org/solr/FieldCollapsing:

        WARNING: If this parameter is set to true on a sharded environment, all the documents that belong to the same group have to be located in the same shard, otherwise the count will be incorrect. If you are using SolrCloud, consider using "custom hashing"

        I'll do some more testing.

        Show
        Steve Rowe added a comment - Rob, I see what you mean: ngroups should be the same as the number of groups. This problem still happens when I don't request highlighting, so it appears to be a problem with grouping only. This may be related to a known issue - from the group.ngroups description on http://wiki.apache.org/solr/FieldCollapsing : WARNING: If this parameter is set to true on a sharded environment, all the documents that belong to the same group have to be located in the same shard, otherwise the count will be incorrect. If you are using SolrCloud, consider using "custom hashing" I'll do some more testing.
        Hide
        Rob Tulloh added a comment -

        Thanks for reminding to look back at the documentation. I agree that the duplicates are likely the cause of the unexpected counts because the duplicates live on different shards. The field collapsing wiki does document this limitation.

        Show
        Rob Tulloh added a comment - Thanks for reminding to look back at the documentation. I agree that the duplicates are likely the cause of the unexpected counts because the duplicates live on different shards. The field collapsing wiki does document this limitation.

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Steve Rowe
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development