Lucene - Core / LUCENE-1939

IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9.2, 3.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I tried to use the ShingleMatrixFilter within Solr. To test its functionality, I first used the built-in field analysis view. The filter was configured to be used only for query-time analysis, with "_" as the spacer character and a minimum and maximum shingle size of 2. Generating shingles for query strings with this filter seems to work in this view, but when I turned on highlighting of the indexed terms that match the query terms, the exception was thrown. Also, each time I tried to query the index, the exception was thrown immediately.

      Stacktrace:

      java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
      	at java.util.ArrayList.RangeCheck(Unknown Source)
      	at java.util.ArrayList.get(Unknown Source)
      	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
      	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
      	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
      	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
      	...
      

      Within the hasNext method, the (s-1)-th column is requested from the ArrayList columns, but columns does not contain that entry.

      I created a patch that checks whether columns contains enough entries.
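
The kind of guard described above could look roughly like the following. This is only a sketch under assumptions: ColumnCursor, hasNextUnguarded, hasNextGuarded and the interpretation of s as a required column count are hypothetical stand-ins, and the attached patch may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the filter's internal column list.
// Names and structure are assumptions for illustration, not Lucene code.
final class ColumnCursor {
    private final List<String> columns;
    private final int s; // assumed: number of columns the iterator expects

    ColumnCursor(List<String> columns, int s) {
        this.columns = new ArrayList<>(columns);
        this.s = s;
    }

    // Unguarded access: throws IndexOutOfBoundsException when
    // columns holds fewer than s entries, as in the reported trace.
    boolean hasNextUnguarded() {
        return columns.get(s - 1) != null;
    }

    // Guarded access: reports "no more elements" instead of throwing.
    boolean hasNextGuarded() {
        if (s < 1 || columns.size() < s) {
            return false;
        }
        return columns.get(s - 1) != null;
    }
}
```

With one column and s = 2, the unguarded variant fails on index 1 of a list of size 1, which is the situation shown in the stack trace, while the guarded variant simply returns false.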

        Activity

        Patrick Jungermann added a comment -

        patch

        Uwe Schindler added a comment -

        Is this caused by the rewrite because of the new TokenStream API?

        Karl Wettin added a comment -

        Is this caused by the rewrite because of the new TokenStream API?

        Nah, I think it's just an oversight in the code that was never caught before. Not sure though, so I'll write a test or two this weekend.

        Uwe Schindler added a comment -

        I also think so, because the above stack dump seems to be from 2.4.1 (in 2.9 there should be incrementToken() instead of next() for all filters listed there).

        Karl Wettin added a comment -

        I also think so, because the above stack dump seems to be from 2.4.1 (in 2.9 there should be incrementToken() instead of next() for all filters listed there).

        Ah, I misunderstood your comment. The thing is that ShingleMatrixFilter was left using the old API because of its complexity. I told whoever it was that gave it a shot that I'd look into upgrading it; I just haven't had time to do so yet. There will be a new generified and updated version of the filter any year now. At least before 3.0.

        Uwe Schindler added a comment - - edited

        Michael Busch and I updated it. It is now even more optimized and clones less often.

        edit

        Sorry, the more optimized one is the NGram filter. This one is still not the best, because it still uses Token and is not aware of custom attributes that may also need to be shingled. We left this in for compatibility (lots of public API using Token instead of raw attribute interfaces).

        Robert Muir added a comment -

        Michael Busch and I updated it. It is now even more optimized and clones less often.

        Uwe, are you sure you do not refer to ShingleFilter (versus ShingleMatrixFilter)? I think things are different for this one.

        Uwe Schindler added a comment -

        Yes, you are right: I updated/fixed ShingleFilter and Michael updated ShingleMatrix. But the NGram is also more optimized...

        Karl Wettin added a comment -

        Patrick,

        I can't manage to reproduce this error. Uwe is right though, you are getting this error using 2.4.1 or earlier, not by using 2.9.

        at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)

        Can you please try with 2.9? It would also be very helpful if you could list the applicable Solr configurations and some example data you are passing to the filter when it's thrown.

        Thanks in advance.

        Patrick Jungermann added a comment -

        Karl, you're right, sorry. I used the current release of Solr, version 1.3.0, which uses Lucene 2.4.1. Solr 1.4, which will be released soon, uses Lucene 2.9. To me, it seems that the filter did not change in the lines of code that cause the problem. But I don't know whether this is the real root cause.

        Now I have also tested this with the current trunk of Solr, which already uses Lucene 2.9. At first I tried a simple example with an analysis workflow based on the WhitespaceTokenizer followed by the ShingleMatrixFilter, and no problem occurred.

        Then I tried the other field type configuration that I had used in the earlier test, and the exception was thrown.

        Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        	at java.util.ArrayList.RangeCheck(Unknown Source)
        	at java.util.ArrayList.get(Unknown Source)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:841)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.produceNextToken(ShingleMatrixFilter.java:485)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.incrementToken(ShingleMatrixFilter.java:372)
        	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:405)
        	...
        

        To find the cause, I removed the filters one by one. After a lot of tests, I found that the problem was caused by the combination of

        1. WhitespaceTokenizer
        2. ShingleMatrixFilter
        3. RemoveDuplicatesTokenFilter

        used in that order. When I swapped the positions of the two filters, everything seemed to work.

        This time, I tested only with the field analysis view, using different data.

        Also, it was really strange that the exception occurred only on the first analysis request and, extremely rarely, a second time. But it was thrown on every first request.

        Karl Wettin added a comment -

        The exception is thrown when ts#next (incrementToken) is called again after already having returned null (false) once. So this is a nice catch!

        But this means that RemoveDuplicatesTokenFilter in Solr calls incrementToken one extra time for some reason. Can you please post the complete stacktrace so I can take a look in there too?

        I suppose the expected behaviour would be that a token stream keeps returning false when incrementToken is called again after it has already returned false, but the javadocs don't really say anything about this, nor is there a generic test case that ensures this for all filters. Thus this error might be present in other filters. I'll see if I can do something about that before committing.

        Thanks for the report Patrick!
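
The expected behaviour discussed above, and a generic check for it, might be sketched as follows. Everything here is a simplified assumption for illustration: SimpleTokenStream, ListTokenStream and ExhaustionCheck are hypothetical names, and the real Lucene TokenStream API is richer than this single-method interface.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical minimal interface; the real Lucene TokenStream API differs.
interface SimpleTokenStream {
    boolean incrementToken();
}

// A stream that remembers it is exhausted and keeps returning false.
final class ListTokenStream implements SimpleTokenStream {
    private final Iterator<String> tokens;
    private boolean exhausted = false;
    private String current;

    ListTokenStream(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (exhausted) {
            return false; // never touch internal state again once finished
        }
        if (!tokens.hasNext()) {
            exhausted = true;
            current = null;
            return false;
        }
        current = tokens.next();
        return true;
    }

    String current() {
        return current;
    }
}

// Generic check: drain the stream, then call it a few more times.
final class ExhaustionCheck {
    static int drainAndRecheck(SimpleTokenStream stream, int extraCalls) {
        int count = 0;
        while (stream.incrementToken()) {
            count++;
        }
        for (int i = 0; i < extraCalls; i++) {
            if (stream.incrementToken()) {
                throw new IllegalStateException("stream produced a token after exhaustion");
            }
        }
        return count;
    }
}
```

A check like drainAndRecheck could in principle be applied to any filter wrapped in such an interface, so that an extra call from a downstream consumer surfaces as a test failure rather than an exception at query time.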

        Patrick Jungermann added a comment -

        Here is the complete stacktrace:

        java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        
        org.apache.jasper.JasperException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:402)
        	at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:464)
        	at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:358)
        	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
        	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:367)
        	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        	at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:268)
        	at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:126)
        	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:264)
        	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        	at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        	at org.mortbay.jetty.Server.handle(Server.java:285)
        	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        	at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
        Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        	at java.util.ArrayList.RangeCheck(Unknown Source)
        	at java.util.ArrayList.get(Unknown Source)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:841)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.produceNextToken(ShingleMatrixFilter.java:485)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.incrementToken(ShingleMatrixFilter.java:372)
        	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:401)
        	at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:405)
        	at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
        	at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:80)
        	at org.apache.jsp.admin.analysis_jsp.getTokens(org.apache.jsp.admin.analysis_jsp:104)
        	at org.apache.jsp.admin.analysis_jsp._jspService(org.apache.jsp.admin.analysis_jsp:681)
        	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:80)
        	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:373)
        	... 29 more
        
        RequestURI=/solr/admin/analysis.jsp
        
        Karl Wettin added a comment -

        Committed in 821888.

        Thanks Patrick!

        (I'll consider the other stuff mentioned in the issue later this week and, if manageable, handle it as a new issue.)

        Uwe Schindler added a comment -

        Committed into 2.9 branch revision: 899681


          People

          • Assignee:
            Karl Wettin
            Reporter:
            Patrick Jungermann
          • Votes:
            0
            Watchers:
            0
