Solr / SOLR-1731

ArrayIndexOutOfBoundsException when highlighting

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: highlighter
    • Labels:
      None

      Description

      I'm seeing a java.lang.ArrayIndexOutOfBoundsException when trying to highlight for certain queries. The error seems to be caused by the combination of the ShingleFilterFactory, PositionFilterFactory and LengthFilterFactory in the query analyzer.

      Here's my fieldType definition:

      <fieldType name="textSku" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
        <analyzer type="index">
          <tokenizer class="solr.KeywordTokenizerFactory" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          <filter class="solr.LengthFilterFactory" min="2" max="100"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory" />
          <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="true"/>
          <filter class="solr.PositionFilterFactory" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          <filter class="solr.LengthFilterFactory" min="2" max="100"/> <!-- works if this is commented out -->
        </analyzer>
      </fieldType>

      Here's the field definition:

      <field name="sku_new" type="textSku" indexed="true" stored="true" omitNorms="true"/>

      Here's a sample doc:

      <add>
        <doc>
          <field name="id">1</field>
          <field name="sku_new">A 1280 C</field>
        </doc>
      </add>

      Doing a query for sku_new:"A 1280 C" and requesting highlighting throws the exception (full stack trace below):

      http://localhost:8983/solr/select/?q=sku_new%3A%22A+1280+C%22&version=2.2&start=0&rows=10&indent=on&&hl=on&hl.fl=sku_new&fl=*
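For anyone scripting the reproduction, the request above can be assembled programmatically. This is only a sketch: the helper name `build_highlight_url` is made up for illustration, and the base URL simply assumes the same local Solr instance used in the report.

```python
from urllib.parse import urlencode

# Assumption: same local Solr endpoint as in the report.
BASE_URL = "http://localhost:8983/solr/select/"


def build_highlight_url(field: str, phrase: str) -> str:
    """Build a select URL that runs a phrase query on `field` with highlighting enabled."""
    params = [
        ("q", f'{field}:"{phrase}"'),  # phrase query, e.g. sku_new:"A 1280 C"
        ("version", "2.2"),
        ("start", "0"),
        ("rows", "10"),
        ("indent", "on"),
        ("hl", "on"),                  # turn highlighting on
        ("hl.fl", field),              # highlight the queried field
        ("fl", "*"),
    ]
    return BASE_URL + "?" + urlencode(params)


url = build_highlight_url("sku_new", "A 1280 C")
print(url)
```

`urlencode` percent-encodes the colon and quotes and turns spaces into `+`, which matches the `q` parameter in the URL above. It also encodes the `*` in `fl`, which should be equivalent for the server.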

      If I comment out the LengthFilterFactory in my query analyzer section, everything seems to work. Commenting out just the PositionFilterFactory also makes the exception go away, at least for this specific query.
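To make the interaction between these filters easier to reason about, here is a rough Python approximation of the query analyzer chain. It is not Solr code and is not a confirmed diagnosis. It assumes that the position filter leaves the first token's position increment at 1 and sets the rest to 0, that catenateAll plus lowercasing can be imitated by dropping non-alphanumeric characters, and that the length filter removes tokens without adjusting the increments of the survivors. All of these are assumptions about Solr 1.4 behavior that should be checked against the actual source; the function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term text, position increment)


def shingles(words: List[str], max_size: int) -> List[str]:
    """Return unigrams and word n-grams (up to max_size words), grouped by start word."""
    out = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size > len(words):
                break
            out.append(" ".join(words[start:start + size]))
    return out


def approximate_query_chain(text: str, min_len: int = 2, max_len: int = 100) -> List[Token]:
    """Hypothetical stand-in for the query analyzer in the report; not Solr code."""
    # Whitespace tokenizer followed by shingles with unigrams (maxShingleSize=8).
    terms = shingles(text.split(), max_size=8)
    # Position filter (assumed): first token keeps increment 1, all others get 0.
    tokens = [(term, 1 if i == 0 else 0) for i, term in enumerate(terms)]
    # catenateAll + lowercase, approximated by removing non-alphanumerics.
    tokens = [("".join(ch for ch in term if ch.isalnum()).lower(), inc) for term, inc in tokens]
    # Duplicate removal, approximated by keeping the first occurrence of each term.
    seen = set()
    unique = []
    for term, inc in tokens:
        if term not in seen:
            seen.add(term)
            unique.append((term, inc))
    # Length filter (assumed not to touch the increments of surviving tokens).
    return [(term, inc) for term, inc in unique if min_len <= len(term) <= max_len]


surviving = approximate_query_chain("A 1280 C")
for term, inc in surviving:
    print(term, inc)
```

Under these assumptions the only token with a non-zero increment ("a") is dropped by the length filter, so every surviving token has increment 0. Whether that is what leads to the -1 index in WeightedSpanTermExtractor is unverified, but it would be consistent with the exception disappearing when either the LengthFilterFactory or the PositionFilterFactory is removed.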

      Full stack trace:

      java.lang.ArrayIndexOutOfBoundsException: -1
      at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:202)
      at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
      at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
      at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
      at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
      at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
      at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
      at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
      at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
      at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
      at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
      at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
      at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
      at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
      at org.mortbay.jetty.Server.handle(Server.java:285)
      at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
      at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
      at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
      at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
      at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
      at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
      at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

        Activity

        Koji Sekiguchi added a comment -

        Can't you use WhitespaceTokenizer for index?

        Tim Underwood added a comment -

        I don't think that gives me what I want. I'm indexing skus that might be in various formats (I'll usually only know about 1 or 2 of the formats). e.g.:

        A 1280 C
        A 12 80 C
        A12 80C
        A1280C
        A-1280-C
        A.1280.C

        As far as my application cares, those are all equivalent and should just be indexed as:

        a1280c

        On the query side of things I want to match any of the formats listed above plus stuff like:

        Foo Bar A 12 80 C
        Foo Bar A 1280 C
        Foo Bar A1280C

        My current setup seems to do a pretty good job of matching. If I use the WhitespaceTokenizer for index (and disable the LengthFilterFactory) then I end up with different terms being indexed depending on the format of the sku:

        A 1280 C => a, 1280, c
        A12 80C => a12, 80c
        A 12 80 C => a, 12, 80, c
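The difference between the two index-side approaches can be sketched in a few lines of Python. This is an illustration only: `normalize_sku` imitates the effect of the keyword tokenizer plus catenateAll and lowercasing by stripping non-alphanumeric characters, and `whitespace_terms` imitates a whitespace tokenizer with lowercasing. Both function names are hypothetical and neither is Solr code.

```python
from typing import List


def normalize_sku(raw: str) -> str:
    """Collapse a SKU to one lowercase alphanumeric term (whole-field approach)."""
    return "".join(ch for ch in raw if ch.isalnum()).lower()


def whitespace_terms(raw: str) -> List[str]:
    """Split on whitespace and lowercase each piece (whitespace-tokenizer approach)."""
    return [part.lower() for part in raw.split()]


formats = ["A 1280 C", "A 12 80 C", "A12 80C", "A1280C", "A-1280-C", "A.1280.C"]

for sku in formats:
    print(f"{sku!r}: whole field => {normalize_sku(sku)}, whitespace => {whitespace_terms(sku)}")
```

With the whole-field approach all six formats reduce to the same term, while splitting on whitespace yields term lists that depend on how the SKU happened to be written.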

        Koji Sekiguchi added a comment -

        So why don't you use uni-grams on both index and query for the sku field?

        <fieldType name="text_1g" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
                <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>
        

        Quoting Tim Underwood: "As far as my application cares, those are all equivalent and should just be indexed as: a1280c"

        To eliminate space/period/hyphen, mapping.txt would look like:

        " " => ""
        "." => ""
        "-" => ""
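The effect of this suggestion can be sketched in Python as a character mapping followed by splitting into single-character grams and lowercasing. This is an approximation for illustration, not Solr code; `MAPPING`, `apply_mapping` and `unigrams` are hypothetical names, and the mapping table mirrors the three mapping.txt rules above.

```python
from typing import Dict, List

# Mirrors the three mapping.txt rules: drop spaces, periods and hyphens.
MAPPING: Dict[str, str] = {" ": "", ".": "", "-": ""}


def apply_mapping(text: str, mapping: Dict[str, str] = MAPPING) -> str:
    """Replace each mapped character; leave all other characters untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


def unigrams(text: str) -> List[str]:
    """Apply the mapping, then emit one lowercase term per remaining character."""
    return [ch.lower() for ch in apply_mapping(text)]


for sku in ["A 1280 C", "A12 80C", "A-1280-C", "A.1280.C"]:
    print(sku, "=>", unigrams(sku))
```

Every format then produces the same sequence of single-character terms. Note that in the configuration as posted the charFilter appears only in the index analyzer; the sketch applies the mapping unconditionally.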
        
        Leonhard Maylein added a comment -

        We have the same problem whenever we search for a word which has synonyms defined.


          People

          • Assignee: Unassigned
          • Reporter: Tim Underwood
          • Votes: 0
          • Watchers: 0
