Lucene - Core: LUCENE-2208

Token div exceeds length of provided text sized 4114

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      diagnostics =

      {os.version=5.1, os=Windows XP, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
    • Lucene Fields:
      New

      Description

      I have a document that contains HTML markup. I want to strip the HTML tags to get clean text, and then apply the highlighter to that clean text. However, the highlighter throws an exception if I strip out the HTML; if I don't strip it out, it works fine. This confuses me at the moment.

      I have copied and pasted three things here from the console, as they may contain special characters that could be causing the problem.

      1) Here is the HTML text:

      <h2>Starter</h2>
      <div id="tab1-content" class="tabContent selected">
      <div class="head"></div>
      <div class="body">
      <div class="subject-header">Learning path: History</div>
      <h3>Key question</h3>
      <p>Did transport fuel the industrial revolution?</p>
      <h3>Learning Objective</h3>
      <ul>
      <li>To categorise points as for or against an argument</li>
      </ul>
      <p>
      <h3>What to do?</h3>
      <ul>
      <li>Watch the clip: <em>Transport fuelled the industrial revolution.</em></li>
      </ul>
      <p>The clips claims that transport fuelled the industrial revolution. Some historians argue that the industrial revolution only happened because of developments in transport.</p>
      <ul>
      <li>Read the statements below and decide which points are <em>for</em> and which points are <em>against</em> the argument that industry expanded in the 18th and 19th centuries because of developments in transport.</li>
      </ul>

      <ol type="a">
      <li>Industry expanded because of inventions and the discovery of steam power.</li>
      <li>Improvements in transport allowed goods to be sold all over the country and all over the world so there were more customers to develop industry for.</li>
      <li>Developments in transport allowed resources, such as coal from mines and cotton from America to come together to manufacture products.</li>
      <li>Transport only developed because industry needed it. It was slow to develop as money was spent on improving roads, then building canals and the replacing them with railways in order to keep up with industry.</li>
      </ol>

      <p>Now try to think of 2 more statements of your own.</p>

      </div>
      <div class="foot"></div>
      </div>
      <h2>Main activity</h2>
      <div id="tab2-content" class="tabContent">
      <div class="head"></div>
      <div class="body"><div class="subject-header">Learning path: History</div>
      <h3>Learning Objective</h3>
      <ul>
      <li>To select evidence to support points</li>
      </ul>
      <h3>What to do?</h3>
      <!--<ul>
      <li>Watch the clip: <em>Windmill and water mill</em></li>
      </ul>-->
      <ul><li>Choose the 4 points that you think are most important - try to be balanced by having two <strong>for</strong> and two <strong>against</strong>.</li>
      <li>Write one in each of the point boxes of the paragraphs on the sheet <a href="lp_history_industry_transport_ws1.html" class="link-internal">Constructing a balanced argument</a>.</li></ul> <p>You might like to re write the points in your own words and use connectives to link the paragraphs.</p>

      <p>In history and in any argument, you need evidence to support your points.</p>
      <ul><li>Find evidence from these sources and from your own knowledge to support each of your points:</li></ul>
      <ol>
      <li><a href="../servlet/link?template=vid&macro=setResource&resourceID=2044" class="link-internal">At a toll gate</a></li>
      <li><a href="../servlet/link?macro=setResource&template=vid&resourceID=2046" class="link-internal">Canals</a></li>
      <li><a href="../servlet/link?macro=setResource&template=vid&resourceID=2043" class="link-internal">Growing cities: traffic</a></li>
      <li><a href="../servlet/link?macro=setResource&template=vid&resourceID=2047" class="link-internal">Impact of the railway</a> </li>
      <li><a href="../servlet/link?macro=setResource&template=vid&resourceID=2048" class="link-internal">Sailing ships</a> </li>
      <li><a href="../servlet/link?macro=setResource&template=vid&resourceID=2050" class="link-internal">Liverpool: Capital of Culture</a> </li>
      </ol>
      <p>Try to be specific in your evidence - use named examples of places or people. Use dates if you can.</p>
      </div>
      <div class="foot"></div>
      </div>
      <h2>Plenary</h2>
      <div id="tab3-content" class="tabContent">
      <div class="head"></div>
      <div class="body"><div class="subject-header">Learning path: History</div>
      <h3>Learning Objective</h3>
      <ul>
      <li>To judge which of the arguments is most valid</li>
      </ul>
      <h3>What to do?</h3>
      <!-- <ul>
      <li>Watch the clip: <em>Food of the rich</em></li>
      </ul>-->
      <p>In order to be a good historian, and get good marks in exams, you need to show your evaluation skills and make a judgement. Having been through the evidence which point do you think is most important? Why? Is there more evidence? Is the evidence more convincing?</p>
      <ul><li>In the final box on your worksheet write a conclusion explaining whether on balance the evidence is enough to convince you that transport fuelled the industrial revolution.</li></ul>
      </div>
      <div class="foot"></div>
      </div>
      <h2>Extension</h2>
      <div id="tab4-content" class="tabContent">
      <div class="head"></div>
      <div class="body"><div class="subject-header">Learning path: History</div>
      <h3>What to do?</h3>
      <p>Watch the clip <em>Stress in a ski resort</em></p>
      <p>New industries, such as tourism, can now be said to be fuelled by transport improvements.</p>
      <ul><li>Search Clipbank, using the Related clip lists as well as the search function, to find examples from around the world of how transport has helped industry.</li></ul>
      </div>
      <div class="foot"></div>
      </div>

      2) Here is the text after the HTML tags have been stripped out:

      Starter

      Learning path: History
      Key question
      Did transport fuel the industrial revolution?
      Learning Objective

      To categorise points as for or against an argument

      What to do?

      Watch the clip: Transport fuelled the industrial revolution.

      The clips claims that transport fuelled the industrial revolution. Some historians argue that the industrial revolution only happened because of developments in transport.

      Read the statements below and decide which points are for and which points are against the argument that industry expanded in the 18th and 19th centuries because of developments in transport.

      Industry expanded because of inventions and the discovery of steam power.
      Improvements in transport allowed goods to be sold all over the country and all over the world so there were more customers to develop industry for.
      Developments in transport allowed resources, such as coal from mines and cotton from America to come together to manufacture products.
      Transport only developed because industry needed it. It was slow to develop as money was spent on improving roads, then building canals and the replacing them with railways in order to keep up with industry.

      Now try to think of 2 more statements of your own.

      Main activity

      Learning path: History
      Learning Objective

      To select evidence to support points

      What to do?

      Choose the 4 points that you think are most important - try to be balanced by having two for and two against .
      Write one in each of the point boxes of the paragraphs on the sheet Constructing a balanced argument . You might like to re write the points in your own words and use connectives to link the paragraphs.

      In history and in any argument, you need evidence to support your points.
      Find evidence from these sources and from your own knowledge to support each of your points:

      At a toll gate
      Canals
      Growing cities: traffic
      Impact of the railway
      Sailing ships
      Liverpool: Capital of Culture

      Try to be specific in your evidence - use named examples of places or people. Use dates if you can.

      Plenary

      Learning path: History
      Learning Objective

      To judge which of the arguments is most valid

      What to do?

      In order to be a good historian, and get good marks in exams, you need to show your evaluation skills and make a judgement. Having been through the evidence which point do you think is most important? Why? Is there more evidence? Is the evidence more convincing?
      In the final box on your worksheet write a conclusion explaining whether on balance the evidence is enough to convince you that transport fuelled the industrial revolution.

      Extension

      Learning path: History
      What to do?
      Watch the clip Stress in a ski resort
      New industries, such as tourism, can now be said to be fuelled by transport improvements.
      Search Clipbank, using the Related clip lists as well as the search function, to find examples from around the world of how transport has helped industry.

      3) Here is the exception I get:

      org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token div exceeds length of provided text sized 4114
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
      at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:158)
      at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:462)

      1. LUCENE-2208_test.patch
        1 kB
        Robert Muir
      2. LUCENE-2208.patch
        6 kB
        Hsiu Wang

          Activity

          Luke Forehand added a comment -

          I just opened a bug for what appears to be the same issue in the SOLR project:

          https://issues.apache.org/jira/browse/SOLR-1883

          There you will find an attachment of the document that I am attempting to query with a highlight query but it fails. I also pasted my schema.xml and the exception stacktrace.

          Luke Forehand added a comment -

          Sorry for what seems like cross posting, but I'm not sure if the issue is in SOLR or Lucene. I have uploaded to SOLR-1883 an easy to execute test case that consistently reproduces this bug. I also confirm (like the original reporter of this issue) that removing the HTMLStripCharFilterFactory during indexing fixes the highlight query problem, but obviously we need to strip the HTML so this isn't a solution.

          Robert Muir added a comment -

          Definitely looks like a bug in HTMLStripCharFilter.

          Attached is a simple test case demonstrating the bug.

          Hsiu Wang added a comment - - edited

          Added a patch (LUCENE-2208.patch) to fix org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

          The exception is caused by HTML escape sequences (e.g., &#38;, &amp;), which are counted as 1 character by text.length() in Highlighter.getBestTextFragments but as N characters in HTMLStripCharFilter (&#38; is counted as 5).

          In the patch, I commented out an incorrect test case in HTMLStripCharFilterTest.testOffset() ("X & X ( X < > X"). The commented-out test case is covered by Robert's test patch.
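
To illustrate the kind of mismatch described above, here is a minimal Java sketch. It is not taken from the patch or from HTMLStripCharFilter; the class `EntityOffsetSketch`, its `Decoded` type, and the `decode` helper are hypothetical names, and only the two entities mentioned in the comment are handled. It shows that a token offset expressed against the raw (entity-encoded) input can exceed the length of the decoded text, which is the condition under which the highlighter reports that a token exceeds the length of the provided text.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; this is not the HTMLStripCharFilter implementation. */
final class EntityOffsetSketch {

    /** Decoded text plus the raw-input index at which each decoded char starts. */
    static final class Decoded {
        final String text;
        final int[] rawStart;
        final int rawLength;

        Decoded(String text, int[] rawStart, int rawLength) {
            this.text = text;
            this.rawStart = rawStart;
            this.rawLength = rawLength;
        }

        /** Maps an offset in the decoded text back to an offset in the raw input. */
        int correctOffset(int decodedOffset) {
            return decodedOffset == text.length() ? rawLength : rawStart[decodedOffset];
        }
    }

    /** Decodes only "&amp;" and "&#38;" (both 5 raw chars) into a single '&'. */
    static Decoded decode(String raw) {
        StringBuilder out = new StringBuilder();
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < raw.length()) {
            starts.add(i); // raw index where the next decoded char begins
            if (raw.startsWith("&amp;", i) || raw.startsWith("&#38;", i)) {
                out.append('&');
                i += 5;
            } else {
                out.append(raw.charAt(i));
                i++;
            }
        }
        int[] rawStart = new int[starts.size()];
        for (int k = 0; k < rawStart.length; k++) {
            rawStart[k] = starts.get(k);
        }
        return new Decoded(out.toString(), rawStart, raw.length());
    }

    public static void main(String[] args) {
        String raw = "a &amp; b";
        Decoded d = decode(raw);
        // Token "b" occupies [4, 5) in the decoded text.
        int start = d.correctOffset(4);
        int end = d.correctOffset(5);
        System.out.println("decoded length = " + d.text.length());
        System.out.println("raw length     = " + raw.length());
        System.out.println("token b (raw offsets) = [" + start + ", " + end + ")");
        // Raw offsets compared against the decoded length overshoot it.
        System.out.println("end > decoded length: " + (end > d.text.length()));
    }
}
```

The point of the sketch is that offsets and text length must be measured in the same coordinate system: either both against the raw input or both against the decoded text.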

          Robert Muir added a comment -

          Hi Hsiu, thanks for uploading a patch to the issue.

          Koji, maybe you can take a look at this when you get a chance?
          It seems to be an important bug to fix, but I don't know HTMLStripCharFilter that well.

          Ahmet Arslan added a comment -

          Hello, I am using a very recent trunk and received the same exception (InvalidTokenOffsetsException) with PatternReplaceCharFilter. I had observed that HTMLStripCharFilter sometimes causes the wrong words to be highlighted, so I was experimenting with PatternReplaceCharFilter to remove the HTML tags, hoping that highlighting would not break this time.

          I remember that the tokenizer versions of HTMLStrip had problems with highlighting, and it seems these have carried over to the char filters. Hsiu Wang, do you think the cause of HTMLStripCharFilter highlighting the wrong words is the same as the one you explained here?

          Vadim Kisselmann added a comment - - edited

          Same problem here with Solr 4.0, nightly build "apache-solr-4.0-2011-08-03_08-19-32".
          Interestingly, when I delete the sort params, the query works.

          xxx.:8983/solr/clustering&rows=40&start=0&fl=content+pubDate&sort=pubDate+desc&q=london

          Two log snippets:
          Http - 500 Internal Server Error
          Error: Carrot2 clustering failed........

          And this was caused by:
          Http - 500 Internal Server Error
          Error: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token the exceeds length of provided text sized 41

          Matan Zinger added a comment -

          Hello Guys,

          I am blocked by this bug as well.

          Is there any update or progress on this?

          Thank you in advance...

          Mark Rosenberg added a comment -

          Same problem here with 3.4. Removing HTMLStripCharFilterFactory solves the problem. Turns out I don't need this filter anyway, as I remove the HTML markup when my data import handler pulls the data from an Sqlite3 DB.

          Vadim Kisselmann added a comment -

          Unfortunately, the attached patch doesn't work. Removing HTMLStripCharFilterFactory avoids the problem, but it is not a solution: I need this filter.

          Robert Muir added a comment -

          This filter has at least 2 bugs:

          • wrong end-offsets that sometimes go past the end of the actual text
          • wrong final offsets that are sometimes shorter than the actual text

          I committed a (disabled) random test exposing these issues.
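
The two failure modes listed above can be expressed as simple invariants. The following Java sketch is hypothetical and is not the committed random test; the class `OffsetCheckSketch` and the method `findProblems` are names invented for illustration. It assumes each token is given as a `{start, end}` pair and that the final offset reported after the last token is expected to equal the length of the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical offset sanity check; not the test committed to Lucene. */
final class OffsetCheckSketch {

    /** Returns a description of every offset problem found; empty if none. */
    static List<String> findProblems(int[][] tokenOffsets, int finalOffset, int textLength) {
        List<String> problems = new ArrayList<>();
        for (int[] t : tokenOffsets) {
            int start = t[0];
            int end = t[1];
            if (start < 0 || end < start) {
                problems.add("malformed offsets " + start + "-" + end);
            } else if (end > textLength) {
                // First failure mode: end offset past the end of the text.
                problems.add("end offset " + end + " exceeds text length " + textLength);
            }
        }
        if (finalOffset != textLength) {
            // Second failure mode: final offset does not match the text length.
            problems.add("final offset " + finalOffset + " != text length " + textLength);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Well-formed offsets for a 5-character text.
        System.out.println(findProblems(new int[][] {{0, 1}, {4, 5}}, 5, 5));
        // One token overshoots the text and the final offset is too short.
        System.out.println(findProblems(new int[][] {{8, 9}}, 3, 5));
    }
}
```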

          Steve Rowe added a comment - - edited

          The JFlex-based HTMLStripCharFilter replacement at LUCENE-3690 should fix the offset problems reported here. (It passes the assertLegalOffsets() test in Robert Muir's test patch on this issue, as well as the random test Robert added to HTMLStripCharFilterTest.)

          Vadim Kisselmann added a comment -

          Hello Steven, thanks, I'll test it.
          Is this right: with this patch (LUCENE-3690.patch) I can keep using my HTMLStripCharFilterFactory without any changes to my schema.xml?

          Steve Rowe added a comment -

          Hi Vadim,

          This patch is against trunk (which will eventually be released as v4.0). Also, it's not yet set up to be a replacement for the existing HTMLStripCharFilter, so even if you apply this patch to trunk, it still won't work for you.

          I will post a patch in the next day or two that you should be able to test; I'll add a note here when it's ready.

          Vadim Kisselmann added a comment -

          Hi Steven,
          OK, thanks for your feedback.
          It would be great if you could post a patch; I'm somewhat blocked by this bug.
          Cheers, Vadim

          Steve Rowe added a comment -

          Fixed by LUCENE-3690.


            People

            • Assignee: Steve Rowe
            • Reporter: Ramazan VARLIKLI
            • Votes: 5
            • Watchers: 6
