SOLR-42

Highlighting problems with HTMLStripWhitespaceTokenizerFactory

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: highlighter
    • Labels:
      None

      Description

      Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).

      Example title field:

      <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia

      Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
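
      For concreteness: the tags preceding "fabrics" are <SUP>, </SUP>, <SUP> and </SUP>, whose lengths sum to 5 + 6 + 5 + 6 = 22. A minimal standalone sketch (an illustration only, not Solr code) reproduces the drift by stripping the tags naively and comparing offsets:

      public class OffsetDriftDemo {
          public static void main(String[] args) {
              String original = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                      + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
              // Strip the tags without preserving offsets, mimicking what the
              // strip-then-tokenize pipeline effectively does.
              String stripped = original.replaceAll("</?SUP>", "");

              int strippedPos = stripped.indexOf("fabrics"); // offset the tokenizer emits: 41
              int originalPos = original.indexOf("fabrics"); // offset the highlighter needs: 63
              System.out.println(originalPos - strippedPos); // 22, the summed tag lengths
          }
      }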

      Response from Yonik on the solr-user mailing-list:

      HTMLStripWhitespaceTokenizerFactory works in two phases...
      HTMLStripReader removes the HTML and passes the result to
      WhitespaceTokenizer... at that point, Tokens are generated, but the
      offsets will correspond to the text after HTML removal, not before.

      I did it this way so that HTMLStripReader could go before any
      tokenizer (like StandardTokenizer).

      Can you open a JIRA bug for this? The fix would be a special version
      of HTMLStripReader integrated with a WhitespaceTokenizer to keep
      offsets correct.

      Attachments

      1. htmlStripReaderTest.html
        13 kB
        Grant Ingersoll
      2. HTMLStripReaderTest.java
        2 kB
        Grant Ingersoll
      3. HtmlStripReaderTestXmlProcessing.patch
        2 kB
        Chris Harris
      4. HtmlStripReaderTestXmlProcessing.patch
        1 kB
        Chris Harris
      5. SOLR-42.patch
        18 kB
        Grant Ingersoll
      6. SOLR-42.patch
        3 kB
        Grant Ingersoll
      7. SOLR-42.patch
        15 kB
        Grant Ingersoll
      8. SOLR-42.patch
        5 kB
        Grant Ingersoll
      9. TokenPrinter.java
        2 kB
        Chris Harris

          Activity

          Hoss Man added a comment -

          Suggestion from Mirko on solr-dev: change HTMLStripReader to replace striped HTML with equal length whitespace.

          (this could possibly be made a constructor option)
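
          A minimal sketch of the idea (illustrative only; the class and regex here are made up, not the real HTMLStripReader): overwrite each stripped tag with the same number of spaces so that every surviving character keeps its original offset.

          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          class PaddingStripSketch {
              private static final Pattern TAG = Pattern.compile("<[^>]*>");

              // Replace each tag with an equal run of spaces; the output has the
              // same length as the input, so downstream offsets stay correct.
              static String stripPreservingOffsets(String html) {
                  StringBuilder out = new StringBuilder(html);
                  Matcher m = TAG.matcher(html);
                  while (m.find()) {
                      for (int i = m.start(); i < m.end(); i++) {
                          out.setCharAt(i, ' ');
                      }
                  }
                  return out.toString();
              }
          }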

          Paul Fryer added a comment -

          Any update on this? I was hoping to use Solr to do full text search of HTML documents and provide snippets with search terms highlighted. I assume I won't be able to provide accurate HTML snippet highlighting until this issue is resolved - correct? If so, what sort of time frame would we be looking at? Anything I can contribute to help?

          Hoss Man added a comment -

          the most helpful contribution would be a patch containing a JUnit test demonstrating the problem and the necessary fix

          http://wiki.apache.org/solr/HowToContribute

          J.J. Larrea added a comment - edited

          Here is the workaround I am using, along with a long comment explaining why:

          solrconfig.xml
          		<!--
          			Special-case stuff for HTML tags in Abstract field:
          			Originally we had
          				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          			but pre-stripping destroys offsets needed for highlighting.
          			Tried an HTML tag-extraction RegEx as a post-process
          				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
          				<filter class="solr.PatternReplaceFilterFactory"
          					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
          					replacement=""
          					replace="all"/>
          			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
          			One solution is to split on whitespace or tag delimiters (making tags
          			into text), and either index the tags or use StopFilter to remove 'em.
          			But the chosen solution is to swallow an entire chain of tags and any whitespace
          			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
          			of whitespace as normal.
          
          		-->
          			<tokenizer class="solr.PatternTokenizerFactory"
          					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
          			<filter class="solr.ISOLatin1AccentFilterFactory"/>
          			...
          

          Without the XML encoding, the RegEx is:

          (?:\s*</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+

          and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.

          Of course this is just a hack: It doesn't have any real understanding of HTML or XML syntax, so something invalid like </closing attr="x"/> will get matched. On the other hand, < and > in text will be left alone.

          Also note it doesn't decode XML or HTML numeric or symbolic entity references, as HTMLStripReader does (my indexer is pre-decoding the entity references before sending the text to Solr for indexing).

          So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the right thing with offsets would still be a worthy task. I wonder whether recasting HTMLStripReader using the org.apache.lucene.analysis.standard.CharStream interface would make sense for this?
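
          For anyone who wants to sanity-check the pattern outside Solr, here is a standalone harness (my own, not attached to this issue) that applies the decoded delimiter regex and prints each token with its offsets into the original, unstripped text:

          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class PatternTokenizerCheck {
              public static void main(String[] args) {
                  // The decoded delimiter regex from the solrconfig.xml snippet above.
                  Pattern delim = Pattern.compile(
                      "(?:\\s*</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>\\s*)++|\\s+");
                  String text = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating";
                  Matcher m = delim.matcher(text);
                  int pos = 0;
                  while (m.find()) {
                      // Tokens are the spans between delimiter matches.
                      if (m.start() > pos) {
                          System.out.println(text.substring(pos, m.start()) + " " + pos + "-" + m.start());
                      }
                      pos = m.end();
                  }
                  if (pos < text.length()) {
                      System.out.println(text.substring(pos) + " " + pos + "-" + text.length());
                  }
              }
          }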

          Stuart Sierra added a comment -

          Another workaround is simply to strip HTML tags and entities before sending the document to Solr. This gives correct highlights. But if you want to retrieve the original HTML document you'll need to stash it somewhere else.

          Grant Ingersoll added a comment -

          Might a solution to this be to replace the removed characters with whitespace, thus preserving the input offsets?

          Grant Ingersoll added a comment -

          Here's the start of a test for the HTMLStripReader. Note, in order to incorporate my suggestion of replacing HTML tags w/ whitespace, this test would need to be modified.

          Grant Ingersoll added a comment - edited

          Patch that replaces HTML cruft with whitespace, thus preserving highlighting at the expense of extra characters. Also included a test file and test case.

          NOTE: This is not backward-compatible with old HTMLStripReader, but seeing how it requires re-indexing to use anyway, I don't see that it is a concern.

          Grant Ingersoll added a comment -

          Hmm, I don't know if this completely solves the problem even though the code seems like it is right. I am still having trouble w/ the offsets.

          Has anyone else tried the patch?

          Yonik Seeley added a comment -

          Grant, I'm getting a test failure... did you forget to "svn add" some files?

          <error message="src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)" type="java.io.FileNotFoundException">java.io.FileNotFoundException: src\test\test-files\htmlStripReaderTest.html (The system cannot find the path specified)
          at java.io.FileInputStream.open(Native Method)
          at java.io.FileInputStream.<init>(FileInputStream.java:106)
          at java.io.FileReader.<init>(FileReader.java:55)
          at org.apache.solr.analysis.HTMLStripReaderTest.testHTML(HTMLStripReaderTest.java:65)

          Grant Ingersoll added a comment -

          There is definitely still an issue w/ the HTMLStripReader in some circumstances, working on a test now.

          Grant Ingersoll added a comment -

          Here is the file. It shows as being added in svn stat, but I'm not sure why it wasn't included.

          It goes in src/test/test-files

          It's just a copy of site/index.html

          Grant Ingersoll added a comment -

          > There is definitely still an issue w/ the HTMLStripReader in some circumstances, working on a test now.

          Red herring. The problem I was seeing is that some tags that are valid syntax for my text are being stripped. I think this patch can go ahead, although it might be useful to keep certain tags (i.e. a set of tags).

          Yonik Seeley added a comment -

          Hmmm, I'm still getting the exception, even after adding the file.
          Did you test with "ant clean test", or with a GUI?

          Yonik Seeley added a comment -

          To elaborate, I think the CWD when running tests under ant is "src\test\test-files", so adding that to the filename when you try to open the file seems incorrect. To match ant, I have IntelliJ set up to use src\test\test-files as the CWD when running tests.

          Grant Ingersoll added a comment -

          You are right. I forgot that the CWD is different for Solr. Feel free to strip off that initial path stuff, or I can generate a new patch.

          -Grant


          Grant Ingersoll added a comment -

          Here's a new version, with some more testing and the ability to preserve certain tags via a passed-in set. This is useful for text where some tags are meaningful or at least are useful further down the chain. I am not sure why the HTML test file is not included. svn stat shows:

          A + src/test/test-files/htmlStripReaderTest.html
          A src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
          M src/java/org/apache/solr/analysis/HTMLStripReader.java

          Yonik Seeley added a comment -

          committed. Thanks Grant!

          Grant Ingersoll added a comment -

          Hmmm, still seems to be a problem with entities. I think we need to replace the entity w/ the appropriate amount of whitespace. I will work up a test and patch.

          Yonik Seeley added a comment -

          Hmmm, this points out a deficiency in this approach... it could break up words or tokens (with whitespace) that were not originally separated (think international char in the middle of a word).
          So I think this approach is probably OK for now, but a better approach would have the tokenizer get the offsets from the reader somehow (perhaps just a whitespace tokenizer with HTML stripping integrated).
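
          A concrete example of the failure mode (assuming entities are padded with plain spaces): in "K&ouml;ln", the six-character entity collapses to one character plus five filler spaces, and a whitespace tokenizer then splits the single word in two.

          public class EntityPaddingPitfall {
              public static void main(String[] args) {
                  // "K&ouml;ln" after entity decoding with space padding:
                  // "&ouml;" (6 chars) became "\u00f6" (1 char) plus 5 spaces.
                  String padded = "K\u00f6     ln";
                  for (String token : padded.split("\\s+")) {
                      System.out.println(token); // prints "Kö" then "ln", never "Köln"
                  }
              }
          }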

          Grant Ingersoll added a comment -

          OK, here's a patch for the entity problem, but yes, I do agree Yonik, a better approach is needed.

          It might be as simple as the SolrHighlighter being aware that the HTMLStripReader is being used.

          Mike Klaas added a comment -

          > It might be as simple as the SolrHighlighter being aware that the HTMLStripReader is being used.

          Grant, thanks for looking in to this issue. What do you imagine the highlighter being able to do with that knowledge?

          Note that the entity problem is an issue for search, not just highlighting. The proper tokens will not be created in the case of international chars.

          Grant Ingersoll added a comment -

          > What do you imagine the highlighter being able to do with that knowledge?

          My understanding from looking at the code is that the disjoint comes from line 298. In the call to Lucene's highlighter, we pass in the TokenStream, which has been stripped (or will be stripped if the HTMLStripReader is employed) and the value from the stored field (docTexts[0]). If docTexts[0] was stripped first, then I think the offsets would be the same, no? Of course, it would be really easy to test.

          Of course, the real answer may be as suggested earlier and to apply the stripreader before sending to Solr.
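
          A rough sketch of that suggestion (hypothetical glue code, not the actual SolrHighlighter): run the stored value through the same HTMLStripReader before handing it to the Lucene highlighter, so the token offsets and the highlighted text are both in post-strip coordinates.

          import java.io.IOException;
          import java.io.Reader;
          import java.io.StringReader;
          import org.apache.solr.analysis.HTMLStripReader;

          class PreStripSketch {
              // Strip the stored field value the same way the analyzer does, so the
              // offsets computed on stripped text match the text being highlighted.
              static String stripForHighlighting(String storedValue) throws IOException {
                  Reader r = new HTMLStripReader(new StringReader(storedValue));
                  StringBuilder sb = new StringBuilder(storedValue.length());
                  int ch;
                  while ((ch = r.read()) != -1) {
                      sb.append((char) ch);
                  }
                  return sb.toString(); // pass this to the highlighter instead of docTexts[0]
              }
          }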

          Mike Klaas added a comment -

          > Of course, the real answer may be as suggested earlier and to apply the stripreader before sending to Solr.

          HTMLStripTokenizer currently breaks the tokenizer contract, so it seems like the real answer is to fix the offsets. I've glanced at the code, and it would be a significant amount of work to make the current implementation adhere to this contract. The main problem is that no one is really interested in doing this work.

          > > What do you imagine the highlighter being able to do with that knowledge?

          > My understanding from looking at the code is that the disjoint comes from line 298. In the call to Lucene's highlighter, we pass in the TokenStream, which has been stripped (or will be stripped if the HTMLStripReader is employed) and the value from the stored field (docTexts[0]). If docTexts[0] was stripped first, then I think the offsets would be the same, no? Of course, it would be really easy to test.

          You know, this is incredibly hacky, but I think that it is a great idea.

          -Mike

          Hoss Man added a comment -

          I don't know much about unicode, but there are so many special characters in unicode, i just have to wonder if there is a special marker character that could be used instead of whitespace to "fill in the gaps" left when converting entities to real characters (or stripping tags) ... something that isn't printable, and doesn't trigger any "boundary" logic (ie: not whitespace, punctuation, letter, digit, etc...)

          • NUL perhaps? (can you legally embed null in a string in java?)
          • does anyone understand the definition of a "nonspacing mark"?
          • the "Invisible Separator" character?
          • a "private use" character? (this actually seems like the most promising option)

          I say we just punt: have two options that allow users to specify characters: one for when tags are stripped, one for when entities are converted to normal characters ... default both to an empty string (ie: current behavior)

          Yonik Seeley added a comment -

          > i just have to wonder if there is a special marker character that could be used instead of whitespace

          For a hack, not a bad idea...
          There could be a TokenFilter that removes any such characters in tokens, and it could even be automatically used by Tokenizers that use the html strip reader.

          Hoss Man added a comment -

          Hmmm.... I was assuming this would be an option on both the HTMLStripReader and the Tokenizers that use it (the tokenizers taking the option only to pass it on to the Reader) but i see what you mean ... once the Tokenizer knows the character positions of the "words" coming out of the text, it can then strip out those characters (Hmmm... strip the characters that are a placeholder for when other characters were already stripped ... why do i feel like we're going to go to hell for this?)

          If there were characters that we were certain would never appear in any unicode string, we could do it all under the covers by picking one of them ... but the safest thing to do would still be to have it as an option (but with a sensible default from the "private use" range instead of an empty string). ...

          So the HtmlStripReader would have a constructor that looks like...

             /**
              * @param entityFiller character to replace gaps made when entities are collapsed to real characters so that character positions still line up, may be null if no filler should be used
              * @param tagFiller character to replace gaps made when tags are stripped so that character positions still line up, may be null if no filler should be used
              */
            public HtmlStripReader(Reader input, Character entityFiller, Character tagFiller) { ... }
          

          and the Tokenizers could look like...

          public class HTMLStripStandardTokenizerFactory extends BaseTokenizerFactory {
            Pattern fillerPattern;
            Character entityFiller, tagFiller;
            public void init(...) {
              entityFiller = getInitParam(...);
              tagFiller = getInitParam(...);
              fillerPattern = getInitParam(stripFiller) ? makePattern(entityFiller, tagFiller) : null;
            }
            public TokenStream create(Reader input) {
              TokenStream s = new StandardTokenizer(new HTMLStripReader(input, entityFiller, tagFiller));
              if (null != fillerPattern) {
                s = new PatternReplaceFilter(s, fillerPattern, "", true);
              }
              return s;
            }
          }
          

          that should totally work right?

          Grant Ingersoll added a comment - edited

          Yet another problem:
          In certain circumstances, it is possible that restoreState() cannot be invoked b/c the mark has been lost due to moving well beyond it. This is most noticeable in the while (true) loop inside of readProcessingInstruction() and can be caused by the following test:

          public void testQuestionMark() throws Exception {
              StringBuilder testBuilder = new StringBuilder(5020);
              testBuilder.append("ah<?> ");
              for (int i = 0; i < 5000; i++){
                testBuilder.append('a');//tack on enough to go beyond the mark readahead limit, since <?> makes HTMLStripReader think it is a processing instruction
              }
              String test = testBuilder.toString();
              Reader reader = new HTMLStripReader(new BufferedReader(new StringReader(test)));//force the use of BufferedReader
              int ch = 0;
              StringBuilder builder = new StringBuilder();
              try {
                while ((ch = reader.read()) != -1){
                  builder.append((char)ch);
                }
              } finally {
                System.out.println("String: " + builder.toString());
              }
              assertTrue(builder.toString() + " is not equal to " + test, builder.toString().equals(test) == true);
            }
          

          In this case, the final assert never gets hit, because there is an IOException in reader.read of:

          java.io.IOException: Mark invalid
          	at java.io.BufferedReader.reset(BufferedReader.java:485)
          	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:158)
          	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:731)
          	at org.apache.solr.analysis.HTMLStripReaderTest.testQuestionMark(HTMLStripReaderTest.java:171)
          

          I will add that this is based off of text seen in the wild.

          Grant Ingersoll added a comment -

          OK, I think the solution for these pieces is to change the while (true) loops inside of HTMLStripReader to only read up to READAHEAD chars so that it can always fall back to the last mark.

          I will work up a patch.

          Grant Ingersoll added a comment -

          OK, I'm 99.99% confident this fixes the issues w/ the while (true) loops and handles entities properly, etc. and highlighting works correctly. Added more tests, etc.

          Erik Hatcher added a comment -

          patch applied, all tests still pass, and committed. Thanks Grant!

          Chris Harris added a comment -

          The committed HtmlStripReader doesn't seem to handle offsets correctly for XML processing instructions such as this:

          <?xml version="1.0" encoding="UTF-8" ?>

          I'm attaching two files:

          HtmlStripReaderTestXmlProcessing.patch adds an HtmlStripReader test case to catch the problem. (The test currently fails.)

          TokenPrinter.java can help make it a little clearer what the problem actually is. Here is the output if I run it against the analysis code in trunk. Note that the offsets are basically what one would expect, except in the XML processing instructions case, where the start position is off by one:

          -------------------------------------
          String to test: <uniqueKey>id</uniqueKey>
          Token info:
          token 'id'
          startOffset: 11
          char at startOffset, and next few: 'id</u'
          -------------------------------------
          String to test: <!-- Unless this field is marked with required="false", it will be a required field -->
          <uniqueKey>id</uniqueKey>
          Token info:
          token 'id'
          startOffset: 99
          char at startOffset, and next few: 'id</u'
          -------------------------------------
          String to test: <!-- And now: two elements --> <element1>one</element1>
          <element2>two</element2>
          Token info:
          token 'one'
          startOffset: 41
          char at startOffset, and next few: 'one</'
          token 'two'
          startOffset: 68
          char at startOffset, and next few: 'two</'
          -------------------------------------
          String to test: <?xml version="1.0" encoding="UTF-8" ?><uniqueKey>id</uniqueKey>
          Token info:
          token 'id'
          startOffset: 49
          char at startOffset, and next few: '>id</'
          -------------------------------------

          Chris Harris added a comment -

          Updating the test case to reflect the fact that offset info still gets screwed up in the XML processing instruction case even if there are no XML elements in the source XML.

          solrize added a comment -

          I am getting a ton of these errors in real documents with the 2008-07-30 nightly build. Any advice? Thanks.

          Jim Murphy added a comment -

          I've tracked down some background info on this issue - at least the way it was affecting me. I couldn't care less about highlighting - I'm using the HTMLStripWhitespaceTokenizerFactory during indexing to tokenize blog content - which obviously contains lots of HTML.

          The pathological case I've found with our input document set is:

          Content contains a malformed XML processing instruction in the first "page" of a buffer that contains more than one page of data.

          It seems this is a fairly common (maybe MS Word XML?) form of invalid HTML. Commonly it looks like this:

          ...valid html...<?xml:namespace prefix = o />...valid html...

          Notice the PI starts with "<?xml" but terminates with a close tag. Doh.

          This issue is manifested in HTMLStripReader. It causes the following code to read too much off the buffer, invalidating the previous mark at the beginning of the tag.

          private int readProcessingInstruction() throws IOException {
            // "<?" has already been read
            while ((numRead - lastMark) < readAheadLimitMinus1) {
              int ch = next();
              if (ch=='?' && peek()=='>') {
                next();
                return MATCH;
              } else if (ch==-1) {
                return MISMATCH;
              }
            }
            return MISMATCH;
          }

          The demoralizing part is that the special treatment (readAheadLimitMinus1) isn't enough. There is actually an "over-read" by 2 chars.

          The IOException - Invalid Mark - happens when readProcessingInstruction() returns (a mismatch, because the entire buffer is read without finding the close of the PI) and restoreState() is called to reset the marks - which fails.

          If I tweak readAheadLimitMinus1 like this:

          readAheadLimitMinus1 -= 2

          (so maybe the variable should be readAheadLimitMinus3), then the buffer limits are preserved, the exception isn't thrown, and parsing proceeds as expected.

          Jim

          Lance Norskog added a comment -

          A slight nit: Unicode contains three characters, FFFD, FFFE, and FFFF, which are not supposed to appear in any document. One (FFFD, I think) is the official "not a character" character, and should be used for the purpose of separating terms.

          Lance Norskog added a comment -

          Another bigger nit: the last time I studied this, the HTML stripper code did not pull alt-text fields as searchable tags.

          Francisco Sanmartín added a comment -

          Is the patch for Solr 1.3?

          I tried to apply the patch to Solr 1.2 but it fails.

          Can anybody complete the "affects version" and "fix version" fields? Thanks!

          solrize added a comment -

          This problem is still present in solr 1.3.0 and I'm getting a ton of those "mark invalid" exceptions. Is this really a minor issue?!!!!

          Mert Sakarya added a comment -

          I think this is a problem of Microsoft Word. No one can say that

          ...valid html...<?xml:namespace prefix = o />...valid html...

          is valid HTML. Any HTML parser should look for a "?>" after a "<?".

          BUT! As a solution, I modified line 644 of HTMLStripReader.java as:

          //if (ch=='?' && peek()=='>') {
          if ((ch=='?' || ch=='/') && peek()=='>') { //This fixes Office Word problem, but might cause other problems!!! Be very careful.

          And created my own HTMLStripReader in another jar file.

          Anders Melchiorsen added a comment -

          I found the issue with entity replacement breaking up words, and created SOLR-1394.

          My patch in that issue is to HTMLStripCharFilter.

          David Smiley added a comment -

          I was just reading this whole thread and SOLR-1394. It seems this issue can be closed as a duplicate of SOLR-1394. No?

          Matthias Pigulla added a comment - edited

          I don't think it's a duplicate and the issue is still unresolved at least in regard to comment-12625835 and the 1.4.1 release.

          The input string "<??>xx yy xx" will have the start offsets for xx, yy and xx at 3, 6 and 9 respectively, each off by one.

          "<? ?><? ?>xx yy xx" [spaces added between question marks for JIRA display] will even have 6, 9 and 12, that is, every "<??>" (as a special "degenerated" kind of XML PI) will shift the offset by one.

          Steve Rowe added a comment -

          Fixed by LUCENE-3690.


    People

    • Assignee: Steve Rowe
    • Reporter: Andrew May
    • Votes: 4
    • Watchers: 10
