Lucene - Core
  1. Lucene - Core
  2. LUCENE-3690

JFlex-based HTMLStripCharFilter replacement

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A JFlex-based HTMLStripCharFilter replacement would be more performant and easier to understand and maintain.

      1. BaselineWarcTest.java
        3 kB
        Steve Rowe
      2. HTMLStripCharFilterWarcTest.java
        4 kB
        Steve Rowe
      3. jenkins_test.patch
        1 kB
        Robert Muir
      4. JFlexHTMLStripCharFilterWarcTest.java
        4 kB
        Steve Rowe
      5. LUCENE-3690.patch
        2.41 MB
        Steve Rowe
      6. LUCENE-3690.patch
        2.36 MB
        Steve Rowe
      7. LUCENE-3690.patch
        924 kB
        Steve Rowe
      8. LUCENE-3690.patch
        230 kB
        Steve Rowe
      9. LUCENE-3690.patch
        229 kB
        Steve Rowe
      10. LUCENE-3690-handle-utf16-surrogates.patch
        13 kB
        Steve Rowe

        Issue Links

          Activity

          Hide
          Steve Rowe added a comment -

          Backported to branch_3x.

          Show
          Steve Rowe added a comment - Backported to branch_3x.
          Hide
          Steve Rowe added a comment -

          Thanks Robert for your help diagnosing and fixing the surrogates problem!

          Show
          Steve Rowe added a comment - Thanks Robert for your help diagnosing and fixing the surrogates problem!
          Hide
          Steve Rowe added a comment -

          Committed the fixed UTF-16 numeric character reference surrogates handling to trunk in r1234687.

          Show
          Steve Rowe added a comment - Committed the fixed UTF-16 numeric character reference surrogates handling to trunk in r1234687.
          Hide
          Steve Rowe added a comment - - edited

          Patch (excluding the re-generated .java scanner) that addresses the unpaired surrogate numeric character entity failures uncovered by random testing, by outputting REPLACEMENT CHARACTER U+FFFD, and adds the ability to interpret properly paired UTF-16 surrogates as an above-BMP codepoint. Added tests to cover all four combinations of hex & decimal surrogate numeric character entities in surrogate pairs.

          Also added @SuppressWarnings("fallthrough") to the JFlex-generated scanner class, so that the 40+ warnings about switch case fall-throughs don't clutter the output.

          Edit: committing to trunk shortly.

          Show
          Steve Rowe added a comment - - edited Patch (excluding the re-generated .java scanner) that addresses the unpaired surrogate numeric character entity failures uncovered by random testing, by outputting REPLACEMENT CHARACTER U+FFFD, and adds the ability to interpret properly paired UTF-16 surrogates as an above-BMP codepoint. Added tests to cover all four combinations of hex & decimal surrogate numeric character entities in surrogate pairs. Also added @SuppressWarnings("fallthrough") to the JFlex-generated scanner class, so that the 40+ warnings about switch case fall-throughs don't clutter the output. Edit : committing to trunk shortly.
          Hide
          Robert Muir added a comment -

          By the way: "expected" in that test is wrong. its just the same input string to trigger the assert in MockTokenizer...

          Show
          Robert Muir added a comment - By the way: "expected" in that test is wrong. its just the same input string to trigger the assert in MockTokenizer...
          Hide
          Robert Muir added a comment -

          Here's a standalone testcase for the fail from jenkins

          Show
          Robert Muir added a comment - Here's a standalone testcase for the fail from jenkins
          Hide
          Steve Rowe added a comment -

          Here is the final patch.

          sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

          right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice...

          https://svn.apache.org/viewvc?view=revision&revision=658003

          I took the same approach - here are the changes from the previous version of the patch:

          1. The previous HTMLStripCharFilter implementation is moved to Solr, renamed to LegacyHTMLStripCharFilter, and deprecated, and a Factory is added for it.
          2. JFlexHTMLStripCharFilter is renamed to HTMLStripCharFilter.
          3. Support for HTMLStripCharFilter's "escapedTags" functionality is added to HTMLStripCharFilterFactory.
          4. Added TestHTMLStripCharFilterFactory.
          5. Solr and Lucene CHANGES.txt entries are added.

          Run the following svn copy script before applying the patch:

          svn cp modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilter.java
          svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/htmlStripReaderTest.html solr/core/src/test/org/apache/solr/analysis/
          svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterTest.java solr/core/src/test/org/apache/solr/analysis/LegacyHTMLStripCharFilterTest.java
          svn cp solr/core/src/java/org/apache/solr/analysis/HTMLStripCharFilterFactory.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilterFactory.java
          

          I plan to commit to trunk shortly, then backport and commit to branch_3x.

          Show
          Steve Rowe added a comment - Here is the final patch. sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice... https://svn.apache.org/viewvc?view=revision&revision=658003 I took the same approach - here are the changes from the previous version of the patch: The previous HTMLStripCharFilter implementation is moved to Solr, renamed to LegacyHTMLStripCharFilter , and deprecated, and a Factory is added for it. JFlexHTMLStripCharFilter is renamed to HTMLStripCharFilter . Support for HTMLStripCharFilter 's "escapedTags" functionality is added to HTMLStripCharFilterFactory . Added TestHTMLStripCharFilterFactory . Solr and Lucene CHANGES.txt entries are added. Run the following svn copy script before applying the patch: svn cp modules/analysis/common/src/java/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilter.java svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/htmlStripReaderTest.html solr/core/src/test/org/apache/solr/analysis/ svn cp modules/analysis/common/src/test/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterTest.java solr/core/src/test/org/apache/solr/analysis/LegacyHTMLStripCharFilterTest.java svn cp solr/core/src/java/org/apache/solr/analysis/HTMLStripCharFilterFactory.java solr/core/src/java/org/apache/solr/analysis/LegacyHTMLStripCharFilterFactory.java I plan to commit to trunk shortly, then backport and commit to branch_3x.
          Hide
          Steve Rowe added a comment -

          AFAICT, SOLR-2891 will be fixed by this implementation.

          I misspoke, having misread that issue - despite the reference to HTMLStripCharFilter in the most recent comment on the issue, SOLR-2891 is not about HTMLStripCharFilter.

          Show
          Steve Rowe added a comment - AFAICT, SOLR-2891 will be fixed by this implementation. I misspoke, having misread that issue - despite the reference to HTMLStripCharFilter in the most recent comment on the issue, SOLR-2891 is not about HTMLStripCharFilter .
          Hide
          Hoss Man added a comment -

          +1 ... to everything.

          Show
          Hoss Man added a comment - +1 ... to everything.
          Hide
          Steve Rowe added a comment -

          some test docs using typical wellformed html markup

          I have access to ClueWeb09. For performance testing I used the first WARC file for the English and Chinese languages (en0000/00.warc.gz and zh0000/00.warc.gz), each of which when uncompressed contains about 1GB of text (including a small amount of non-HTML metadata: WARC information and HTTP headers). The English WARC contains about 35,000 documents from about 2,100 unique domains. The Chinese WARC contains about 33,000 documents from about 550 unique domains.

          I compared JFlexHTMLStripCharFilter's output with that of HTMLStripCharFilter for several hundred documents. In the course of this comparison, I found several problems with the JFlex implementation (e.g. no <STYLE> tag handling; no MS conditional tag handling, e.g. <![if ! IE]>; and some problems handling creative attribute values), which the attached patch fixes. I re-ran the text-only and malformed HTML performance tests on the final implementation, and the numbers aren't significantly different from those prior to these fixes. The new patch also contains the more-evil _TestUtils.randomHtmlishString(); shifts the CharFilter javadocs from BaseCharFilter.addOffCorrectMapping() to package.html; and adds several more tests to JFlexHTMLStripCharFilterTest.java.

          I have attached the three classes I used to test performance over the ClueWeb09 subset. BaselineWarcTest.java uses the WarcRecord class supplied with the ClueWeb09 collection to read the compressed WARC files; looks for a declared charset first in each document's content in the Content-Type <meta> tag, and then in the HTTP header; feeds this charset, if any, to the ICU4J charset detector, which instantiates a Reader using the detected charset; and then read()'s all content. The other two classes add the respective CharFilter on top of BaselineWarcTest's functionality.

          The performance numbers (best of 5 trials):

          Language Baseline Classic JFlex
          English 156s 179s 171s
          Chinese 155s 180s 172s

          Excluding charset detection and I/O (measured by BaselineWarcTest), JFlexHTMLStripCharFilter appears to improve on HTMLStripCharFilter's throughput by about 50% in both languages.

          I found a few problems with HTMLStripCharFilter:

          1. The following exception was thrown for six of the English documents:
            java.io.IOException: Mark invalid
                    at java.io.BufferedReader.reset(BufferedReader.java:485)
                    at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69)
                    at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171)
                    at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734)
                    at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86)
            
          2. &apos; is not decoded.
          3. Content between some <script> tags is not stripped out.
          4. Unbalanced quotation marks in opening tags cause the tag to not be stripped out.

          Left to do:

          1. Rename HTMLStripCharFilter to ClassicHTMLStripCharFilter; move it to Solr o.a.s.analysis package; deprecate it; and create a new Solr Factory for it.
          2. Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter.
          3. Commit to trunk
          4. Backport and commit to branch_3x.
          Show
          Steve Rowe added a comment - some test docs using typical wellformed html markup I have access to ClueWeb09 . For performance testing I used the first WARC file for the English and Chinese languages ( en0000/00.warc.gz and zh0000/00.warc.gz ), each of which when uncompressed contains about 1GB of text (including a small amount of non-HTML metadata: WARC information and HTTP headers). The English WARC contains about 35,000 documents from about 2,100 unique domains. The Chinese WARC contains about 33,000 documents from about 550 unique domains. I compared JFlexHTMLStripCharFilter 's output with that of HTMLStripCharFilter for several hundred documents. In the course of this comparison, I found several problems with the JFlex implementation (e.g. no <STYLE> tag handling; no MS conditional tag handling, e.g. <![if ! IE]> ; and some problems handling creative attribute values), which the attached patch fixes. I re-ran the text-only and malformed HTML performance tests on the final implementation, and the numbers aren't significantly different from those prior to these fixes. The new patch also contains the more-evil _TestUtils.randomHtmlishString() ; shifts the CharFilter javadocs from BaseCharFilter.addOffCorrectMapping() to package.html ; and adds several more tests to JFlexHTMLStripCharFilterTest.java . I have attached the three classes I used to test performance over the ClueWeb09 subset. BaselineWarcTest.java uses the WarcRecord class supplied with the ClueWeb09 collection to read the compressed WARC files; looks for a declared charset first in each document's content in the Content-Type <meta> tag, and then in the HTTP header; feeds this charset, if any, to the ICU4J charset detector, which instantiates a Reader using the detected charset; and then read()'s all content. The other two classes add the respective CharFilter on top of BaselineWarcTest 's functionality. The performance numbers (best of 5 trials): Language Baseline Classic JFlex English 156s 179s 171s Chinese 155s 180s 172s Excluding charset detection and I/O (measured by BaselineWarcTest ), JFlexHTMLStripCharFilter appears to improve on HTMLStripCharFilter 's throughput by about 50% in both languages. I found a few problems with HTMLStripCharFilter : The following exception was thrown for six of the English documents: java.io.IOException: Mark invalid at java.io.BufferedReader.reset(BufferedReader.java:485) at org.apache.lucene.analysis.CharReader.reset(CharReader.java:69) at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.restoreState(HTMLStripCharFilter.java:171) at org.apache.lucene.analysis.charfilter.HTMLStripCharFilter.read(HTMLStripCharFilter.java:734) at HTMLStripCharFilterWarcTest.main(HTMLStripCharFilterWarcTest.java:86) &apos; is not decoded. Content between some <script> tags is not stripped out. Unbalanced quotation marks in opening tags cause the tag to not be stripped out. Left to do: Rename HTMLStripCharFilter to ClassicHTMLStripCharFilter ; move it to Solr o.a.s.analysis package; deprecate it; and create a new Solr Factory for it. Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter. Commit to trunk Backport and commit to branch_3x.
          Hide
          Steve Rowe added a comment - - edited

          some test docs that contain almost no HTML at all

          I used the following to test the zero-markup case in both test classes (the JFlexHTMLStripCharFilter version is given below):

            public void testRandomText() throws Exception {
              StringBuilder text = new StringBuilder();
              int minNumWords = 10;
              int maxNumWords = 10000;
              int minWordLength = 3;
              int maxWordLength = 20;
              int numWords = _TestUtil.nextInt(random, minNumWords, maxNumWords);
              switch (_TestUtil.nextInt(random, 0, 4)) {
                case 0: {
                  for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
                    text.append(_TestUtil.randomUnicodeString(random, maxWordLength));
                    text.append(' ');
                  }
                  break;
                }
                case 1: {
                  for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
                    text.append(_TestUtil.randomRealisticUnicodeString
                        (random, minWordLength, maxWordLength));
                    text.append(' ');
                  }
                  break;
                }
                default: { // ASCII 50% of the time
                  for (int wordNum = 0 ; wordNum < numWords ; ++wordNum) {
                    text.append(_TestUtil.randomSimpleString(random));
                    text.append(' ');
                  }
                }
              }
              Reader reader = new JFlexHTMLStripCharFilter
                  (CharReader.get(new StringReader(text.toString())));
              while (reader.read() != -1);
            }
          

          The results for 10K iterations (best/worst of 5):

          • HTMLStripCharFilter: best: 23.7 sec, worst: 24.8 sec
          • JFlexHTMLStripCharFilter: best: 22.0 sec, worst: 24.7 sec
          Show
          Steve Rowe added a comment - - edited some test docs that contain almost no HTML at all I used the following to test the zero-markup case in both test classes (the JFlexHTMLStripCharFilter version is given below): public void testRandomText() throws Exception { StringBuilder text = new StringBuilder(); int minNumWords = 10; int maxNumWords = 10000; int minWordLength = 3; int maxWordLength = 20; int numWords = _TestUtil.nextInt(random, minNumWords, maxNumWords); switch (_TestUtil.nextInt(random, 0, 4)) { case 0: { for ( int wordNum = 0 ; wordNum < numWords ; ++wordNum) { text.append(_TestUtil.randomUnicodeString(random, maxWordLength)); text.append(' '); } break ; } case 1: { for ( int wordNum = 0 ; wordNum < numWords ; ++wordNum) { text.append(_TestUtil.randomRealisticUnicodeString (random, minWordLength, maxWordLength)); text.append(' '); } break ; } default : { // ASCII 50% of the time for ( int wordNum = 0 ; wordNum < numWords ; ++wordNum) { text.append(_TestUtil.randomSimpleString(random)); text.append(' '); } } } Reader reader = new JFlexHTMLStripCharFilter (CharReader.get( new StringReader(text.toString()))); while (reader.read() != -1); } The results for 10K iterations (best/worst of 5): HTMLStripCharFilter : best: 23.7 sec, worst: 24.8 sec JFlexHTMLStripCharFilter : best: 22.0 sec, worst: 24.7 sec
          Hide
          Steve Rowe added a comment -

          I'm going to increase the evilness of _TestUtil.randomHtmlishString() and re-run to see if the numbers shift.

          I did this, and it uncovered a bug in handling of Server Side Includes in JFlexHTMLStripCharFilter - hooray for evil tests.

          The timings, this time for 10K iterations:

          • HTMLStripCharFilter: best: 48.6 sec, worst: 49.7 sec
          • JFlexHTMLStripCharFilter: best: 15.5 sec, worst: 17.3 sec
          Show
          Steve Rowe added a comment - I'm going to increase the evilness of _TestUtil.randomHtmlishString() and re-run to see if the numbers shift. I did this, and it uncovered a bug in handling of Server Side Includes in JFlexHTMLStripCharFilter - hooray for evil tests. The timings, this time for 10K iterations: HTMLStripCharFilter : best: 48.6 sec, worst: 49.7 sec JFlexHTMLStripCharFilter : best: 15.5 sec, worst: 17.3 sec
          Hide
          Steve Rowe added a comment -

          some test docs using malformed markup that require backtracking

          I set up a quick test for this in both of the test classes using output from the synthetic broken HTML generator o.a.l.util._TestUtil.randomHtmlishString() and ran 100K iterations of it - here's the HTMLStripCharFilterTest version:

            public void testRandomBrokenHTML() throws Exception {
              int maxNumElements = 10000;
              String text = _TestUtil.randomHtmlishString(random, maxNumElements);
              Reader reader
                  = new HTMLStripCharFilter(CharReader.get(new StringReader(text)));
              while (reader.read() != -1);
            }
          

          Best/worst of 5 (as reported by Ant for the individual test, rather than for the entire Ant invocation):

          • HTMLStripCharFilter: best: 73.4 sec, worst: 76.5 sec
          • JFlexHTMLStripCharFilter: best: 73.5 sec, worst: 76.0 sec

          I'm going to increase the evilness of _TestUtil.randomHtmlishString() and re-run to see if the numbers shift.

          Show
          Steve Rowe added a comment - some test docs using malformed markup that require backtracking I set up a quick test for this in both of the test classes using output from the synthetic broken HTML generator o.a.l.util._TestUtil.randomHtmlishString() and ran 100K iterations of it - here's the HTMLStripCharFilterTest version: public void testRandomBrokenHTML() throws Exception { int maxNumElements = 10000; String text = _TestUtil.randomHtmlishString(random, maxNumElements); Reader reader = new HTMLStripCharFilter(CharReader.get( new StringReader(text))); while (reader.read() != -1); } Best/worst of 5 (as reported by Ant for the individual test, rather than for the entire Ant invocation): HTMLStripCharFilter : best: 73.4 sec, worst: 76.5 sec JFlexHTMLStripCharFilter : best: 73.5 sec, worst: 76.0 sec I'm going to increase the evilness of _TestUtil.randomHtmlishString() and re-run to see if the numbers shift.
          Hide
          Hoss Man added a comment -

          sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

          right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice...

          https://svn.apache.org/viewvc?view=revision&revision=658003

          Show
          Hoss Man added a comment - sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation right, this is what we did with DateField a while back, note the CHANGES.txt entry on r658003. now that we have luceneMatchVersion though i kind of go back and forth on when to use it to pick an impl vs when to do stuff like this. dealers choice... https://svn.apache.org/viewvc?view=revision&revision=658003
          Hide
          Yonik Seeley added a comment -

          In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this.

          I see you've done this. Purposely breaking the build really isn't necessary to make a point about a known bug.

          Show
          Yonik Seeley added a comment - In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this. I see you've done this. Purposely breaking the build really isn't necessary to make a point about a known bug.
          Hide
          Steve Rowe added a comment -

          From #lucene-dev IRC:

          yonik_: sarowe: if we change the name of the current html strip stuff, it seems like doing anything with luceneMatchVersion isn't needed (or overkill?)

          sarowe: but then people will need to take action to use the (non-broken) new impl
          me no likey

          yonik_: but if the name is changed, they won't use it by accident

          sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation

          yonik_: yeah

          sarowe: via a new/different name
          I'm definitely okay with that

          Show
          Steve Rowe added a comment - From #lucene-dev IRC : yonik_: sarowe: if we change the name of the current html strip stuff, it seems like doing anything with luceneMatchVersion isn't needed (or overkill?) sarowe: but then people will need to take action to use the (non-broken) new impl me no likey yonik_: but if the name is changed, they won't use it by accident sarowe: oh, you mean: don't even attempt back-compat - just provide the ability to use the previous implementation yonik_: yeah sarowe: via a new/different name I'm definitely okay with that
          Hide
          Steve Rowe added a comment -

          I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)

          I added tests with the following snippets:

          one<script><!-- <!--#config comment="<!-- \"comment\"-->"--> --></script>two
          
          => 'one
          two'
          
          one<script attr= bare><!-- action('<!-- comment -->', "\"-->\""); --></script>two
          
          => 'one
          two'
          
          hello<script><!-- f('<!--internal--></script>'); --></script>
          
          => 'hello
          '
          

          One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain.

          I agree - the "right" thing IMHO is: get as much content from as varied a range of sources as possible, and never ever allow the input to bork processing.

          The cost to keeping around the current version for a little while seems minimal.

          My proposal would do this, though under a different name, and requiring a luceneMatchVersion of 3.5 or earlier for the factory to use it. Do you object to this?

          I have two issues with not switching over now:

          1. It's a chicken and egg problem: how will people know if there is a problem with the new implementation if they don't use it?
          2. The current version has several long standing bugs that nobody has fixed. I personally wouldn't attempt it with the current implementation: it's difficult to understand. This is one of my main motivations for making this new version: when people find issues, fixing them should be much easier with the new implementation.
          Show
          Steve Rowe added a comment - I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.) I added tests with the following snippets: one<script><!-- <!--#config comment="<!-- \"comment\"-->"--> --></script>two => 'one two' one<script attr= bare><!-- action('<!-- comment -->', "\"-->\""); --></script>two => 'one two' hello<script><!-- f('<!--internal--></script>'); --></script> => 'hello ' One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain. I agree - the "right" thing IMHO is: get as much content from as varied a range of sources as possible, and never ever allow the input to bork processing. The cost to keeping around the current version for a little while seems minimal. My proposal would do this, though under a different name, and requiring a luceneMatchVersion of 3.5 or earlier for the factory to use it. Do you object to this? I have two issues with not switching over now: It's a chicken and egg problem: how will people know if there is a problem with the new implementation if they don't use it? The current version has several long standing bugs that nobody has fixed. I personally wouldn't attempt it with the current implementation: it's difficult to understand. This is one of my main motivations for making this new version: when people find issues, fixing them should be much easier with the new implementation.
          Hide
          Yonik Seeley added a comment -

          The tests for the new implementation are a superset of the old implementation's tests.

          Unfortunately I'm not sure how much of a story the tests tell (and yes, that would be my fault
          My memory is rusty, but back in '05 when I coded this thing, I threw a lot stuff we had lying around CNET at it, and also a lot of stuff downloaded from the web (which I couldn't just copy-n-paste into a unit test obviously). I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.)

          I welcome more examples of junk HTML to add to the tests

          Not saying the new one isn't great (and matching a lot of crap from the old one is quite an achievement).
          One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain.
          The cost to keeping around the current version for a little while seems minimal.

          Show
          Yonik Seeley added a comment - The tests for the new implementation are a superset of the old implementation's tests. Unfortunately I'm not sure how much of a story the tests tell (and yes, that would be my fault My memory is rusty, but back in '05 when I coded this thing, I threw a lot stuff we had lying around CNET at it, and also a lot of stuff downloaded from the web (which I couldn't just copy-n-paste into a unit test obviously). I had a heck of a time handling all the weird stuff that could appear inside script tags, for example, and I don't think I see much of a test for that (again... my fault.) I welcome more examples of junk HTML to add to the tests Not saying the new one isn't great (and matching a lot of crap from the old one is quite an achievement). One can be sure that the current implementation doesn't always do the right thing, but unfortunately "right" isn't well defined here considering the domain. The cost to keeping around the current version for a little while seems minimal.
          Hide
          Hoss Man added a comment -

          +1. I'll do this before committing anything.

          i wouldn't be shy about committing the new impl + tests, i would just wait to change the solr factory default behavior until we prove the perf is as good as the existing one in some common cases, and if it's not, then re-evaluate the names of the classes.

          and by common cases i'm thinking...

          • some test docs using typical wellformed html markup
          • some test docs using malformed markup that require backtracking
          • some test docs that contain almost no HTML at all (this is the one that i have a hunch may be a big differentiator – i've seen lots of people who use the HTML stripper not becuase they expect HTML, but because they want to be sure it doesn't get indexed if some stray html encoding sneaks into their data)
          Show
          Hoss Man added a comment - +1. I'll do this before committing anything. i wouldn't be shy about committing the new impl + tests, i would just wait to change the solr factory default behavior until we prove the perf is as good as the existing one in some common cases, and if it's not, then re-evaluate the names of the classes. and by common cases i'm thinking... some test docs using typical wellformed html markup some test docs using malformed markup that require backtracking some test docs that contain almost no HTML at all (this is the one that i have a hunch may be a big differentiator – i've seen lots of people who use the HTML stripper not becuase they expect HTML, but because they want to be sure it doesn't get indexed if some stray html encoding sneaks into their data)
          Hide
          Dawid Weiss added a comment -

          I welcome more examples of junk HTML to add to the tests

          Julien Nioche (Nutch) may have access to realistic large HTML crawls... there's nothing wilder and weirder than real-life HTML

          Show
          Dawid Weiss added a comment - I welcome more examples of junk HTML to add to the tests Julien Nioche (Nutch) may have access to realistic large HTML crawls... there's nothing wilder and weirder than real-life HTML
          Hide
          Steve Rowe added a comment -

          The only concern i really have is...

          A JFlex-based HTMLStripCharFilter replacement would be more performant...

          ..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ...

          +1. I'll do this before committing anything.

          Show
          Steve Rowe added a comment - The only concern i really have is... A JFlex-based HTMLStripCharFilter replacement would be more performant... ..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ... +1. I'll do this before committing anything.
          Hide
          Steve Rowe added a comment -

          are we really confident that this implementation handles everything the current one does?

          Yes, this implementation's bad/partial/weird HTML handling capabilities are a superset of the current implementation. The tests for the new implementation are a superset of the old implementation's tests.

          I welcome more examples of junk HTML to add to the tests

          Show
          Steve Rowe added a comment - are we really confident that this implementation handles everything the current one does? Yes, this implementation's bad/partial/weird HTML handling capabilities are a superset of the current implementation. The tests for the new implementation are a superset of the old implementation's tests. I welcome more examples of junk HTML to add to the tests
          Hide
          Robert Muir added a comment -

          Let's be conservative and keep around HTMLStripCharFilter, and name this one something else?
          The original was meant to handle all sorts of bad, partial, and weird input.

          I call bullshit. It fails on all kinds of bad/partial/and weird input.

          In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this.

          Show
          Robert Muir added a comment - Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input. I call bullshit. It fails on all kinds of bad/partial/and weird input. In fact ill go remove my @Ignore now. Lets see if anyone can fix the bugs in it. That should settle this.
          Hide
          Yonik Seeley added a comment -

          Let's be conservative and keep around HTMLStripCharFilter, and name this one something else?
          The original was meant to handle all sorts of bad, partial, and weird input.
          Or are we really confident that this implementation handles everything the current one does?

          Show
          Yonik Seeley added a comment - Let's be conservative and keep around HTMLStripCharFilter, and name this one something else? The original was meant to handle all sorts of bad, partial, and weird input. Or are we really confident that this implementation handles everything the current one does?
          Hide
          Hoss Man added a comment -

          I don't think fixing offsets bugs like LUCENE-2208 counts as breaking index backwards compat, because it won't change search results.

          It will just prevent highlighters from throwing exceptions.

          FWIW: If i understand the issue correctly, then the one risk i can imagine here is that people don't reindex, and get the new behavior for new docs, so they'll get diff behavior are query time depending on when the doc is re-indexed. that seems significant enough to definitely warrant the luceneMatchVersion toggle sarowe has on his todo list – which seems fairly straight forward.

          The only concern i really have is...

          A JFlex-based HTMLStripCharFilter replacement would be more performant...

          ..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ... because if it winds up being noticible slower in some cases, some people may prefer the the "classic" mode to the JFlex mode if the "warts" of the existing one don't affect them – in which case i might almost suggest actually using multiple factories in solr instead of making it versionMatch dependent.

          (fingers crossed it's a non-issue)

          Show
          Hoss Man added a comment - I don't think fixing offsets bugs like LUCENE-2208 counts as breaking index backwards compat, because it won't change search results. It will just prevent highlighters from throwing exceptions. FWIW: If i understand the issue correctly, then the one risk i can imagine here is that people don't reindex, and get the new behavior for new docs, so they'll get diff behavior are query time depending on when the doc is re-indexed. that seems significant enough to definitely warrant the luceneMatchVersion toggle sarowe has on his todo list – which seems fairly straight forward. The only concern i really have is... A JFlex-based HTMLStripCharFilter replacement would be more performant... ..before deprecating "ClassicHTMLStripCharFilter" we should actually test that the Jlex version is faster ... because if it winds up being noticible slower in some cases, some people may prefer the the "classic" mode to the JFlex mode if the "warts" of the existing one don't affect them – in which case i might almost suggest actually using multiple factories in solr instead of making it versionMatch dependent. (fingers crossed it's a non-issue)
          Hide
          Steve Rowe added a comment -

          Forgot to mention two more change in the latest patch from the previous version:

          7. The generated scanner's parse method now returns the next available character if one is available; this simplifies/clarifies the processing flow. (Previously the parse method returned an enum indicating whether to copy a char directly from the input buffer – OutputSource.DIRECT – or from another location – OutputSource.INDIRECT – the read() method would then have to go fetch the next character from the given source.)
          8. Renamed the generated scanner's parse method from the default yylex() to nextChar().

          Show
          Steve Rowe added a comment - Forgot to mention two more change in the latest patch from the previous version: 7. The generated scanner's parse method now returns the next available character if one is available; this simplifies/clarifies the processing flow. (Previously the parse method returned an enum indicating whether to copy a char directly from the input buffer – OutputSource.DIRECT – or from another location – OutputSource.INDIRECT – the read() method would then have to go fetch the next character from the given source.) 8. Renamed the generated scanner's parse method from the default yylex() to nextChar() .
          Hide
          Steve Rowe added a comment - - edited

          This patch contains a feature-complete version. Changes from the previous patch:

          1. Now substituting newlines instead of spaces for block-level elements; this corresponds more closely to on-screen layout, enables sentence segmentation, and doesn't change offsets.
          2. Supplementary characters are now accepted in tags.
          3. Switched accepted tag names from [:XID_Start:] and [:XID_Continue:] Unicode properties to the more relaxed [:ID_Start:] and [:ID_Continue:] properties, in order to broaden the range of recognizable input. (The improved security afforded by the XID_* properties is irrelevant to what a CharFilter does.)
          4. Uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) are now accepted.
          5. MS-Word-generated broken processing instructions (<? ... /> instead of <? ... ?>) are now accepted.
          6. Added several tests, including parsing a full MS-Word-2010-generated HTML file.

          Left to do:

          1. Move javadocs from BaseCharFilter.addOffCorrectMap() to o.a.l.analysis.charfilter package level javadoc file.
          2. Rename the existing HTMLStripCharFilter to ClassicHTMLStripCharFilter; move it to Solr o.a.s.analysis package; deprecate it; and make it package private.
          3. Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter.
          4. Enable Solr back-compat (but not Lucene back-compat, since HTMLStripCharFilter has never been released as part of Lucene) by making HTMLStripCharFilterFactory instantiate ClassicHTMLStripCharFilter if the luceneMatchVersion parameter is LUCENE_35 or earlier, and otherwise instantiate the new HTMLStripCharFilter.
          Show
          Steve Rowe added a comment - - edited This patch contains a feature-complete version. Changes from the previous patch: Now substituting newlines instead of spaces for block-level elements; this corresponds more closely to on-screen layout, enables sentence segmentation, and doesn't change offsets. Supplementary characters are now accepted in tags. Switched accepted tag names from [:XID_Start:] and [:XID_Continue:] Unicode properties to the more relaxed [:ID_Start:] and [:ID_Continue:] properties, in order to broaden the range of recognizable input. (The improved security afforded by the XID_* properties is irrelevant to what a CharFilter does.) Uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) are now accepted. MS-Word-generated broken processing instructions ( <? ... /> instead of <? ... ?> ) are now accepted. Added several tests, including parsing a full MS-Word-2010-generated HTML file. Left to do: Move javadocs from BaseCharFilter.addOffCorrectMap() to o.a.l.analysis.charfilter package level javadoc file. Rename the existing HTMLStripCharFilter to ClassicHTMLStripCharFilter ; move it to Solr o.a.s.analysis package; deprecate it; and make it package private. Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter . Enable Solr back-compat (but not Lucene back-compat, since HTMLStripCharFilter has never been released as part of Lucene) by making HTMLStripCharFilterFactory instantiate ClassicHTMLStripCharFilter if the luceneMatchVersion parameter is LUCENE_35 or earlier, and otherwise instantiate the new HTMLStripCharFilter .
          Hide
          Robert Muir added a comment -

          There are differences in the behavior, but I guess all of these could be characterized as bug fixes:

          This sounds awesome!

          Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?

          I think so.

          If you want to, you can make it package-private + deprecated inside o.a.solr.analysis. The Solr factory could respect
          the luceneMatchVersion parameter for backwards compatibility, and instantiate the old version in that case.

          So you would then move the old charfilter from modules/analysis to the same place in trunk, too... its purely a solr
          back-compat issue. For lucene users its just a new feature and they don't see any of this.

          This is what we did with Synonyms.

          Show
          Robert Muir added a comment - There are differences in the behavior, but I guess all of these could be characterized as bug fixes: This sounds awesome! Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter? I think so. If you want to, you can make it package-private + deprecated inside o.a.solr.analysis. The Solr factory could respect the luceneMatchVersion parameter for backwards compatibility, and instantiate the old version in that case. So you would then move the old charfilter from modules/analysis to the same place in trunk, too... its purely a solr back-compat issue. For lucene users its just a new feature and they don't see any of this. This is what we did with Synonyms.
          Hide
          Steve Rowe added a comment -

          AFAICT, SOLR-2891 will be fixed by this implementation.

          Show
          Steve Rowe added a comment - AFAICT, SOLR-2891 will be fixed by this implementation.
          Hide
          Steve Rowe added a comment -

          This would be the first back-compat enabled CharFilter.

          What would be the motivation?

          There are differences in the behavior, but I guess all of these could be characterized as bug fixes:

          1. Supplementary characters in tags will be recognized. The old version doesn't do this.
          2. CDATA sections are recognized. The old version doesn't; people have requested this, e.g. http://www.lucidimagination.com/search/document/48fcd906e39764ec#48fcd906e39764ec)
          3. No space is substituted for inline tags (e.g. <b>, <i>, <span>). The old version substitutes spaces for all tags; people have complained e.g. on SOLR-1343
          4. Broken MS-Word-generated processing instructions <? ... /> will be handled.
          5. Uppercase character entities "quot", "copy", "gt", "lt", "reg", and "amp" will be recognized (from Dawid Weiss's SOLR-882 patch); the old version doesn't do this.

          Are there some features of the previous one that don't make sense in this implementation?

          No, not as far as I can tell. I think all features of the previous one are included.

          Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene. from lucene's perspective, its a new feature.

          Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?

          Show
          Steve Rowe added a comment - This would be the first back-compat enabled CharFilter. What would be the motivation? There are differences in the behavior, but I guess all of these could be characterized as bug fixes: Supplementary characters in tags will be recognized. The old version doesn't do this. CDATA sections are recognized. The old version doesn't; people have requested this, e.g. http://www.lucidimagination.com/search/document/48fcd906e39764ec#48fcd906e39764ec ) No space is substituted for inline tags (e.g. <b> , <i> , <span> ). The old version substitutes spaces for all tags; people have complained e.g. on SOLR-1343 Broken MS-Word-generated processing instructions <? ... /> will be handled. Uppercase character entities "quot", "copy", "gt", "lt", "reg", and "amp" will be recognized (from Dawid Weiss's SOLR-882 patch); the old version doesn't do this. Are there some features of the previous one that don't make sense in this implementation? No, not as far as I can tell. I think all features of the previous one are included. Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene. from lucene's perspective, its a new feature. Good point. Should I make it a new Lucene feature on 3.X? That is, should I remove Solr's HTMLStripCharFilter and have it refer to a new Lucene HTMLStripCharFilter?
          Hide
          Steve Rowe added a comment -

          Here's another to-do, from comment-12625835 in SOLR-42: handle MS-Word-generated broken processing instructions <? ... /> (instead of <? ... ?>.

          Show
          Steve Rowe added a comment - Here's another to-do, from comment-12625835 in SOLR-42 : handle MS-Word-generated broken processing instructions <? ... /> (instead of <? ... ?> .
          Hide
          Robert Muir added a comment -

          Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene.

          from lucene's perspective, its a new feature.

          Show
          Robert Muir added a comment - Also I don't know how Version etc would work here, since the old HtmlStripCharFilter was never part of lucene. from lucene's perspective, its a new feature.
          Hide
          Robert Muir added a comment -

          This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?

          What would be the motivation? Are there some features of the previous one that don't make sense in this implementation?
          I don't think fixing offsets bugs like LUCENE-2208 counts as breaking index backwards compat, because it won't change search results.
          It will just prevent highlighters from throwing exceptions.

          (LUCENE-3642 and LUCENE-3668 fixed lots of offset bugs already, we didn't spend time on any back compat).

          Show
          Robert Muir added a comment - This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth? What would be the motivation? Are there some features of the previous one that don't make sense in this implementation? I don't think fixing offsets bugs like LUCENE-2208 counts as breaking index backwards compat, because it won't change search results. It will just prevent highlighters from throwing exceptions. ( LUCENE-3642 and LUCENE-3668 fixed lots of offset bugs already, we didn't spend time on any back compat).
          Hide
          Brian Meidell added a comment -

          Any idea how quickly this will be rolled into a new version?

          Show
          Brian Meidell added a comment - Any idea how quickly this will be rolled into a new version?
          Hide
          Robert Muir added a comment -

          Robert's random test is un-@Ignore'd

          +1!

          Show
          Robert Muir added a comment - Robert's random test is un-@Ignore'd +1!
          Hide
          Steve Rowe added a comment -

          Added a <![CDATA[...]]> section test, and fixed the CDATA section handling.

          Show
          Steve Rowe added a comment - Added a <![CDATA[...]]> section test, and fixed the CDATA section handling.
          Hide
          Steve Rowe added a comment -

          One other thing left to do: I added javadocs to BaseCharFilter.addOffCorrectMap(), but I think they should be moved to the package javadocs for o.a.l.analysis.charfilter, and maybe add some examples. (This package has basically zero documentation currently.)

          Show
          Steve Rowe added a comment - One other thing left to do: I added javadocs to BaseCharFilter.addOffCorrectMap() , but I think they should be moved to the package javadocs for o.a.l.analysis.charfilter , and maybe add some examples. (This package has basically zero documentation currently.)
          Hide
          Steve Rowe added a comment -

          This JFlex-based HTMLStripCharFilter replacement will fix LUCENE-2208.

          Show
          Steve Rowe added a comment - This JFlex-based HTMLStripCharFilter replacement will fix LUCENE-2208 .
          Hide
          Steve Rowe added a comment -

          Left to do:

          1. Accept supplementary characters in HTML tag names.
          2. Recognize uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch)
          3. Rename current HTMLStripCharFilter to ClassicHTMLStripCharFilter
          4. Rename JFlexStripCharFilter to HTMLStripCharFilterImpl
          5. Make a back-compat-enabling wrapper class (like StandardTokenizer) that will use ClassicHTMLStripCharFilter for Version.LUCENE_35 and earlier, and HTMLStripCharFilterImpl otherwise.

          This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?

          Show
          Steve Rowe added a comment - Left to do: Accept supplementary characters in HTML tag names. Recognize uppercase character entity variants for "quot", "copy", "gt", "lt", "reg", and "amp" (from Dawid Weiss's SOLR-882 patch) Rename current HTMLStripCharFilter to ClassicHTMLStripCharFilter Rename JFlexStripCharFilter to HTMLStripCharFilterImpl Make a back-compat-enabling wrapper class (like StandardTokenizer ) that will use ClassicHTMLStripCharFilter for Version.LUCENE_35 and earlier, and HTMLStripCharFilterImpl otherwise. This would be the first back-compat enabled CharFilter. Should the existing HTMLStripCharFilter instead vanish off the face of the earth?
          Hide
          Steve Rowe added a comment -

          Patch implementing the idea.

          This patch just creates a new class named JFlexHTMLStripCharFilter. All of the HTMLStripCharFilterTest tests are copied over to JFlexHTMLStripCharFilterTest; Robert's random test is un-@Ignore'd, and several more tests are added.

          Show
          Steve Rowe added a comment - Patch implementing the idea. This patch just creates a new class named JFlexHTMLStripCharFilter . All of the HTMLStripCharFilterTest tests are copied over to JFlexHTMLStripCharFilterTest ; Robert's random test is un- @Ignore 'd, and several more tests are added.

            People

            • Assignee:
              Steve Rowe
              Reporter:
              Steve Rowe
            • Votes:
              3 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development