some test docs using typical wellformed html markup
I have access to ClueWeb09. For performance testing I used the first WARC file for the English and Chinese languages (en0000/00.warc.gz and zh0000/00.warc.gz), each of which when uncompressed contains about 1GB of text (including a small amount of non-HTML metadata: WARC information and HTTP headers). The English WARC contains about 35,000 documents from about 2,100 unique domains. The Chinese WARC contains about 33,000 documents from about 550 unique domains.
I compared JFlexHTMLStripCharFilter's output with that of HTMLStripCharFilter for several hundred documents. In the course of this comparison, I found several problems with the JFlex implementation (e.g. no <STYLE> tag handling; no MS conditional tag handling, e.g. <![if ! IE]>; and some problems handling creative attribute values), which the attached patch fixes. I re-ran the text-only and malformed HTML performance tests on the final implementation, and the numbers aren't significantly different from those prior to these fixes. The new patch also contains the more-evil _TestUtils.randomHtmlishString(); shifts the CharFilter javadocs from BaseCharFilter.addOffCorrectMapping() to package.html; and adds several more tests to JFlexHTMLStripCharFilterTest.java.
I have attached the three classes I used to test performance over the ClueWeb09 subset. BaselineWarcTest.java uses the WarcRecord class supplied with the ClueWeb09 collection to read the compressed WARC files. For each document, it looks for a declared charset, first in the Content-Type <meta> tag in the document's content and then in the HTTP header; feeds this charset, if any, to the ICU4J charset detector, which instantiates a Reader using the detected charset; and then reads all content via read(). The other two classes add the respective CharFilter on top of BaselineWarcTest's functionality.
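The declared-charset lookup described above can be sketched roughly as follows. This is not the code from the attached BaselineWarcTest.java: the class name DeclaredCharset, the method find(), and both regular expressions are hypothetical, and the patterns are a simplification that will miss unusual markup. It only illustrates the lookup order (the <meta> tag first, then the HTTP header); the result would then be handed to the charset detector as a hint.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the attached test classes.
final class DeclaredCharset {
    // Matches charset=... inside a <meta ...> tag, e.g.
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Matches charset=... on a Content-Type line of the HTTP header block.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "^Content-Type:[^\\r\\n]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Returns the declared charset name, or null if none is declared. */
    static String find(String httpHeaders, String content) {
        // The document's own <meta> declaration is consulted first.
        Matcher m = META_CHARSET.matcher(content);
        if (m.find()) {
            return m.group(1);
        }
        // Otherwise fall back to the HTTP Content-Type header.
        m = HEADER_CHARSET.matcher(httpHeaders);
        if (m.find()) {
            return m.group(1);
        }
        return null;
    }
}
```

A null result means no hint is available, in which case the detector would have to rely on the content bytes alone.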
The performance numbers (best of 5 trials):
Excluding charset detection and I/O (measured by BaselineWarcTest), JFlexHTMLStripCharFilter appears to improve on HTMLStripCharFilter's throughput by about 50% in both languages.
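For clarity, here is a minimal sketch of how such a net comparison can be computed: take the best trial for each configuration, subtract the baseline time from both filter times, and compare what remains. The class and method names are hypothetical, and the timings in the usage note are made-up illustrative values, not the measured results.

```java
// Hypothetical helper for comparing filter throughput net of baseline cost.
final class NetThroughput {
    /** Returns the best (smallest) elapsed time across trials. */
    static long best(long... trialsMs) {
        long best = Long.MAX_VALUE;
        for (long t : trialsMs) {
            best = Math.min(best, t);
        }
        return best;
    }

    /**
     * Percentage throughput improvement of the candidate filter over the
     * reference filter, after subtracting the baseline time from both.
     */
    static double improvementPercent(long baselineMs, long referenceMs, long candidateMs) {
        long referenceNet = referenceMs - baselineMs;
        long candidateNet = candidateMs - baselineMs;
        if (referenceNet <= 0 || candidateNet <= 0) {
            throw new IllegalArgumentException("filter time must exceed baseline time");
        }
        // Both runs process the same input, so the throughput ratio is the
        // inverse of the net time ratio.
        return (referenceNet / (double) candidateNet - 1.0) * 100.0;
    }
}
```

With illustrative timings of 10,000 ms for the baseline, 25,000 ms for the reference filter, and 20,000 ms for the candidate, the net times are 15,000 ms and 10,000 ms, so improvementPercent(10000, 25000, 20000) gives a 50% throughput improvement.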
I found a few problems with HTMLStripCharFilter:
- The following exception was thrown for six of the English documents:
java.io.IOException: Mark invalid
- &apos; is not decoded.
- Content between some <script> tags is not stripped out.
- Unbalanced quotation marks in opening tags cause the tag to not be stripped out.
Left to do:
- Rename HTMLStripCharFilter to ClassicHTMLStripCharFilter; move it to Solr o.a.s.analysis package; deprecate it; and create a new Solr Factory for it.
- Rename JFlexHTMLStripCharFilter to HTMLStripCharFilter.
- Commit to trunk.
- Backport and commit to branch_3x.