Lucene - Core / LUCENE-4220

Replace benchmark's crazy HTML parser with a NekoHTML 10-liner

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: modules/benchmark
    • Labels: None
    • Lucene Fields: New

      Description

      Benchmark contains a JavaCC-based HTML parser which of course violates all specs, is huge, and is error-prone.

      I can replace it with a NekoHTML-based one (approx. 10-20 lines of code). NekoHTML is an extension for Xerces (which we already use to read Wikipedia) that produces SAX events or a DOM tree from an HTML file using standard XML APIs. We could also use Tika, but I refuse to download the Internet to get Tika running just to parse an HTML file.
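
      To illustrate why a SAX-based parser can be this short, here is a minimal sketch of a content handler that collects the title and the body text. It is not the code from the patch: the class name `SaxHtmlTextSketch` and its `parse` method are hypothetical, and the JDK's built-in XML parser stands in for NekoHTML so that the example runs without extra dependencies (which also means it only accepts well-formed markup). With NekoHTML on the classpath, the `XMLReader` would presumably be NekoHTML's own SAX parser instead, and real-world tag soup would be accepted.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects <title> text and remaining body text from SAX events.
final class SaxHtmlTextSketch extends DefaultHandler {
  private final StringBuilder title = new StringBuilder();
  private final StringBuilder body = new StringBuilder();
  private boolean inTitle;
  private int suppressed; // nesting depth inside <script> or <style>

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    // Normalize with an explicit locale so the match does not depend on the default locale.
    String name = qName.toLowerCase(Locale.ROOT);
    if (name.equals("title")) {
      inTitle = true;
    } else if (name.equals("script") || name.equals("style")) {
      suppressed++;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    String name = qName.toLowerCase(Locale.ROOT);
    if (name.equals("title")) {
      inTitle = false;
    } else if (name.equals("script") || name.equals("style")) {
      suppressed--;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (suppressed > 0) {
      return; // ignore script and style content
    }
    (inTitle ? title : body).append(ch, start, length);
  }

  // Returns {title, bodyText}. The JDK parser is an assumption made for this sketch;
  // an HTML-tolerant XMLReader such as NekoHTML's would be swapped in here.
  static String[] parse(InputSource source) throws Exception {
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    SaxHtmlTextSketch handler = new SaxHtmlTextSketch();
    reader.setContentHandler(handler);
    reader.parse(source);
    return new String[] {handler.title.toString().trim(), handler.body.toString().trim()};
  }

  public static void main(String[] args) throws Exception {
    String xhtml = "<html><head><title>Hello</title><style>p{}</style></head>"
        + "<body><p>Some text</p></body></html>";
    String[] result = parse(new InputSource(new StringReader(xhtml)));
    System.out.println(result[0]);
    System.out.println(result[1]);
  }
}
```

      Taking an `InputSource` keeps the entry point usable with readers, byte streams, and system IDs alike.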

      Attachments:
      1. LUCENE-4220.patch (287 kB, Uwe Schindler)
      2. LUCENE-4220.patch (280 kB, Uwe Schindler)
      3. LUCENE-4220.patch (280 kB, Uwe Schindler)

          Activity

          Uwe Schindler added a comment -

          Patch using NekoHTML.

          The patch currently has a workaround for the Turkish locale bug, because NekoHTML uses toLowerCase/toUpperCase without a locale to "normalize" element and attribute names (see http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html). I opened a NekoHTML bug: https://sourceforge.net/tracker/?func=detail&aid=3544334&group_id=195122&atid=952178
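
          The underlying problem is easy to reproduce with plain Java. The sketch below is only an illustration, not the workaround from the patch; the class name `TurkishLocaleSketch` and the helper `normalizeName` are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch of the Turkish locale problem with locale-sensitive case mapping.
final class TurkishLocaleSketch {

  // Locale-independent normalization, as one would want for element and attribute names.
  static String normalizeName(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr-TR");
    String tag = "TITLE";

    // Under Turkish rules, capital 'I' lowercases to dotless 'ı' (U+0131),
    // so the result no longer equals the ASCII string "title".
    System.out.println(tag.toLowerCase(turkish).equals("title")); // prints false

    // With an explicit root locale the result is stable on every machine.
    System.out.println(normalizeName(tag).equals("title")); // prints true
  }
}
```

          Calling `toLowerCase()` without an argument uses the JVM's default locale, which is why the same parser can behave differently on a machine configured for Turkish.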

          The patch mimics most of the behaviour of the old JavaCC-based parser, with one exception: it does not lowercase META element values, which was bogus in the old parser. Keys are still lowercased, using an explicit Locale.
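
          As a rough sketch of that META handling (keys lowercased with an explicit locale, values stored untouched), under the assumption that the tags end up in a `Properties` object; the class `MetaTagsSketch` and its methods are hypothetical and not taken from the patch:

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: collect <meta name="..." content="..."> pairs.
final class MetaTagsSketch {
  private final Properties metaTags = new Properties();

  void addMeta(String name, String content) {
    if (name == null || content == null) {
      return; // skip incomplete <meta> elements instead of failing
    }
    // Only the key is normalized; the value keeps its original case.
    metaTags.setProperty(name.toLowerCase(Locale.ROOT), content);
  }

  String get(String key) {
    return metaTags.getProperty(key);
  }

  public static void main(String[] args) {
    MetaTagsSketch meta = new MetaTagsSketch();
    meta.addMeta("Keywords", "Lucene, SOLR");
    System.out.println(meta.get("keywords")); // prints Lucene, SOLR
  }
}
```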

          I copied the old parser's test to the feeds package and added some additional tests for the Turkish locale and for other kinds of invalid HTML, such as plain text or missing elements.

          Uwe Schindler added a comment -

          Small improvements to make the parser more universally usable (it now accepts an InputSource), plus performance improvements in element matching.

          Robert Muir added a comment -

          patch removes 8,600 lines of code

          +1!

          Uwe Schindler added a comment -

          The original patch had a bug (which was caused by my misunderstanding and missing test data).

          Other changes:

          • The new parser now correctly implements the TrecParser interface; the patch also cleans up the whole HTMLParser interface.
          • Removed the useless InterruptedException from method signatures (it was only there because of the crazy old parser).
          • Fixed an NPE when parsing the date from <meta.../> elements.

          It would be good if someone (e.g. Doron Cohen, who wrote the original parser, or anybody else who has a license) could temporarily provide the Gov2 TREC collection to me, so that I can check that everything works as expected. The test data is horribly small.

          Nevertheless, I will commit the current state soon to trunk and 4.x.

          Uwe Schindler added a comment - edited

          Committed trunk revision: 1361741
          Committed 4.x revision: 1361743


            People

            • Assignee: Uwe Schindler
            • Reporter: Uwe Schindler