  1. Lucene - Core
  2. LUCENE-4220

Replace benchmark's crazy HTML parser with a NekoHTML 10-liner

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 6.0
    • Component/s: modules/benchmark
    • Labels: None
    • Lucene Fields: New

    Description

      The benchmark module contains a JavaCC-based HTML parser which, of course, violates all specs, is huge, and is error-prone.

      I can replace it with a NekoHTML-based one (approx. 10-20 lines of code). NekoHTML is an extension for Xerces (which we already use to read Wikipedia) that produces SAX events or a DOM tree from an HTML file using standard XML APIs. We could also use Tika, but I refuse to download the Internet to get Tika running just to parse an HTML file.
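
To illustrate the approach described above, here is a minimal sketch of a SAX handler that collects the title and body text of a page. It is not the code from the attached patch: the class name `HtmlTextExtractor` and its methods are hypothetical. The handler only depends on the standard `org.xml.sax` API, so the demo runs against the JDK's built-in XML parser on well-formed XHTML; with the NekoHTML jar on the classpath, NekoHTML's `org.cyberneko.html.parsers.SAXParser` (itself an `XMLReader`) would be passed in instead to cope with real-world HTML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects title and body text from SAX events. */
final class HtmlTextExtractor extends DefaultHandler {
  private final StringBuilder title = new StringBuilder();
  private final StringBuilder body = new StringBuilder();
  private boolean inTitle;
  private int skipDepth; // > 0 while inside <script> or <style>

  // Element names are compared case-insensitively because an HTML parser
  // may report them in upper case.
  private static boolean isSkipped(String name) {
    return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if ("title".equalsIgnoreCase(qName)) {
      inTitle = true;
    } else if (isSkipped(qName)) {
      skipDepth++;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("title".equalsIgnoreCase(qName)) {
      inTitle = false;
    } else if (isSkipped(qName)) {
      skipDepth--;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (inTitle) {
      title.append(ch, start, length);
    } else if (skipDepth == 0) {
      body.append(ch, start, length);
    }
  }

  String getTitle() { return title.toString(); }

  String getBody() { return body.toString(); }

  /** Feeds the markup through the given reader and returns the filled handler. */
  static HtmlTextExtractor parse(XMLReader reader, String html)
      throws IOException, SAXException {
    HtmlTextExtractor handler = new HtmlTextExtractor();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(html)));
    return handler;
  }

  public static void main(String[] args) throws Exception {
    // Assumption: with NekoHTML available, this line would instead be
    //   XMLReader reader = new org.cyberneko.html.parsers.SAXParser();
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    HtmlTextExtractor result = parse(reader,
        "<html><head><title>Hello</title></head><body><p>World</p></body></html>");
    System.out.println(result.getTitle());
    System.out.println(result.getBody());
  }
}
```

Keeping the handler independent of the concrete parser means the HTML-specific dependency stays confined to the single line that creates the `XMLReader`.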

      Attachments

        1. LUCENE-4220.patch
          280 kB
          Uwe Schindler
        2. LUCENE-4220.patch
          280 kB
          Uwe Schindler
        3. LUCENE-4220.patch
          287 kB
          Uwe Schindler

            People

              Assignee: Uwe Schindler (uschindler)
              Reporter: Uwe Schindler (uschindler)
              Votes: 0
              Watchers: 0
