Lucene - Core / LUCENE-590

Demo HTML parser gives incorrect summaries when title is repeated as a heading

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/examples
    • Labels:
      None

      Description

      If an HTML document repeats its title as a heading at the top of the document, the HTMLParser returns the title as the summary and ignores everything else that was added to the summary. Instead, it should do essentially the opposite: keep the rest of the summary and chop off the title part at the beginning. I don't see any benefit to repeating the title in the summary in any case.

      In HTMLParser.jj's getSummary():

      String sum = summary.toString().trim();
      String tit = getTitle();
      if (sum.startsWith(tit) || sum.equals(""))
        return tit;
      else
        return sum;

      change it to: (* denotes a line that has changed)

      String sum = summary.toString().trim();
      String tit = getTitle();
      * if (sum.startsWith(tit)) // don't repeat title in summary
      *   return sum.substring(tit.length()).trim();
        else
          return sum;
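      To make the difference concrete, here is a minimal standalone sketch of the two variants quoted above. The class name SummaryDemo, the method names currentSummary and proposedSummary, and the sample strings are hypothetical; the real logic lives in getSummary() in HTMLParser.jj and reads the parser's own summary buffer and title. This illustrates the reporter's proposal only and is not the committed patch.

```java
public class SummaryDemo {
    // Behavior quoted in the report: if the summary starts with the title
    // (or is empty), the title alone is returned.
    static String currentSummary(String summary, String title) {
        String sum = summary.trim();
        if (sum.startsWith(title) || sum.equals("")) {
            return title;
        }
        return sum;
    }

    // Behavior proposed in the report: strip the leading title and keep the rest.
    static String proposedSummary(String summary, String title) {
        String sum = summary.trim();
        if (sum.startsWith(title)) { // don't repeat title in summary
            return sum.substring(title.length()).trim();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical document text whose first heading repeats the title.
        String title = "Release Notes";
        String text = "Release Notes This page lists the changes.";
        System.out.println(currentSummary(text, title));  // prints "Release Notes"
        System.out.println(proposedSummary(text, title)); // prints "This page lists the changes."
    }
}
```

      One consequence of the proposed variant as written: because the sum.equals("") check is dropped, an empty summary with a non-empty title yields an empty string rather than falling back to the title.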

        Activity

        Daniel Naber added a comment -

        decrease priority (affects demo only)

        Robert Muir added a comment -

        Here's a patch with a test... we don't even need to substring the summary...
        the title is never added to the summary.

        Robert Muir added a comment -

        Committed revision 1031467, 1031468 (3x)
        Thanks Curtis!

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee: Robert Muir
          • Reporter: Curtis d'Entremont
          • Votes: 0
          • Watchers: 1
