Solr / SOLR-2328

HTMLStripCharFilter Leaves Broken HTML Tags

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.1
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Some kinds of 'bad' HTML are missed by HTMLStripCharFilter. For example, the following invalid HTML:
      <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>

      Is filtered to:
      <a href="http://www.twitter.com/ceonyc"@ceonyc

      I understand the challenge here: without the closing > it's tough to know where the tag ends. Still, real-world web pages are full of this kind of garbage HTML, and browsers (impressively!) seem to handle it quite gracefully.

      As a result, the leftover markup ends up in the index: users in my app can search for 'href' and get lots of matches in documents whose visible text doesn't appear to contain 'href'.
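
      To make the expected behavior concrete, here is a minimal, self-contained sketch of one lenient heuristic. It is not Solr's code and does not use HTMLStripCharFilter; the class name LenientTagStripper and its strip method are hypothetical, invented for illustration. The assumption is that a tag which runs straight into other text right after a quoted attribute value can be treated as if it had been closed at that point.

```java
/**
 * Hypothetical illustration only; not part of Solr or Lucene.
 * Strips tags from a string, and treats a tag that breaks off right
 * after a quoted attribute value as implicitly closed at that point.
 */
final class LenientTagStripper {

    static String strip(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '<' && looksLikeTagStart(in, i + 1)) {
                i = skipTag(in, i + 1);   // jump past the tag
            } else {
                out.append(c);            // ordinary text, keep it
                i++;
            }
        }
        return out.toString();
    }

    // A '<' only opens a tag when followed by a letter or '/'.
    private static boolean looksLikeTagStart(String in, int pos) {
        if (pos >= in.length()) {
            return false;
        }
        char c = in.charAt(pos);
        return Character.isLetter(c) || c == '/';
    }

    // Returns the index of the first character after the tag.
    private static int skipTag(String in, int pos) {
        int i = pos;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '>') {
                return i + 1;             // normal end of tag
            }
            if (c == '"' || c == '\'') {
                int close = in.indexOf(c, i + 1);
                if (close < 0) {
                    i++;                  // stray quote, keep scanning
                    continue;
                }
                i = close + 1;
                if (i < in.length()) {
                    char next = in.charAt(i);
                    boolean tagContinues =
                        next == '>' || next == '/' || Character.isWhitespace(next);
                    if (!tagContinues) {
                        return i;         // assumed implicit end of a broken tag
                    }
                }
                continue;
            }
            i++;
        }
        return in.length();               // tag never closed, drop the remainder
    }
}
```

      With this sketch, LenientTagStripper.strip("<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>") returns "@ceonyc", so 'href' would not reach the index. The trade-off is that text following a quoted value inside a broken tag is kept as content, which may or may not be the right call for other kinds of malformed input.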

          People

          • Assignee: Unassigned
          • Reporter: Jeff Nadler
          • Votes: 1
          • Watchers: 2

            Dates

            • Created:
              Updated:
