Solr / SOLR-1394

HTML stripper is splitting tokens

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None

    Description

      The Solr HTML stripper replaces any removed HTML with whitespace, in order to keep offsets correct for highlighting.

      However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially international).
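
The splitting can be reproduced with a small model of the padding behavior. This is a minimal sketch, not Solr's actual code: the function name `strip_with_padding` and the entity regex are assumptions made for illustration, and whitespace tokenization stands in for a real analyzer.

```python
import html
import re

# Hypothetical, simplified entity pattern (named and numeric entities).
ENTITY = re.compile(r"&#?\w+;")


def strip_with_padding(text: str) -> str:
    """Decode each entity, then pad with spaces so the length is unchanged."""
    def pad(match: re.Match) -> str:
        raw = match.group(0)
        decoded = html.unescape(raw)
        # Padding keeps later offsets aligned, but lands inside the word.
        return decoded + " " * (len(raw) - len(decoded))
    return ENTITY.sub(pad, text)


source = "Bj&ouml;rk sings"
padded = strip_with_padding(source)
tokens = padded.split()  # whitespace tokenization as a stand-in analyzer

print(repr(padded))  # 'Bjö     rk sings'
print(tokens)        # ['Bjö', 'rk', 'sings']
```

The output keeps the input length, so offsets after the entity stay valid, but the single word "Björk" is indexed as the two tokens "Bjö" and "rk".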

      The current code is actually deficient for BOTH highlighting and indexing, whereas the previous incarnation (which did not insert spaces) only had problems with highlighting.

      The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most. If the client is required to transform entities before handing text to Solr, it might as well be required to also strip tags, and then the HTML stripper would not be needed at all.

      Today, a better solution is available: offset correction. With it, we can avoid inserting extra whitespace and still get correct offsets. The attached patch implements just that.
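
The idea of offset correction can be sketched as follows. This is an illustrative model under stated assumptions, not the attached patch: the names `strip_with_corrections` and `correct_offset` are hypothetical, only entities are handled (tags are left out for brevity), and whitespace tokenization again stands in for a real analyzer. The stripper emits text without padding and records, at each point where the output became shorter than the input, how far later offsets must be shifted to point back into the original.

```python
import bisect
import html
import re

# Hypothetical, simplified entity pattern (named and numeric entities).
ENTITY = re.compile(r"&#?\w+;")


def strip_with_corrections(text):
    """Decode entities without padding; return (output, corrections).

    corrections is a list of (output_offset, cumulative_shift) pairs:
    any output offset >= output_offset lies cumulative_shift characters
    further along in the original text.
    """
    pieces, corrections = [], []
    out_len = shift = pos = 0
    for match in ENTITY.finditer(text):
        raw = match.group(0)
        decoded = html.unescape(raw)
        pieces.append(text[pos:match.start()] + decoded)
        out_len += (match.start() - pos) + len(decoded)
        shift += len(raw) - len(decoded)
        corrections.append((out_len, shift))
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces), corrections


def correct_offset(corrections, offset):
    """Map an offset in the stripped text back to the original text."""
    keys = [out_off for out_off, _ in corrections]
    i = bisect.bisect_right(keys, offset) - 1
    return offset + (corrections[i][1] if i >= 0 else 0)


source = "Bj&ouml;rk sings"
stripped, corrections = strip_with_corrections(source)
spans = [
    (m.group(), correct_offset(corrections, m.start()),
     correct_offset(corrections, m.end()))
    for m in re.finditer(r"\S+", stripped)
]

print(stripped)  # Björk sings
print(spans)     # [('Björk', 0, 10), ('sings', 11, 16)]
```

Here "Björk" stays a single token, and its corrected span `source[0:10]` is `Bj&ouml;rk`, so a highlighter working on the original text would still mark the right region.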

      Attachments

        1. SOLR-1394.patch
          7 kB
          Anders Melchiorsen
        2. SOLR-1394.patch
          15 kB
          Anders Melchiorsen
        3. hex-entity.patch
          2 kB
          Anders Melchiorsen

          People

            Assignee: Unassigned
            Reporter: Anders Melchiorsen
            Votes: 2
            Watchers: 1