Solr / SOLR-882

HTMLStripReader improvement - padding corrected for hexadecimal entities, option not to emit padding at all added

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Improvements to HTMLStripReader:

      • fix the padding of hexadecimal entities (currently off by 1)
      • add an option not to emit padding at all. In certain applications, padding emitted after an entity such as ó (&oacute;) may split words that are in fact single terms.
      • recognize entities written entirely in uppercase when browsers also recognize them.
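
The padding behavior described above can be illustrated with a minimal sketch. It is not the code from the attached patch: the class `EntityPaddingSketch`, its `decode` method, and the regular expression are hypothetical names chosen for illustration. The sketch assumes that padding means appending spaces after the decoded character so that the output has the same length as the input, which keeps character offsets aligned with the original markup. For hexadecimal entities, the `x` prefix counts toward the consumed length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the HTMLStripReader implementation.
final class EntityPaddingSketch {

    // Decimal (&#243;) or hexadecimal (&#xF3;) numeric entity; the trailing ';' is optional.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));?");

    /** Decodes numeric entities; when pad is true, the output keeps the input's length. */
    static String decode(String input, boolean pad) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String raw = m.group();
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the entity untouched if it does not name a valid code point.
            String decoded = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : raw;
            out.append(decoded);
            if (pad) {
                // Padding = characters consumed (including '&', '#', 'x', ';')
                // minus characters emitted.
                int padding = raw.length() - decoded.length();
                for (int i = 0; i < padding; i++) {
                    out.append(' ');
                }
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + decode("sp&#xF3;r", true) + "]");   // padded, offsets preserved
        System.out.println("[" + decode("sp&#xF3;r", false) + "]");  // no padding, word stays whole
    }
}
```

With padding enabled, the word is split by the emitted spaces, which is the effect the second bullet describes; with padding disabled, the word stays a single term but offsets no longer match the input.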
      1. patch (18 kB, Dawid Weiss)

          Activity

          Dawid Weiss added a comment -
          • Fixes hex. entities padding.
          • Adds a trigger to disable padding entirely.
          • Adds more tests to the test class.
          Dawid Weiss added a comment -
          • Hex. entity handling improved (more issues with proper padding when entities were not terminated with a ';')
          • Added recognition of all-uppercase entities (exceptions).
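
Why uppercase entities need an explicit exception list, rather than a case-insensitive lookup, can be sketched as follows. Entity names are case-sensitive in general (`&aacute;` and `&Aacute;` denote different characters), so only specific all-uppercase spellings can be accepted. The class name, the table contents, and the particular set of uppercase names below are assumptions for illustration; the set actually added by the patch is not listed in this issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; the entity table in the patch may differ.
final class UppercaseEntitySketch {

    private static final Map<String, Character> ENTITIES = new HashMap<>();

    static {
        // A small excerpt of case-sensitive named entities.
        ENTITIES.put("amp", '&');
        ENTITIES.put("lt", '<');
        ENTITIES.put("gt", '>');
        ENTITIES.put("quot", '"');
        ENTITIES.put("copy", '\u00A9');
        ENTITIES.put("reg", '\u00AE');
        ENTITIES.put("aacute", '\u00E1');
        ENTITIES.put("Aacute", '\u00C1');

        // Assumed exception list: all-uppercase spellings that browsers also accept.
        for (String name : new String[] {"amp", "lt", "gt", "quot", "copy", "reg"}) {
            ENTITIES.put(name.toUpperCase(Locale.ROOT), ENTITIES.get(name));
        }
    }

    /** Returns the character for an entity name, or null if the name is not recognized. */
    static Character lookup(String name) {
        return ENTITIES.get(name);
    }

    public static void main(String[] args) {
        System.out.println(lookup("AMP"));     // recognized via the exception list
        System.out.println(lookup("AACUTE"));  // not recognized: no such exception
    }
}
```

A blanket lowercase conversion would wrongly map `&Aacute;` to the lowercase letter, which is why the sketch registers uppercase spellings one by one.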
          Dawid Weiss added a comment -

          All tests pass and I've added a few more that did not in the previous version.

          Grant Ingersoll added a comment -

          Hi Dawid,

          I don't understand the changes to the main() method in HTMLStripReader. Why the System.exit() but then keep the old piece?

          Dawid Weiss added a comment -

          Argh, good catch, Grant. The entire patch is fine, with the exception of the main method. What you saw in there was a dump of entities that I had to make in order to test which entities are recognized in uppercase mode and which are not. Apologies that this slipped through somehow. Do you want me to remove this from the patch or can you simply disregard the fragment that applies to the main method?

          Dawid Weiss added a comment -

          One more thing that may be of importance – this patch fixes a few problems, but it also alters the default behavior, so folks that have processed some large volumes of data may have different results now. I am not entirely sure how it affects the rest of SOLR.

          Shalin Shekhar Mangar added a comment -

          It seems we forgot this issue for 1.4. Marking it for 1.5

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          The selection criterion was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Grant Ingersoll added a comment -

          Dawid,

          Does this still apply, given HTMLStripReader is replaced by a CharFilter?

          -Grant

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Steve Rowe added a comment -

          Dawid Weiss's uppercase character entities are included in the new HTMLStripCharFilter implementation committed in LUCENE-3690.


            People

            • Assignee: Steve Rowe
            • Reporter: Dawid Weiss
            • Votes: 1
            • Watchers: 2
