Solr / SOLR-882

HTMLStripReader improvement - padding corrected for hexadecimal entities, option not to emit padding at all added

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Improvements to HTMLStripReader:

      • fix padding of hexadecimal entities (currently off by 1)
      • add an option not to emit padding at all. In certain applications, padding emitted after entities such as &oacute; (rendered as ó) may split words that are in fact single terms.
      • add recognition of entities that are written entirely in uppercase and are still recognized by browsers.
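
      The first two points can be illustrated with a minimal sketch. It is not the code from the patch: the class name EntityPaddingSketch and the method strip are hypothetical, only numeric entities are handled, and it assumes the padding exists to keep the output the same length as the input so that character offsets stay aligned.

```java
public class EntityPaddingSketch {

    /**
     * Decodes numeric character references such as &#243; or &#xF3;.
     * With pad == true, spaces are appended after each decoded character so the
     * output has the same length as the input; with pad == false nothing is added.
     * (Hypothetical helper for illustration, not the HTMLStripReader API.)
     */
    public static String strip(String in, boolean pad) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            if (!in.startsWith("&#", i)) {
                out.append(in.charAt(i++)); // ordinary character: copy through
                continue;
            }
            int j = i + 2;
            int radix = 10;
            if (j < in.length() && (in.charAt(j) == 'x' || in.charAt(j) == 'X')) {
                radix = 16;
                j++; // the 'x' marker counts toward the consumed length
            }
            int digitsStart = j;
            // Cap the digit run at 7 so parseInt cannot overflow.
            while (j < in.length() && j - digitsStart < 7
                    && Character.digit(in.charAt(j), radix) >= 0) {
                j++;
            }
            if (j == digitsStart) {
                out.append(in.charAt(i++)); // "&#" without digits: plain text
                continue;
            }
            int codePoint = Integer.parseInt(in.substring(digitsStart, j), radix);
            if (!Character.isValidCodePoint(codePoint)) {
                out.append(in.charAt(i++)); // out of range: leave as plain text
                continue;
            }
            // The terminating ';' is optional; consume it only if present.
            if (j < in.length() && in.charAt(j) == ';') {
                j++;
            }
            int before = out.length();
            out.appendCodePoint(codePoint);
            if (pad) {
                // Consumed input length minus emitted length, '&', '#', 'x' included.
                int missing = (j - i) - (out.length() - before);
                for (int k = 0; k < missing; k++) {
                    out.append(' ');
                }
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("Krak&#xF3;w", true) + "]");
        System.out.println("[" + strip("Krak&#xF3;w", false) + "]");
    }
}
```

      With padding enabled, "Krak&#xF3;w" becomes "Krakó", five spaces, then "w", so a tokenizer would see two terms; with padding disabled it becomes the single term "Kraków", at the cost of offsets no longer matching the original input.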
      Attachment: patch (18 kB), Dawid Weiss

          Activity

          dawidweiss Dawid Weiss added a comment -
          • Fixes hex. entities padding.
          • Adds a trigger to disable padding entirely.
          • Adds more tests to the test class.
          dawidweiss Dawid Weiss added a comment -
          • Hex. entity handling improved (fixes further padding issues that occurred when entities were not terminated with a ';')
          • Added recognition of all-uppercase entities (exceptions).
          dawidweiss Dawid Weiss added a comment -

          All tests pass, and I've added a few more that did not pass with the previous version.

          gsingers Grant Ingersoll added a comment -

          Hi Dawid,

          I don't understand the changes to the main() method in HTMLStripReader. Why the System.exit() but then keep the old piece?

          dawidweiss Dawid Weiss added a comment -

          Argh, good catch, Grant. The entire patch is fine, with the exception of the main method. What you saw in there was a dump of entities that I had to make in order to test which entities are recognized in uppercase mode and which were not. Apologies that this slipped through somehow. Do you want me to remove this from the patch or can you simply disregard the fragment that applies to the main method?

          dawidweiss Dawid Weiss added a comment -

          One more thing that may be of importance – this patch fixes a few problems, but it also alters the default behavior, so folks that have processed some large volumes of data may have different results now. I am not entirely sure how it affects the rest of SOLR.

          shalinmangar Shalin Shekhar Mangar added a comment -

          It seems we forgot this issue for 1.4. Marking it for 1.5

          hossman Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          gsingers Grant Ingersoll added a comment -

          Dawid,

          Does this still apply, given HTMLStripReader is replaced by a CharFilter?

          -Grant

          rcmuir Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          rcmuir Robert Muir added a comment -

          3.4 -> 3.5

          steve_rowe Steve Rowe added a comment -

          Dawid Weiss's uppercase character entities are included in the new HTMLStripCharFilter implementation committed in LUCENE-3690.


            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              dawidweiss Dawid Weiss