  1. Solr
  2. SOLR-5722

Add catenateShingles option to WordDelimiterFilter


Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved

    Description

      Apologies if I put this in the wrong spot. I'm attaching a patch (against current trunk) that adds support for a 'catenateShingles' option to the WordDelimiterFilter.

      We (National Library of Australia - NLA) are currently maintaining this as an internal modification to the Filter, but I believe it is generic enough to contribute upstream.

      Description:
      =========

      /**
       * NLA Modification to the standard word delimiter to support various
       * hyphenation use cases. Primarily driven by requirements for
       * newspapers where words are often broken across line endings.
       *
       *  e.g. "hyphenated-surname" is printed across a line ending and
       *         turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
       *
       *  In this scenario the stock filter, with 'catenateAll' turned on, will
       *  generate individual tokens plus one combined token, but not
       *  sub-tokens like "hyphenated surname" and "hyphenatedsur name".
       *
       *  So we add a new 'catenateShingles' option to achieve this.
      */
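
The idea in the comment above can be sketched in plain Java. This is not the code from the attached patch: the class and method names are hypothetical, splitting is done on hyphens only, and it assumes that "shingles" here means every contiguous run of two or more sub-word parts joined together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from WDFconcatShingles.patch.
public class CatenateShinglesSketch {

    /** Splits on hyphens only. The real filter has far richer delimiter rules. */
    static List<String> splitParts(String token) {
        List<String> parts = new ArrayList<>();
        for (String part : token.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    /** Joins every contiguous run of two or more parts (assumed shingle definition). */
    static List<String> catenatedShingles(List<String> parts) {
        List<String> shingles = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder run = new StringBuilder(parts.get(start));
            for (int end = start + 1; end < parts.size(); end++) {
                run.append(parts.get(end));
                shingles.add(run.toString());
            }
        }
        return shingles;
    }

    public static void main(String[] args) {
        // prints [hyphenated, hyphenatedsurname, atedsurname]
        System.out.println(catenatedShingles(splitParts("hyphen-ated-surname")));
        // prints [hyphenatedsur, hyphenatedsurname, surname]
        System.out.println(catenatedShingles(splitParts("hyphenated-sur-name")));
    }
}
```

Under these assumptions, the first input recovers "hyphenated" and the second recovers "surname", which are the sub-tokens that 'catenateAll' alone would not produce.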
      

      The patch includes unit tests. As noted in one of them, CATENATE_WORDS and CATENATE_SHINGLES should be treated as mutually exclusive for sensible usage: enabling both can produce duplicate tokens (although the duplicates should have the same positions etc.).
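
A rough way to see where such duplicates could come from, again as a hypothetical sketch rather than the patch's logic: it assumes CATENATE_WORDS emits the full concatenation of a run of word parts, and that CATENATE_SHINGLES emits every contiguous run of two or more parts, so the full run appears in both outputs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and behaviour are assumptions.
public class CatenateOverlapSketch {

    /** Assumed CATENATE_WORDS behaviour: one token joining the whole run of word parts. */
    static String catenateWords(List<String> parts) {
        return String.join("", parts);
    }

    /** Assumed CATENATE_SHINGLES behaviour: every contiguous run of two or more parts. */
    static List<String> catenatedShingles(List<String> parts) {
        List<String> shingles = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder run = new StringBuilder(parts.get(start));
            for (int end = start + 1; end < parts.size(); end++) {
                run.append(parts.get(end));
                shingles.add(run.toString());
            }
        }
        return shingles;
    }

    /** Tokens that both options would emit for the same input. */
    static List<String> duplicates(List<String> parts) {
        List<String> both = new ArrayList<>(catenatedShingles(parts));
        both.retainAll(List.of(catenateWords(parts)));
        return both;
    }

    public static void main(String[] args) {
        // prints [hyphenatedsurname]
        System.out.println(duplicates(List.of("hyphen", "ated", "surname")));
    }
}
```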

      I'm happy to work on it more if anyone finds problems with it.

      Attachments

        1. WDFconcatShingles.patch
          16 kB
          Greg Pendlebury

        Activity

          People

            Assignee: Unassigned
            Reporter: Greg Pendlebury (gpendleb)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated: