Solr / SOLR-41

PATCH: HyphenatedWordsFilter, Factory and test

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None

      Description

      When plain text is extracted from documents, many words end up hyphenated and split across two lines. This is common in documents that use narrow text columns, such as newsletters.
      To improve search effectiveness, this filter rejoins hyphenated words that were broken across two lines.
      This filter must be used together with WordDelimiterFilter configured with catenateWords=1.
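
The joining step described above can be sketched in plain Java. This is a minimal illustration, not the attached HyphenatedWordsFilter: it assumes the text has already been split on whitespace, so a word broken at a line end shows up as a token ending in '-'. The class and method names (`HyphenJoinSketch`, `joinHyphenated`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java stand-in, not the attached HyphenatedWordsFilter.
final class HyphenJoinSketch {

    // Joins every token that ends with '-' to the token that follows it.
    static List<String> joinHyphenated(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = null; // word fragment still waiting for its second half
        for (String token : tokens) {
            if (pending != null) {
                pending.append(token);
            } else {
                pending = new StringBuilder(token);
            }
            int last = pending.length() - 1;
            if (last > 0 && pending.charAt(last) == '-') {
                // Drop the trailing hyphen and wait for the next token.
                pending.setLength(last);
            } else {
                out.add(pending.toString());
                pending = null;
            }
        }
        if (pending != null) {
            // Input ended on a hyphenated fragment; emit what was collected.
            out.add(pending.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulates text extracted from a narrow column: "news-" at a line end, "letters" on the next line.
        List<String> tokens = Arrays.asList("monthly", "news-", "letters", "arrive");
        System.out.println(joinHyphenated(tokens));
    }
}
```

A compound such as "well-known" is left alone in this sketch, because its token does not end with a hyphen.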

      1. hyphenatedwordsfilter.patch
        7 kB
        Boris Vitez
      2. HyphenatedWordsFilter.java
        4 kB
        Boris Vitez
      3. HyphenatedWordsFilterFactory.java
        1.0 kB
        Boris Vitez
      4. TestHyphenatedWordsFilter.java
        2 kB
        Boris Vitez
      5. hyphenatedwordsfilter.patch
        7 kB
        Boris Vitez

        Activity

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.1

        The Fix Version for all 38 issues found was set to 1.1, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman3

        Yonik Seeley added a comment -

        Thanks Boris, I just committed this.

        Boris Vitez added a comment -

        As Yonik suggested, I uploaded only the latest .diff file. Please ignore the .java attachments.
        The filter now works standalone (without WordDelimiterFilter). I couldn't use the suggested setTermText on the existing token because I needed to set the correct start and end offsets. The newly created token has the same position increment as the first token, the one that contains the hyphen.
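
To illustrate the offset and position-increment handling described in this comment, here is a minimal sketch. It uses a hypothetical `SketchToken` type, not Lucene's Token class, and it assumes the joined token should span from the start of the first fragment to the end of the second; the actual patch may differ in detail.

```java
// Hypothetical minimal token type for illustration; NOT Lucene's Token class.
final class SketchToken {
    final String text;
    final int startOffset;
    final int endOffset;
    final int positionIncrement;

    SketchToken(String text, int startOffset, int endOffset, int positionIncrement) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }
}

final class MergeSketch {

    // Builds the joined token from a fragment ending in '-' and the fragment after it.
    static SketchToken merge(SketchToken first, SketchToken second) {
        // Strip the trailing '-' from the first fragment.
        String head = first.text.substring(0, first.text.length() - 1);
        return new SketchToken(
                head + second.text,
                first.startOffset,         // span starts where the first fragment starts
                second.endOffset,          // and ends where the second fragment ends
                first.positionIncrement);  // keep the first fragment's increment
    }

    public static void main(String[] args) {
        // "news-" occupies characters 8..12, a line break follows, "letters" occupies 14..20.
        SketchToken first = new SketchToken("news-", 8, 13, 2);
        SketchToken second = new SketchToken("letters", 14, 21, 1);
        SketchToken joined = merge(first, second);
        System.out.println(joined.text + " start=" + joined.startOffset
                + " end=" + joined.endOffset + " posInc=" + joined.positionIncrement);
    }
}
```

Changing only the text of the first token would leave its end offset pointing at the hyphen, which is why a new token with both offsets set explicitly is built here.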

        Yonik Seeley added a comment -

        Did you try uploading them again? JIRA allows multiple copies of the same file, keeps track of the newest version, and greys out older versions.

        Unified diff is actually the preferred patch format... just do "svn diff" from the solr base dir (after "svn add" on any new files).

        Boris Vitez added a comment -

        Yonik, can you please remove all attachments so that I can upload the latest versions?
        I changed the filter to preserve the position increment and not to depend on the WordDelimiterFilter.

        Boris Vitez added a comment -

        Thank you for the feedback and suggestion.
        I will change the filter to use this new feature of the Token class as soon as I'm back on Monday.

        Yonik Seeley added a comment -

        Thanks Boris!

        A common problem when creating new tokens is losing existing position increments.
        I recently changed Lucene's Token class so that it's cloneable and you can change the text with setTermText().

        So you may want to just change the text of the first token rather than creating a new one.
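
The pitfall mentioned here can be shown with a small sketch. `MutableSketchToken` is a hypothetical class that only mimics the idea of a token carrying text plus a position increment; it is not Lucene's Token, and its `setText` method merely stands in for the `setTermText()` call mentioned above.

```java
// Hypothetical mutable token for illustration; it is not the real Lucene class.
final class MutableSketchToken implements Cloneable {
    private String text;
    private int positionIncrement = 1; // default used when nothing else is set

    MutableSketchToken(String text) {
        this.text = text;
    }

    String getText() { return text; }
    void setText(String text) { this.text = text; }
    int getPositionIncrement() { return positionIncrement; }
    void setPositionIncrement(int positionIncrement) { this.positionIncrement = positionIncrement; }

    @Override
    public MutableSketchToken clone() {
        try {
            return (MutableSketchToken) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

final class IncrementSketch {

    // Creating a fresh token silently falls back to the default increment.
    static MutableSketchToken replaceByNew(MutableSketchToken original, String newText) {
        return new MutableSketchToken(newText);
    }

    // Changing the text in place keeps whatever increment was already set.
    static MutableSketchToken replaceInPlace(MutableSketchToken original, String newText) {
        original.setText(newText);
        return original;
    }

    public static void main(String[] args) {
        MutableSketchToken token = new MutableSketchToken("news-");
        token.setPositionIncrement(3); // e.g. two earlier tokens were dropped by another filter

        MutableSketchToken fresh = replaceByNew(token, "newsletters");
        MutableSketchToken edited = replaceInPlace(token.clone(), "newsletters");

        System.out.println("new token increment: " + fresh.getPositionIncrement());
        System.out.println("edited token increment: " + edited.getPositionIncrement());
    }
}
```

Cloning before editing, as in `main`, leaves the original token untouched in case it is still needed.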


          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Boris Vitez
          • Votes:
            0
            Watchers:
            1
