Lucene - Core / LUCENE-889

Standard tokenizer with punctuation output

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      This patch adds punctuation (comma, period, question mark and exclamation point) tokens as output from the StandardTokenizer, and filters them out in the StandardFilter.

      (I needed them for text classification reasons.)

      1. standard.patch
        46 kB
        Karl Wettin
      2. test.patch
        1 kB
        Karl Wettin

        Activity

        Karl Wettin added a comment -

        standard.patch is rooted at src/java/org/apache/lucene/analysis
        test.patch is rooted at src/test/org/apache/lucene/analysis

        I'm sorry about the non-trunk patch. My local copy of Lucene is a bit messed up.

        Erik Hatcher added a comment -

        This patch concerns me. This changes default behavior in a very basic and commonly used piece of Lucene. At the very least this should be made entirely optional and off by default.

        Thoughts?

        Karl Wettin added a comment -

        Erik Hatcher [25/May/07 06:57 AM]
        > This patch concerns me. This changes default behavior
        > in a very basic and commonly used piece of Lucene. At
        > the very least this should be made entirely optional and
        > off by default.
        >
        > Thoughts?

        It is off by default. The punctuation comes out from the tokenizer, but the StandardAnalyzer uses a StandardFilter, and the StandardFilter will filter out the punctuation tokens. In order to get the punctuation, one needs to use a plain StandardTokenizer.
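
A minimal sketch of the arrangement described above, assuming nothing about the actual patch: the tokenizer emits punctuation tokens and a downstream filter drops them, so only callers that use the tokenizer on its own see punctuation. It does not use the Lucene API; `Tok`, `tokenize`, `filter`, and the type labels are hypothetical stand-ins for StandardTokenizer and StandardFilter.

```java
import java.util.ArrayList;
import java.util.List;

class PunctuationPipelineSketch {

    /** Hypothetical token: the text plus a type label. */
    static final class Tok {
        final String text;
        final String type;
        Tok(String text, String type) { this.text = text; this.type = type; }
    }

    static final String WORD = "<ALPHANUM>";
    static final String PUNCTUATION = "<PUNCTUATION>";

    /** Stand-in for the patched tokenizer: emits words and , . ? ! as separate tokens. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any non-word character ends the current word.
            if (word.length() > 0) {
                out.add(new Tok(word.toString(), WORD));
                word.setLength(0);
            }
            // Only the four punctuation marks named in the description become tokens.
            if (",.?!".indexOf(c) >= 0) {
                out.add(new Tok(String.valueOf(c), PUNCTUATION));
            }
        }
        if (word.length() > 0) {
            out.add(new Tok(word.toString(), WORD));
        }
        return out;
    }

    /** Stand-in for the patched filter: removes punctuation tokens again. */
    static List<Tok> filter(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!PUNCTUATION.equals(t.type)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Helper that extracts the token texts for printing or comparison. */
    static List<String> texts(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "Hello, world! Ready?";
        System.out.println(texts(tokenize(s)));          // tokenizer used on its own
        System.out.println(texts(filter(tokenize(s))));  // tokenizer followed by the filter
    }
}
```

In this sketch the filtered pipeline yields only words, while a caller that skips the filter also receives the punctuation tokens.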

        Hoss Man added a comment -

        > In order to get the punctuation, one needs to use a plain StandardTokenizer.

        I believe that is Erik's point. StandardTokenizer is a public class that many people use directly (specifically: everyone who has ever posted a question about changing the behavior of StandardAnalyzer and been given the stock answer "write your own Analyzer that uses the same Tokenizer and changes/adds to the list of TokenFilters").

        Karl Wettin added a comment -

        Hoss Man [25/May/07 11:14 AM]
        > > In order to get the punctuation, one needs to use a plain StandardTokenizer.
        >
        > I believe that is Erik's point. StandardTokenizer is a public class that many
        > people use directly (specifically: everyone who has ever posted a question
        > about changing the behavior of StandardAnalyzer and been given the stock
        > answer "write your own Analyzer that uses the same Tokenizer and
        > changes/adds to the list of TokenFilters").

        Aha. My JavaCC skills aren't that great. I'll look into it.

        I presume something like

        isTokenizingPunctuation() && token = <PUNCTUATION> |

        is possible.
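
One way to read the opt-in idea, sketched in plain Java rather than in the JavaCC grammar: a flag that defaults to off, so existing direct users of the tokenizer see no change. The class `OptionalPunctuationSketch` and its constructors are hypothetical; only the method name `isTokenizingPunctuation` echoes the comment above, and none of this is the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class OptionalPunctuationSketch {

    // Hypothetical opt-in switch; false keeps the old behavior.
    private final boolean tokenizingPunctuation;

    /** Default construction leaves punctuation output off. */
    OptionalPunctuationSketch() {
        this(false);
    }

    OptionalPunctuationSketch(boolean tokenizingPunctuation) {
        this.tokenizingPunctuation = tokenizingPunctuation;
    }

    boolean isTokenizingPunctuation() {
        return tokenizingPunctuation;
    }

    /** Splits the input into words and, only when enabled, , . ? ! tokens. */
    List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            // Any non-word character ends the current word.
            if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
            // The guard mirrors the proposed condition: emit punctuation only if asked to.
            if (isTokenizingPunctuation() && ",.?!".indexOf(c) >= 0) {
                out.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new OptionalPunctuationSketch().tokenize("Stop. Go!"));
        System.out.println(new OptionalPunctuationSketch(true).tokenize("Stop. Go!"));
    }
}
```

With the flag off by default, a custom Analyzer built on the same tokenizer would keep receiving word tokens only, which is the concern raised earlier in the thread.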

        Karl Wettin added a comment -

        artifact


          People

          • Assignee: Unassigned
          • Reporter: Karl Wettin