Lucene - Core: LUCENE-906

Elision filter for simple french analyzing

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      If you don't want to use stemming, StandardAnalyzer misses some French peculiarities such as elision.
      "l'avion", which means "the plane", should be tokenized as "avion" (plane).
      This filter could also be used with other Latin languages where elision exists.
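
      As a rough illustration of the behaviour described above (this is not the code from the attached patches), here is a minimal plain-Java sketch with no Lucene dependency. The class name ElisionSketch, the method stripElision, and the list of elided articles are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ElisionSketch {
    // Hypothetical list of elided French articles and pronouns; not taken from the patch.
    static final Set<String> ARTICLES =
            new HashSet<>(Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    /** Returns the token without its elided prefix, or unchanged if no known elision is found. */
    static String stripElision(String token) {
        int apostrophe = token.indexOf('\'');
        // No apostrophe, or nothing before or after it: leave the token alone.
        if (apostrophe <= 0 || apostrophe == token.length() - 1) {
            return token;
        }
        String prefix = token.substring(0, apostrophe);
        // Only strip when the part before the apostrophe is a known article.
        return ARTICLES.contains(prefix) ? token.substring(apostrophe + 1) : token;
    }
}
```

      With this sketch, "l'avion" becomes "avion", while a word such as "aujourd'hui" is left untouched because "aujourd" is not in the article list.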

      1. elision.patch
        4 kB
        Mathieu Lecarme
      2. elision-0.2.patch
        4 kB
        Mathieu Lecarme

        Activity

        hossman Hoss Man added a comment -

        I don't know much about French, but a few comments...

        1) "stopwords" seems like an odd name for what I would think of as a "prefix"; you may want an example in the javadocs to make it clear.

        2) Are elisions always lowercase? I imagine there should be an ignoreCase option just like StopFilter has. (Note that toLowerCase() is hardcoded in the next() method, but nothing ensures that the stopwords list is lowercased.)

        3) Are there any other characters that can appear between an elision and its root word besides '\''? (I'm particularly wondering about other Unicode characters that look like byte 0x27 but are not actually 0x27.)

        4) This probably doesn't need to be in its own contrib; contrib/analyzers should be fine. If elisions are specific to French, then contrib/analyzers/src/java/org/apache/lucene/analysis/fr/ makes the most sense; otherwise it might make sense to add a new subpackage under analysis, "linguistics" perhaps (in contrast to the existing "ngram")?
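
        Points 2 and 3 could be handled along the following lines. This is only a sketch in plain Java, not the patch's implementation: the class CaseInsensitiveElision and its methods are hypothetical names, and treating U+2019 (RIGHT SINGLE QUOTATION MARK) as an apostrophe is an assumption about which lookalike characters matter.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CaseInsensitiveElision {
    private final Set<String> prefixes = new HashSet<>();
    private final boolean ignoreCase;

    CaseInsensitiveElision(boolean ignoreCase, String... prefixes) {
        this.ignoreCase = ignoreCase;
        // Lowercase the configured prefixes up front so lookups are consistent.
        for (String p : prefixes) {
            this.prefixes.add(ignoreCase ? p.toLowerCase(Locale.ROOT) : p);
        }
    }

    // Accepts the ASCII apostrophe (0x27) and the typographic one (U+2019).
    static boolean isApostrophe(char c) {
        return c == '\'' || c == '\u2019';
    }

    String strip(String token) {
        // Look for the first apostrophe that has text on both sides.
        for (int i = 1; i < token.length() - 1; i++) {
            if (isApostrophe(token.charAt(i))) {
                String prefix = token.substring(0, i);
                if (ignoreCase) {
                    prefix = prefix.toLowerCase(Locale.ROOT);
                }
                return prefixes.contains(prefix) ? token.substring(i + 1) : token;
            }
        }
        return token;
    }
}
```

        Lowercasing the prefix list in the constructor addresses the concern in point 2 that nothing guarantees the configured list matches the lowercased token; the rest of the token keeps its original case.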

        athoune Mathieu Lecarme added a comment -

        All suggested corrections are done.

        athoune Mathieu Lecarme added a comment -

        All suggested corrections are done.

        otis Otis Gospodnetic added a comment -

        Patch applied, thanks.
        I reformatted the code to match Lucene style.
        I also put the Apache license on top of both files.

        Thanks!


          People

          • Assignee:
            otis Otis Gospodnetic
            Reporter:
            athoune Mathieu Lecarme
