Lucene - Core / LUCENE-4911

Missing word "cela" in conf/lang/stopwords_fr.txt

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      NB: Not sure this defect is assigned to the right component.

      In file example/solr/collection1/conf/lang/stopwords_fr.txt,
      there is the word "celà". Though incorrect in French (cf. http://fr.wiktionary.org/wiki/cel%C3%A0), this spelling is common; we could also add the correct spelling (i.e. "cela", without the accent) to that stopwords list.

      Another thing: I noticed that "celà" is the only word in the list followed by a non-breaking space. Is that intended?
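Both points in the description can be checked mechanically. The following is a minimal sketch, not part of the attached patch. It assumes the stop list uses the Snowball layout, where `|` starts a comment and entries are separated by whitespace. The function names and the sample text are hypothetical.

```python
NBSP = "\u00a0"  # non-breaking space (U+00A0)


def parse_stopwords(text):
    """Return the set of words in a Snowball-style stop list.

    Assumption: '|' starts a comment and everything before it is
    a whitespace-separated list of entries.
    """
    words = set()
    for line in text.splitlines():
        entry = line.split("|", 1)[0]
        # str.split() with no arguments also treats U+00A0 as whitespace,
        # so a trailing non-breaking space does not end up in the word.
        words.update(entry.split())
    return words


def lines_with_nbsp(text):
    """Return 1-based numbers of lines whose non-comment part contains U+00A0."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if NBSP in line.split("|", 1)[0]
    ]


# Hypothetical three-line excerpt; the second entry is followed by U+00A0.
sample = "ce             |  this\ncel\u00e0\u00a0         |  that\nces\n"

missing = "cela" not in parse_stopwords(sample)
flagged = lines_with_nbsp(sample)
```

Here `missing` reports whether the unaccented spelling is absent, and `flagged` lists the lines that carry a non-breaking space.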

        Activity

        Transition: Open → Resolved
        Time In Source Status: 1d 5h 25m
        Execution Times: 1
        Last Executer: Adrien Grand
        Last Execution Date: 06/Apr/13 16:12
        Iksnalybok added a comment -

        Thanks

        Adrien Grand added a comment -

        For your information, Martin Porter (himself!) added cela to the upstream stop list (http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001466.html).

        Adrien Grand made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Adrien Grand added a comment -

        Pierre, I just applied your patch to Lucene's stop list (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt?view=diff&r1=1465255&r2=1465256&pathrev=1465256). Thank you! This fix should be available in Lucene/Solr 4.3.

        I also sent an email to snowball-discuss to mention this improvement: http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001462.html

        Adrien Grand made changes -
        Project Solr [ 12310230 ] Lucene - Core [ 12310110 ]
        Key SOLR-4678 LUCENE-4911
        Affects Version/s 4.2 [ 12323899 ]
        Affects Version/s 4.2 [ 12323893 ]
        Lucene Fields New,Patch Available [ 10121, 10120 ]
        Component/s Schema and Analysis [ 12312520 ]
        Adrien Grand made changes -
        Assignee Adrien Grand [ jpountz ]
        Robert Muir added a comment -

        Thanks Pierre. Actually, this file is synchronized from lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt (via an Ant task run from solr/: 'ant sync-analyzers').

        I think we should patch this file so the fix is in the default Lucene stop list, too.

        It might also be a good idea for us to send an email about this to the snowball list (snowball-discuss@lists.tartarus.org), as that's where this file came from; they might be interested in the improvement, too.

        Iksnalybok added a comment -

        Patch added.

        Iksnalybok made changes -
        Attachment stopwords_fr.txt.patch [ 12577211 ]
        Adrien Grand added a comment -

        Indeed, we should add "cela". Can you create a patch? I don't think the non-breaking space was added on purpose; it can be removed.
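The two changes requested here (add "cela", drop the stray non-breaking space) could be applied to a stop list along the following lines. This is a hedged sketch, not the attached patch. It assumes the Snowball layout where `|` starts a comment. `clean_stop_list` is a hypothetical helper, and it appends missing words at the end of the list rather than next to related entries.

```python
NBSP = "\u00a0"  # non-breaking space (U+00A0)


def clean_stop_list(text, missing_words=()):
    """Return the stop list with U+00A0 removed from entries and missing words appended.

    Assumption: '|' starts a comment; comments are left untouched.
    """
    cleaned = []
    present = set()
    for line in text.splitlines():
        entry, sep, comment = line.partition("|")
        # Replace rather than delete, so column alignment before '|' is kept.
        entry = entry.replace(NBSP, " ")
        present.update(entry.split())
        # rstrip() drops the now-ordinary trailing space on comment-free lines.
        cleaned.append((entry + sep + comment).rstrip())
    for word in missing_words:
        if word not in present:
            cleaned.append(word)
            present.add(word)
    return "\n".join(cleaned) + "\n"


# Hypothetical excerpt: the accented entry is followed by U+00A0.
before = "ce\ncel\u00e0\u00a0\nces\n"
after = clean_stop_list(before, ["cela"])
```

Running the helper a second time on `after` leaves it unchanged, since "cela" is already present and no non-breaking space remains.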

        Iksnalybok made changes -
        Field: Description
        Original Value:
        NB: Not sure this defect is assigned to the right component.

        In file example/solr/collection1/conf/lang/stopwords_fr.txt,
        there is the word "celà". Though incorrect in French (cf. http://fr.wiktionary.org/wiki/cel%C3%A0), this spelling is common; we could also add the correct spelling (i.e. "cela", without the accent) to that stopwords list.
        New Value:
        NB: Not sure this defect is assigned to the right component.

        In file example/solr/collection1/conf/lang/stopwords_fr.txt,
        there is the word "celà". Though incorrect in French (cf. http://fr.wiktionary.org/wiki/cel%C3%A0), this spelling is common; we could also add the correct spelling (i.e. "cela", without the accent) to that stopwords list.

        Another thing: I noticed that "celà" is the only word in the list followed by a non-breaking space. Is that intended?
        Iksnalybok created issue -

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Iksnalybok
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 10m
              Remaining: 10m
              Logged: Not Specified
