Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      It wouldn't hurt Solr (StopFilterFactory) to allow one to specify multiple stopword files.
      I've patched Solr to support this, for example:

      <filter class="solr.StopFilterFactory" ignoreCase="true" words="hr_stopwords.txt, hr_stopmorphemes.txt"/>

      I'll upload a patch shortly and commit later this week.

      1. SOLR-438.patch
        1 kB
        Otis Gospodnetic
      2. SOLR-438.patch
        2 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          Yonik Seeley added a comment -

          Seems like this could be a more general feature for all the one-entry-per-line type files (synonyms and protected words).

          Hoss Man added a comment -

          haven't looked at the patch, agree with the idea in principle, but concerned about how we can make the syntax work safely .... comma and space are both legal filename characters.

          Otis Gospodnetic added a comment -

          Yonik - I agree. Is there a different place/class where this could be added then? I didn't spot one.

          Hoss - I knew somebody would say that. I initially wanted to go with File.pathSeparator as the delimiter, but then thought about example config files and how they wouldn't work out of the box if we used, say, ";" and the person is using Winblows. Got a suggestion for that? Or we can simply document: "Commas in filenames verboten!"

          Ryan McKinley added a comment -

          I'm not sure either is a good idea, but I'll throw it out there just for argument's sake:

          Option 1: check the arguments for names that start with "words":

           <filter ... words="stop1.txt" words01="stop2.txt" />
          

          Option 2: change MapInitializedPlugin from Map<String,String> to Map<String,Object> (backwards compatible), then this could be:

           <filter ... words="stop1.txt" words="stop2.txt" />
          

          the plugin loader could make a List<String> if the attribute shows up twice. I'm sure that breaks some XML spec somewhere though...
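
A minimal sketch of how Option 1 might look, assuming the factory receives its attributes as a Map<String,String>. The class and method names (PrefixedArgs, valuesWithPrefix) are hypothetical and not part of Solr. Values are ordered by attribute name so that "words" comes before "words01".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PrefixedArgs {
    /** Collects the values of all attributes whose name starts with the prefix, ordered by attribute name. */
    public static List<String> valuesWithPrefix(Map<String, String> args, String prefix) {
        List<String> values = new ArrayList<>();
        // TreeMap iterates keys in sorted order: "words", "words01", "words02", ...
        for (Map.Entry<String, String> e : new TreeMap<>(args).entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                values.add(e.getValue());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = Map.of(
            "ignoreCase", "true",
            "words", "stop1.txt",
            "words01", "stop2.txt");
        System.out.println(valuesWithPrefix(attrs, "words")); // [stop1.txt, stop2.txt]
    }
}
```

One caveat with a plain prefix match: any unrelated attribute whose name happens to start with "words" would be picked up as well.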

          Ryan McKinley added a comment -

          duplicate attribute names are bad: (no kidding)
          http://www.w3.org/TR/1999/REC-xml-names-19990114/#uniqAttrs

          Again, I'm not sure it is a good idea, but maybe a Filter/Tokenizer could optionally implement NamedListInitializedPlugin, then it could be:

          <filter class="solr.StopFilterFactory" >
           <arr name="words">
            <str name="stop1.txt"/>
            <str name="stop2.txt"/>
           </arr>
          </filter>
          

          kinda ugly.

          again, shooting from the hip, encode a JSON list in the value?

            <filter ... wordfilelist="{'stop1.txt', 'stop2.txt'}" />
          
          Mike Klaas added a comment -

          If we're going for ugly: pick a filename delimiter. First test to see if the file exists as a whole string (including all delimiters) and if it doesn't try splitting.

          Special-case behaviour like that is rather kludgy, though.

          Yonik Seeley added a comment -

          What about standard backslash escaping?

          Ryan McKinley added a comment -

          or maybe:

          <filter ... words="stop.txt,stop2.txt" words-delimiter="," />
          

          In general, the existence of xxx-delimiter would mean split xxx on that char to make a list.
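
A rough sketch of that convention, assuming the plugin's attributes arrive as a Map<String,String>. DelimiterArgs and getList are hypothetical names used only for illustration: if "xxx-delimiter" is present, "xxx" is split on it, otherwise the value is returned as a single entry.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class DelimiterArgs {
    /** Returns the value of 'name' as a list, split only if a 'name-delimiter' attribute exists. */
    public static List<String> getList(Map<String, String> args, String name) {
        String value = args.get(name);
        if (value == null) {
            return Collections.emptyList();
        }
        String delimiter = args.get(name + "-delimiter");
        if (delimiter == null || delimiter.isEmpty()) {
            // No delimiter configured: treat the whole value as one entry.
            return Collections.singletonList(value);
        }
        // Pattern.quote makes delimiters such as "|" or "." match literally.
        return Arrays.asList(value.split(Pattern.quote(delimiter)));
    }
}
```

With this scheme a file name containing a comma stays usable: either omit the delimiter attribute or choose a delimiter that does not occur in any of the names.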

          Hoss Man added a comment -

          > I initially wanted to go with File.pathSeparator as the delimiter, but then thought
          > about example config files and how they wouldn't work out of the box if we used,
          > say, ";" and the person is using Winblows. Got a suggestion for that?

          I think using File.pathSeparator makes perfect sense ... we just wouldn't use multiple files in the example configs (but can include a comment mentioning that it is possible)

          Otis Gospodnetic added a comment -

          Hoss - duh, why didn't I think of that? That sounds good to me. That's what comments are for.

          Before I read this I was going to do (new File(attribVal).exists()) type of a check if the attribVal has a comma and split only if file doesn't exist. But I like your suggestion better - keep the code clean.

          But there is currently no other "common" place for this, right?
          Maybe add String[] getStrings(String) to BaseTokenFilterFactory ?

          Ryan McKinley added a comment -

          the problem with using File.pathSeparator is that you would need a different config to run the same thing on unix vs windows. I develop on windows and deploy on linux – how would that work?
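
To illustrate the portability concern: File.pathSeparator is ";" on Windows and ":" on Unix-like systems, so the same attribute value splits differently depending on where it runs. PathSeparatorDemo and its split helper below are hypothetical names for this sketch only.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PathSeparatorDemo {
    /** Splits a configured value on a literal separator string. */
    public static List<String> split(String value, String separator) {
        return Arrays.asList(value.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        String configured = "stop1.txt;stop2.txt";
        System.out.println(split(configured, ";")); // Windows separator: two file names
        System.out.println(split(configured, ":")); // Unix separator: one (nonexistent) file name
        // What this JVM would actually do with the same config:
        System.out.println(split(configured, File.pathSeparator));
    }
}
```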

          Yonik Seeley added a comment -

          File.pathSeparator makes it platform specific and doesn't solve the problem of what to do if the separator is in the filename (it's a Java concept not an OS concept), so you still have to do escaping if you want to support all filenames.

          I'd just pick a logical separator ("," seemed fine to me) and allow backslash escaping in the unlikely event that the filename is really weird. Any bets that no one has configured a word list filename with a "," in it anyway?
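
A minimal sketch of comma splitting with backslash escaping, under the assumption that a backslash makes the following character literal and that surrounding whitespace and empty entries are dropped. EscapedSplit and splitFileNames are hypothetical names, not the code that was committed.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplit {
    /** Splits on unescaped commas; "\," yields a literal comma inside a name. */
    public static List<String> splitFileNames(String value) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' && i + 1 < value.length()) {
                // Escaped character: drop the backslash, keep the next char as-is.
                current.append(value.charAt(++i));
            } else if (c == ',') {
                addIfNotEmpty(names, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotEmpty(names, current);
        return names;
    }

    private static void addIfNotEmpty(List<String> names, StringBuilder sb) {
        String name = sb.toString().trim();
        if (!name.isEmpty()) {
            names.add(name);
        }
    }
}
```

Under these assumptions, "hr_stopwords.txt, hr_stopmorphemes.txt" gives two names, while a value written as a\,b.txt,c.txt gives "a,b.txt" and "c.txt". A side effect of this escaping rule is that Windows-style paths would need doubled backslashes.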

          Shalin Shekhar Mangar added a comment -

          This patch tests if the whole string points to a valid file. If not, it uses comma as a separator for multiple files. It also allows a preceding backslash to escape a comma in a file name.
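
The behaviour described here could be sketched roughly as follows. This is not the attached patch; WordFileResolver, resolve, and the confDir parameter are hypothetical, and the regex-based split is a simplification that does not handle an escaped backslash directly before a comma.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WordFileResolver {
    /** Returns the file names denoted by an attribute value, relative to confDir. */
    public static List<String> resolve(Path confDir, String value) {
        // 1. If the whole value names an existing file, use it unchanged.
        if (Files.exists(confDir.resolve(value))) {
            return Collections.singletonList(value);
        }
        // 2. Otherwise split on commas that are not preceded by a backslash,
        //    then turn each "\," back into a plain comma.
        List<String> names = new ArrayList<>();
        for (String part : value.split("(?<!\\\\),")) {
            String name = part.replace("\\,", ",").trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Checking the whole string first keeps existing configurations working even if a file name already contains a comma, at the cost of the special-case behaviour discussed earlier in the thread.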

          Shalin Shekhar Mangar added a comment -

          I've opened SOLR-663 to add this general capability for all the one-entry-per-line type files as per Yonik's comment.

          Otis – I think the best common place for the split code would be org.apache.solr.common.util.StrUtils which already has some splitSmart methods.

          Noble Paul added a comment -

          > just pick a logical separator ("," seemed fine to me) and allow backslash escaping in the unlikely event that the filename is really weird. Any bets that no one has configured a word list filename with a "," in it anyway?

          A comma is no problem. It looks very intuitive. Though it is a valid filename character, it is OK to have a limitation. Users can easily name their files accordingly.
          The new replication (SOLR-561) uses conf file names as comma-separated values. If a comma is used in names, that will also fail.

          Shalin Shekhar Mangar added a comment -

          SOLR-663 has been committed. We can mark this issue as resolved.


            People

            • Assignee:
              Otis Gospodnetic
              Reporter:
              Otis Gospodnetic
            • Votes:
              0
              Watchers:
              0
