Lucene - Core
  1. Lucene - Core
  2. LUCENE-5353

ShingleFilter should have a way to specify FILLER_TOKEN

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.6
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Today we have no choice that if pos_inc is > 1 there will be a `_` inserted in between the tokens. We should have the ability to change this character and the char[] that holds it should not be public static since it's mutable.

      1. LUCENE-5353.patch
        18 kB
        Steve Rowe
      2. LUCENE-5353.patch
        15 kB
        Steve Rowe
      3. LUCENE-5353.patch
        7 kB
        Steve Rowe
      4. LUCENE-5353.patch
        6 kB
        Steve Rowe
      5. LUCENE-5353.patch
        6 kB
        Ahmet Arslan

        Activity

        Hide
        Michael McCandless added a comment -

        +1

        Show
        Michael McCandless added a comment - +1
        Hide
        Steve Rowe added a comment - - edited

        Thanks for the patch, Ahmet.

        Attaching a new patch with a few minor mods:

        1. setFillerToken() now takes a String instead of a char[], which was awkward.
        2. when arg to setFillerToken() is null, set the filler token to new char[0].
        3. beefed up javadocs
        4. added tests for zero-length and null args to setFillerToken().

        I think it's ready to go.

        Show
        Steve Rowe added a comment - - edited Thanks for the patch, Ahmet. Attaching a new patch with a few minor mods: setFillerToken() now takes a String instead of a char[] , which was awkward. when arg to setFillerToken() is null , set the filler token to new char[0] . beefed up javadocs added tests for zero-length and null args to setFillerToken() . I think it's ready to go.
        Hide
        Ahmet Arslan added a comment -

        Thanks for looking into this, Steve. It seems that you accidentally attached the same patch.

        Show
        Ahmet Arslan added a comment - Thanks for looking into this, Steve. It seems that you accidentally attached the same patch.
        Hide
        Steve Rowe added a comment -

        Thanks for looking into this, Steve. It seems that you accidentally attached the same patch.

        Oops - here's the correct patch, thanks for letting me know.

        Show
        Steve Rowe added a comment - Thanks for looking into this, Steve. It seems that you accidentally attached the same patch. Oops - here's the correct patch, thanks for letting me know.
        Hide
        Simon Willnauer added a comment -

        can we maybe try to not add setters to those Filters. It's generally not a good idea to have any mutators on these classes IMO there should be a ctor that take these parameters and that entire thing should be immutable. I also don't think we should keep the public static char array since it's mutable as well.

        Show
        Simon Willnauer added a comment - can we maybe try to not add setters to those Filters. It's generally not a good idea to have any mutators on these classes IMO there should be a ctor that take these parameters and that entire thing should be immutable. I also don't think we should keep the public static char array since it's mutable as well.
        Hide
        Steve Rowe added a comment - - edited

        can we maybe try to not add setters to those Filters. It's generally not a good idea to have any mutators on these classes IMO there should be a ctor that take these parameters and that entire thing should be immutable.

        I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup?

        I also don't think we should keep the public static char array since it's mutable as well.

        The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you?

        Show
        Steve Rowe added a comment - - edited can we maybe try to not add setters to those Filters. It's generally not a good idea to have any mutators on these classes IMO there should be a ctor that take these parameters and that entire thing should be immutable. I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup? I also don't think we should keep the public static char array since it's mutable as well. The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you?
        Hide
        Steve Rowe added a comment -

        Also, Simon, switching from setters to ctor args means either a) have a zillion permutations of ctor args to support not specifying some subset of args w/ defaults, or b) doing away with defaults. I don't like either of those "solutions".

        Show
        Steve Rowe added a comment - Also, Simon, switching from setters to ctor args means either a) have a zillion permutations of ctor args to support not specifying some subset of args w/ defaults, or b) doing away with defaults. I don't like either of those "solutions".
        Hide
        Simon Willnauer added a comment -

        I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup?

        that is fine for existing stuff but new stuff should not got down that path. We can have one large ctor for all the other options that are rare no need for ton's of permutations.

        The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you?

        just make it a public static final String and call toCharArray on it no need to expose the impl details that this is a char array.

        Show
        Simon Willnauer added a comment - I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup? that is fine for existing stuff but new stuff should not got down that path. We can have one large ctor for all the other options that are rare no need for ton's of permutations. The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you? just make it a public static final String and call toCharArray on it no need to expose the impl details that this is a char array.
        Hide
        Steve Rowe added a comment -

        I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup?

        that is fine for existing stuff but new stuff should not got down that path. We can have one large ctor for all the other options that are rare no need for ton's of permutations.

        I disagree - I don't want to commit a half-assed fix here. I think the way to go is make another issue to do across-the-board cleanup, on trunk, and deprecate the existing mess on branch_4x.

        Your way makes it harder to use this option via a mile-long ctor, and the class is still not immutable. Extra user effort for zero benefit. No thanks.

        The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you?

        just make it a public static final String and call toCharArray on it no need to expose the impl details that this is a char array.

        I agree, it already is public static final, and it should be a String.

        Show
        Steve Rowe added a comment - I agree in theory, but this class has other options which should get the same treatment, and presumably other filters have the same issue. Can you make another issue to do across-the-board cleanup? that is fine for existing stuff but new stuff should not got down that path. We can have one large ctor for all the other options that are rare no need for ton's of permutations. I disagree - I don't want to commit a half-assed fix here. I think the way to go is make another issue to do across-the-board cleanup, on trunk, and deprecate the existing mess on branch_4x. Your way makes it harder to use this option via a mile-long ctor, and the class is still not immutable. Extra user effort for zero benefit. No thanks. The public static is the default value, so it needs to be there... why not keep it public? Would renaming it to "default"<whatever> do it for you? just make it a public static final String and call toCharArray on it no need to expose the impl details that this is a char array. I agree, it already is public static final, and it should be a String.
        Hide
        Simon Willnauer added a comment -

        I disagree - I don't want to commit a half-assed fix here. I think the way to go is make another issue to do across-the-board cleanup, on trunk, and deprecate the existing mess on branch_4x.

        I disagree here since a global cleanup is going to take time so I think progress is better than perfection take the chance and clean it up now. But I am not going to block it though. I am more concerned about the char[] stuff and those should be strings though.

        Show
        Simon Willnauer added a comment - I disagree - I don't want to commit a half-assed fix here. I think the way to go is make another issue to do across-the-board cleanup, on trunk, and deprecate the existing mess on branch_4x. I disagree here since a global cleanup is going to take time so I think progress is better than perfection take the chance and clean it up now. But I am not going to block it though. I am more concerned about the char[] stuff and those should be strings though.
        Hide
        Steve Rowe added a comment -

        I disagree here since a global cleanup is going to take time so I think progress is better than perfection take the chance and clean it up now.

        Okay, in that spirit, I'll add the mile-long ctor, and deprecate all the setters, including the new setFillerToken(), and I'll remove the setters when I commit on trunk.

        I am more concerned about the char[] stuff and those should be strings though.

        You said "those", implying multiple char[] stuff, but I think there's only one - are you thinking of something other than this?:

        public static final char[] FILLER_TOKEN = { '_' };
        

        I'll post a new patch in a minute.

        Show
        Steve Rowe added a comment - I disagree here since a global cleanup is going to take time so I think progress is better than perfection take the chance and clean it up now. Okay, in that spirit, I'll add the mile-long ctor, and deprecate all the setters, including the new setFillerToken() , and I'll remove the setters when I commit on trunk. I am more concerned about the char[] stuff and those should be strings though. You said "those", implying multiple char[] stuff, but I think there's only one - are you thinking of something other than this?: public static final char [] FILLER_TOKEN = { '_' }; I'll post a new patch in a minute.
        Hide
        Simon Willnauer added a comment -

        yeah sorry I meant the FILLER_TOKEN sry!

        Ah you don't need to I mean we can clean this one up in a different issue but I don't think we need to do all TokenFilters just because we wanna fix this one..

        Show
        Simon Willnauer added a comment - yeah sorry I meant the FILLER_TOKEN sry! Ah you don't need to I mean we can clean this one up in a different issue but I don't think we need to do all TokenFilters just because we wanna fix this one..
        Hide
        Steve Rowe added a comment -

        Ah you don't need to I mean we can clean this one up in a different issue but I don't think we need to do all TokenFilters just because we wanna fix this one..

        Makes sense, I'll create an issue for fixing all of them, and attach a patch for just ShingleFilter. Relatedly, I see Chris Male made ShingleAnalyzerWrapper immutable in LUCENE-3434.

        Attaching a hopefully final patch with these changes:

        1. char[] FILLER_TOKEN -> String DEFAULT_FILLER_TOKEN
        2. setTokenFiller() -> setFillerToken() (I blame my latent dyslexia for not noticing this one sooner...)
        3. TOKEN_SEPARATOR -> DEFAULT_TOKEN_SEPARATOR (it was the only public static final default without the DEFAULT_ prefix)
        4. Added fillerToken support to ShingleAnalyzerWrapper/Test
        Show
        Steve Rowe added a comment - Ah you don't need to I mean we can clean this one up in a different issue but I don't think we need to do all TokenFilters just because we wanna fix this one.. Makes sense, I'll create an issue for fixing all of them, and attach a patch for just ShingleFilter. Relatedly, I see Chris Male made ShingleAnalyzerWrapper immutable in LUCENE-3434 . Attaching a hopefully final patch with these changes: char[] FILLER_TOKEN -> String DEFAULT_FILLER_TOKEN setTokenFiller() -> setFillerToken() (I blame my latent dyslexia for not noticing this one sooner...) TOKEN_SEPARATOR -> DEFAULT_TOKEN_SEPARATOR (it was the only public static final default without the DEFAULT_ prefix) Added fillerToken support to ShingleAnalyzerWrapper / Test
        Hide
        Simon Willnauer added a comment -

        LGTM

        Show
        Simon Willnauer added a comment - LGTM
        Hide
        Steve Rowe added a comment -

        Final final patch attached - I added explicit fillerToken tests to ShingleAnalyzerWrapperTest.

        Committing shortly.

        Show
        Steve Rowe added a comment - Final final patch attached - I added explicit fillerToken tests to ShingleAnalyzerWrapperTest . Committing shortly.
        Hide
        ASF subversion and git services added a comment -

        Commit 1562639 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1562639 ]

        LUCENE-5353: ShingleFilter's filler token should be configurable

        Show
        ASF subversion and git services added a comment - Commit 1562639 from Steve Rowe in branch 'dev/trunk' [ https://svn.apache.org/r1562639 ] LUCENE-5353 : ShingleFilter's filler token should be configurable
        Hide
        ASF subversion and git services added a comment -

        Commit 1562647 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1562647 ]

        LUCENE-5353: ShingleFilter's filler token should be configurable (merged trunk r1562639)

        Show
        ASF subversion and git services added a comment - Commit 1562647 from Steve Rowe in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1562647 ] LUCENE-5353 : ShingleFilter's filler token should be configurable (merged trunk r1562639)
        Hide
        Steve Rowe added a comment -

        Committed to trunk and branch_4x.

        Thanks Ahmet and Simon!

        Show
        Steve Rowe added a comment - Committed to trunk and branch_4x. Thanks Ahmet and Simon!

          People

          • Assignee:
            Steve Rowe
            Reporter:
            Simon Willnauer
          • Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development