Lucene - Core
  1. Lucene - Core
  2. LUCENE-2015

ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: left & right single quotation marks, en dash, em dash) which we very frequently encounter in our projects. I know that this class is now deprecated; this improvement is for legacy code that hasn't migrated yet.

      It also enables easy access to the ascii folding technique use in ASCIIFoldingFilter for potential re-use in non-Lucene-related code.

      1. ASCIIFoldingFilter-no_formatting.patch
        4 kB
        Cédrik LIME
      2. ASCIIFoldingFilter-no_formatting.patch
        2 kB
        Cédrik LIME
      3. Filters.patch
        220 kB
        Cédrik LIME
      4. ISOLatin1AccentFilter.patch
        6 kB
        Cédrik LIME
      5. LUCENE-2015.patch
        1 kB
        Cédrik LIME
      6. LUCENE-2015.patch
        1 kB
        Robert Muir

        Activity

        Hide
        Cédrik LIME added a comment -

        (UTF-8 encoding)

        Show
        Cédrik LIME added a comment - (UTF-8 encoding)
        Hide
        Robert Muir added a comment -

        Cédrik, is it possible to provide a patch without the formatting changes?

        I am having trouble seeing the changes you made to ASCIIFoldingFilter.

        btw, I think ISOLatin1AccentFilter only stays around for back compat to support old indexes, in my opinion we should not modify it for this reason.

        Show
        Robert Muir added a comment - Cédrik, is it possible to provide a patch without the formatting changes? I am having trouble seeing the changes you made to ASCIIFoldingFilter. btw, I think ISOLatin1AccentFilter only stays around for back compat to support old indexes, in my opinion we should not modify it for this reason.
        Hide
        Cédrik LIME added a comment -

        Robert,

        All I did is refactor the big switch(c) into its own method:
        public static final int foldToASCII(char c, char[] output, int outputPos)
        and change the caller (public void foldToASCII(char[] input, int length)) accordingly.

        I can submit a patch without formatting changes, but that means the source won't be nicely indented...
        Please advise.

        As for the ISOLatin1AccentFilter patch, it really is to enable us to remove a workaround for an issue we had with some special (yet frequent) chars. Feel free to ignore it should you think this part is not relevant.

        Show
        Cédrik LIME added a comment - Robert, All I did is refactor the big switch(c) into its own method: public static final int foldToASCII(char c, char[] output, int outputPos) and change the caller (public void foldToASCII(char[] input, int length)) accordingly. I can submit a patch without formatting changes, but that means the source won't be nicely indented... Please advise. As for the ISOLatin1AccentFilter patch, it really is to enable us to remove a workaround for an issue we had with some special (yet frequent) chars. Feel free to ignore it should you think this part is not relevant.
        Hide
        Robert Muir added a comment -

        Cédrik, in my opinion, it would be easier to see the patch without the formatting changes if possible.

        Even if there is bad indentation currently, I think this should be corrected in a separate patch.

        Show
        Robert Muir added a comment - Cédrik, in my opinion, it would be easier to see the patch without the formatting changes if possible. Even if there is bad indentation currently, I think this should be corrected in a separate patch.
        Hide
        Cédrik LIME added a comment -

        Here are the patches (UTF-8 encoding), 1 per filter.
        I have removed the formatting on the switch(c) in ASCIIFoldingFilter for easier review.

        Show
        Cédrik LIME added a comment - Here are the patches (UTF-8 encoding), 1 per filter. I have removed the formatting on the switch(c) in ASCIIFoldingFilter for easier review.
        Hide
        Robert Muir added a comment -

        Cédrik,

        I think adding the idea of adding a public static method for folding is OK. but I think it should essentially do what foldToAscii does, not operate on a single 'char'.

        we should avoid single 'char' as parameter arguments, instead it should work on the entire char[] I think ?

        Show
        Robert Muir added a comment - Cédrik, I think adding the idea of adding a public static method for folding is OK. but I think it should essentially do what foldToAscii does, not operate on a single 'char'. we should avoid single 'char' as parameter arguments, instead it should work on the entire char[] I think ?
        Hide
        Cédrik LIME added a comment -

        Indeed, and that was my primary (internal) patch.
        But then you loose the shared "output" buffer between incrementToken() calls, and you end up creating char[]'s like there is no tomorrow, which may be a performance regression.

        What I can do is /add/ a static method that operates on a char[], for convenient external use.
        What do you think?

        Show
        Cédrik LIME added a comment - Indeed, and that was my primary (internal) patch. But then you loose the shared "output" buffer between incrementToken() calls, and you end up creating char[]'s like there is no tomorrow, which may be a performance regression. What I can do is /add/ a static method that operates on a char[], for convenient external use. What do you think?
        Hide
        Robert Muir added a comment -

        Cédrik, why would you create char[]'s like there is no tomorrow if you add a static method that operates on char[], for external use, but also use this within the incrementToken(), passing the tokenBuffer as an argument?

        Show
        Robert Muir added a comment - Cédrik, why would you create char[]'s like there is no tomorrow if you add a static method that operates on char[], for external use, but also use this within the incrementToken(), passing the tokenBuffer as an argument?
        Hide
        Uwe Schindler added a comment -

        We cannot apply the patch to ISOLatin1Filter, as it would break indexes already using it. Because of that we migrated to ASCIIFoldingFilter and kept ISOLatin1Filter alive. So we should leave it as it is.

        To the buffer problem: For easy external use we could also provide a expert API that works like the current public foldToASCII method, which is memory efficient. But may also provide String/StringBuilder converters for external use. Internal it cannot be better as it currently is

        Show
        Uwe Schindler added a comment - We cannot apply the patch to ISOLatin1Filter, as it would break indexes already using it. Because of that we migrated to ASCIIFoldingFilter and kept ISOLatin1Filter alive. So we should leave it as it is. To the buffer problem: For easy external use we could also provide a expert API that works like the current public foldToASCII method, which is memory efficient. But may also provide String/StringBuilder converters for external use. Internal it cannot be better as it currently is
        Hide
        Cédrik LIME added a comment -

        Uwe,

        ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351

        As for ASCIIFoldingFilter, I will take a second shot at an expert API next week. Stay tuned!

        Show
        Cédrik LIME added a comment - Uwe, ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351 As for ASCIIFoldingFilter, I will take a second shot at an expert API next week. Stay tuned!
        Hide
        Robert Muir added a comment -

        ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351

        that's interesting, so if someone has a < Lucene 2.4 index built with this filter, its currently not compatible...
        I guess no one has complained but there could be some conditional logic based on Version to support those indexes...

        Show
        Robert Muir added a comment - ISOLatin1AccentFilter was already modified in Lucene 2.4: see LUCENE-1351 that's interesting, so if someone has a < Lucene 2.4 index built with this filter, its currently not compatible... I guess no one has complained but there could be some conditional logic based on Version to support those indexes...
        Hide
        Uwe Schindler added a comment -

        I would leave ISOLatin1AccentFilter as it is. No version logic for already deprecated classes, they are deprecated, so no support any more. Normally we would have removed it in 3.0, it is really only be there to support old indexes, so no new features. If until now, nobody complained, we do not need to care. Maybe the modifications were so special, that only some of the term in such indexes were affected and nobody realized that difference.

        Show
        Uwe Schindler added a comment - I would leave ISOLatin1AccentFilter as it is. No version logic for already deprecated classes, they are deprecated, so no support any more. Normally we would have removed it in 3.0, it is really only be there to support old indexes, so no new features. If until now, nobody complained, we do not need to care. Maybe the modifications were so special, that only some of the term in such indexes were affected and nobody realized that difference.
        Hide
        Michael McCandless added a comment -

        I think those changes to ISOLatin1AccentFilter predated our Version logic... I agree that had Version been around we probably should have used it.

        Show
        Michael McCandless added a comment - I think those changes to ISOLatin1AccentFilter predated our Version logic... I agree that had Version been around we probably should have used it.
        Hide
        Cédrik LIME added a comment -

        As suggested by Robert, here is a new version of the ASCIIFoldingFilter patch which exposes the folding logic.
        I have added 2 convenience methods that can operate on a char[] and on a CharSequence.

        Show
        Cédrik LIME added a comment - As suggested by Robert, here is a new version of the ASCIIFoldingFilter patch which exposes the folding logic. I have added 2 convenience methods that can operate on a char[] and on a CharSequence.
        Hide
        Robert Muir added a comment -

        Cédrik, thanks!

        at a glance this looks good to me... can look at it more thoroughly later, i am heading out of town.

        Show
        Robert Muir added a comment - Cédrik, thanks! at a glance this looks good to me... can look at it more thoroughly later, i am heading out of town.
        Hide
        Mark Miller added a comment -

        For this type of stuff "no one has complained" doesn't mean much - thats why these changes are so insidious - they are easy not to notice - docs just disappear, and users likely don't know they ever existed. For some apps this is absolutely disastrous.

        We prob should have been more careful with 1351 and more careful in the future.

        Show
        Mark Miller added a comment - For this type of stuff "no one has complained" doesn't mean much - thats why these changes are so insidious - they are easy not to notice - docs just disappear, and users likely don't know they ever existed. For some apps this is absolutely disastrous. We prob should have been more careful with 1351 and more careful in the future.
        Hide
        Cédrik LIME added a comment -

        Robert, any news on this patch? Can we get it applied for Lucene 3.1?

        Show
        Cédrik LIME added a comment - Robert, any news on this patch? Can we get it applied for Lucene 3.1?
        Hide
        Robert Muir added a comment -

        Cédrik: i brought the patch up to date, but modified it slightly.

        The reasoning is, I would prefer if we just expose one method in this case.

        Show
        Robert Muir added a comment - Cédrik: i brought the patch up to date, but modified it slightly. The reasoning is, I would prefer if we just expose one method in this case.
        Hide
        Cédrik LIME added a comment -

        Robert: I liked the dual approach (fold 1 char / a char[]) as it offered maximum flexibility (folding a String didn't incur a systematic copy of the input as toCharArray() does, I could use charAt() in a loop).
        Nevertheless, I will be happy with a single method if this is your preferred approach.

        I have updated your patch slightly to model the API after System.arraycopy(), which makes it a bit more flexible and easier to use:

        • added offset for output
        • shuffled the arguments order to mimic System.arraycopy()
        • updated JavaDoc
        Show
        Cédrik LIME added a comment - Robert: I liked the dual approach (fold 1 char / a char[] ) as it offered maximum flexibility (folding a String didn't incur a systematic copy of the input as toCharArray() does, I could use charAt() in a loop). Nevertheless, I will be happy with a single method if this is your preferred approach. I have updated your patch slightly to model the API after System.arraycopy() , which makes it a bit more flexible and easier to use: added offset for output shuffled the arguments order to mimic System.arraycopy() updated JavaDoc
        Hide
        Robert Muir added a comment -

        Thanks Cédrik, I like your latest change.

        My primary reasoning for minimizing the API is because each exposed
        method has some cost to us (backwards compatibility).

        I think if someone wants to fold a String they can still work with this API,
        e.g. use a char[1] container, and not even bother if charAt() < 0x7F, etc.

        In general I guess i am less concerned about this as the Lucene API
        doesn't use String.

        I will commit in a day or two if no one objects.

        Show
        Robert Muir added a comment - Thanks Cédrik, I like your latest change. My primary reasoning for minimizing the API is because each exposed method has some cost to us (backwards compatibility). I think if someone wants to fold a String they can still work with this API, e.g. use a char [1] container, and not even bother if charAt() < 0x7F, etc. In general I guess i am less concerned about this as the Lucene API doesn't use String. I will commit in a day or two if no one objects.
        Hide
        Robert Muir added a comment -

        Committed revision 922277.

        Thanks Cédrik!

        Show
        Robert Muir added a comment - Committed revision 922277. Thanks Cédrik!

          People

          • Assignee:
            Robert Muir
            Reporter:
            Cédrik LIME
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development