Lucene - Core
  1. Lucene - Core
  2. LUCENE-1813

Add option to ReverseStringFilter to mark reversed tokens

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      This patch implements additional functionality in the filter to "mark" reversed tokens with a special marker character (Unicode 0001). This is useful when indexing both straight and reversed tokens (e.g. to implement efficient leading wildcards search).

      1. reverseMark.patch
        5 kB
        Andrzej Bialecki
      2. reverseMark-2.patch
        3 kB
        Andrzej Bialecki
      3. LUCENE-1813.patch
        4 kB
        Robert Muir
      4. LUCENE-1813.patch
        4 kB
        Robert Muir
      5. LUCENE-1813.patch
        5 kB
        Robert Muir

        Activity

        Shai Erera made changes -
        Component/s modules/analysis [ 12310230 ]
        Component/s contrib/analyzers [ 12312333 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562923 ] jira [ 12583804 ]
        Mark Thomas made changes -
        Workflow jira [ 12473294 ] Default workflow, editable Closed status [ 12562923 ]
        Mark Miller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Robert Muir added a comment -

        Committed revision 805769.

        Thanks Andrzej and also everyone who provided feedback

        Show
        Robert Muir added a comment - Committed revision 805769. Thanks Andrzej and also everyone who provided feedback
        Robert Muir made changes -
        Attachment LUCENE-1813.patch [ 12416949 ]
        Hide
        Robert Muir added a comment -

        just some additional javadocs.

        Show
        Robert Muir added a comment - just some additional javadocs.
        Hide
        Robert Muir added a comment -

        please let me know if you have any feedback on the latest patch.
        if there are no comments I would like to resolve this issue tomorrow or thursday, thanks!

        Show
        Robert Muir added a comment - please let me know if you have any feedback on the latest patch. if there are no comments I would like to resolve this issue tomorrow or thursday, thanks!
        Robert Muir made changes -
        Attachment LUCENE-1813.patch [ 12416826 ]
        Hide
        Robert Muir added a comment -

        same as before (you pick your own character), but with some constants for example marker characters, ones mentioned under this issue.

        Show
        Robert Muir added a comment - same as before (you pick your own character), but with some constants for example marker characters, ones mentioned under this issue.
        Hide
        Robert Muir added a comment -

        thanks for your comments guys, I like the idea of constants for some of these suggested characters.

        i will update the patch later tonight if no one wants to tackle it and beat me to it first

        Show
        Robert Muir added a comment - thanks for your comments guys, I like the idea of constants for some of these suggested characters. i will update the patch later tonight if no one wants to tackle it and beat me to it first
        Hide
        DM Smith added a comment -

        I like the idea of a constant and it presented as a default. I suggest that others be given in the JavaDoc.

        I have some texts which are using PUAs until Unicode includes the code points (e.g. Myanmar text), so I'm glad that allowing a choice doesn't create a potential conflict there. I think PUA should be left to the text author.

        As my texts are all derived from XML, I like the use of a character that is not allowed in XML. I think 0001 is just fine, even if not from a purity perspective.

        Some of my texts have BIDI markers and while these will be stripped by filters, I don't think this use is analogous.

        Show
        DM Smith added a comment - I like the idea of a constant and it presented as a default. I suggest that others be given in the JavaDoc. I have some texts which are using PUAs until Unicode includes the code points (e.g. Myanmar text), so I'm glad that allowing a choice doesn't create a potential conflict there. I think PUA should be left to the text author. As my texts are all derived from XML, I like the use of a character that is not allowed in XML. I think 0001 is just fine, even if not from a purity perspective. Some of my texts have BIDI markers and while these will be stripped by filters, I don't think this use is analogous.
        Hide
        Grant Ingersoll added a comment -

        Perhaps it is useful to define a few constants for each of these suggested characters to make it super easy for people to use them? Just a thought. Otherwise, I like the idea of passing in your own marker.

        Show
        Grant Ingersoll added a comment - Perhaps it is useful to define a few constants for each of these suggested characters to make it super easy for people to use them? Just a thought. Otherwise, I like the idea of passing in your own marker.
        Hide
        Robert Muir added a comment -

        Ted, with the current patch you can do this: new ReverseStringFilter('\u200E'), or new ReverseStringFilter('\u200F'), or new ReverseStringFilter('\u0001'), or whatever.

        Also, for anyone using this filter its my understanding that each term in lucene's term dictionary is a "delta" versus the previous term, so the character you choose should not affect its size?

        Show
        Robert Muir added a comment - Ted, with the current patch you can do this: new ReverseStringFilter('\u200E'), or new ReverseStringFilter('\u200F'), or new ReverseStringFilter('\u0001'), or whatever. Also, for anyone using this filter its my understanding that each term in lucene's term dictionary is a "delta" versus the previous term, so the character you choose should not affect its size?
        Hide
        Ted Dunning added a comment -

        I understand the desire to use a mark that requires fewer bytes, but the unicode bidi marks might be better for the purpose of marking writing direction: (U+200E LTR or U+200F RTL)

        Show
        Ted Dunning added a comment - I understand the desire to use a mark that requires fewer bytes, but the unicode bidi marks might be better for the purpose of marking writing direction: (U+200E LTR or U+200F RTL)
        Hide
        Andrzej Bialecki added a comment -

        +1. One comment, perhaps stating the obvious .. I picked char 0001 for two reasons - it's not likely to be used in regular text, and its UTF-8 encoding uses one byte. The use case for this filter means that it will create more or less as many tokens as there were in the original token stream, thus doubling the size of term dictionary. One byte here, one byte there, and suddenly it matters whether we use 0001 or FFFF ...

        Show
        Andrzej Bialecki added a comment - +1. One comment, perhaps stating the obvious .. I picked char 0001 for two reasons - it's not likely to be used in regular text, and its UTF-8 encoding uses one byte. The use case for this filter means that it will create more or less as many tokens as there were in the original token stream, thus doubling the size of term dictionary. One byte here, one byte there, and suddenly it matters whether we use 0001 or FFFF ...
        Hide
        Robert Muir added a comment -

        Paul, thanks very much for your feedback... I think its a cleaner change now.

        Show
        Robert Muir added a comment - Paul, thanks very much for your feedback... I think its a cleaner change now.
        Hide
        Paul Cowan added a comment -

        Simple's good. I'm a happy man, thanks Robert.

        Show
        Paul Cowan added a comment - Simple's good. I'm a happy man, thanks Robert.
        Robert Muir made changes -
        Attachment LUCENE-1813.patch [ 12416732 ]
        Hide
        Robert Muir added a comment -

        updated patch so you can choose your own character for marking.

        if one character is not enough let me know (i suppose we could make it a sequence), but I'd rather keep this simple.

        Show
        Robert Muir added a comment - updated patch so you can choose your own character for marking. if one character is not enough let me know (i suppose we could make it a sequence), but I'd rather keep this simple.
        Hide
        Robert Muir added a comment -

        Paul ok, how about instead of char, a sequence

        Then you can use however many characters you want...

        Show
        Robert Muir added a comment - Paul ok, how about instead of char, a sequence Then you can use however many characters you want...
        Hide
        Paul Cowan added a comment -

        Yes, or +1 on passing in your own marker, that lets everyone choose a character that won't clash with whatever unicode subset gets used for their tokens.

        Show
        Paul Cowan added a comment - Yes, or +1 on passing in your own marker, that lets everyone choose a character that won't clash with whatever unicode subset gets used for their tokens.
        Hide
        Paul Cowan added a comment -

        OK, cool. I'm taking an interest in this purely because I have some ideas for other token filters which would do something similar, and really like the idea of tagging them in the same way just with different 'headers'. It would be really beneficial, I think, to come up with something that can be reused and, more importantly, combined (so different filters don't 'clash' with their output). What about making it 2 characters, at least?

        U+0001 START OF HEADER
        U+xxxx whatever you like to indicate 'reversing' (i.e. an 'R', or just a 0-byte as this is the first purpose allocated, or whatever)

        This adds 2 bytes to each term, not 1, but terms generally don't take up that much room in the scale of a whole index and I think it's worth the flexibility. Hell, if you're willing to use 3 (that IS starting to seem wasteful, I admit) then maybe

        U+0001 START OF HEADER
        U+xxxx whatever
        U+0002 START OF TEXT

        That's at least semantically meaningful. Other ideas, just looking at the ASCII control characters:

        U+xxxx whatever
        U+001F UNIT SEPARATOR

        or

        U+000E SHIFT OUT
        U+xxxx whatever
        U+000F SHIFT IN

        I don't really mind, but it's always nice to plan ahead.

        Show
        Paul Cowan added a comment - OK, cool. I'm taking an interest in this purely because I have some ideas for other token filters which would do something similar, and really like the idea of tagging them in the same way just with different 'headers'. It would be really beneficial, I think, to come up with something that can be reused and, more importantly, combined (so different filters don't 'clash' with their output). What about making it 2 characters, at least? U+0001 START OF HEADER U+xxxx whatever you like to indicate 'reversing' (i.e. an 'R', or just a 0-byte as this is the first purpose allocated, or whatever) This adds 2 bytes to each term, not 1, but terms generally don't take up that much room in the scale of a whole index and I think it's worth the flexibility. Hell, if you're willing to use 3 (that IS starting to seem wasteful, I admit) then maybe U+0001 START OF HEADER U+xxxx whatever U+0002 START OF TEXT That's at least semantically meaningful. Other ideas, just looking at the ASCII control characters: U+xxxx whatever U+001F UNIT SEPARATOR or U+000E SHIFT OUT U+xxxx whatever U+000F SHIFT IN I don't really mind, but it's always nice to plan ahead.
        Hide
        Robert Muir added a comment -

        what if we simply make it so there is no boolean option for a marker character, instead it is ReverseFilter() and ReverseFilter(char marker)
        This way, lucene does not define the character used for this operation, and someone can feel free to select whichever they want (such as U+0001)

        When we are on java 5 and can support supp. characters properly (reversing/wildcards,etc), then we can change this to ReverseFilter(int marker) and someone can use anything they want, including outside of the BMP?

        If this is ok, I will supply a patch.

        Show
        Robert Muir added a comment - what if we simply make it so there is no boolean option for a marker character, instead it is ReverseFilter() and ReverseFilter(char marker) This way, lucene does not define the character used for this operation, and someone can feel free to select whichever they want (such as U+0001) When we are on java 5 and can support supp. characters properly (reversing/wildcards,etc), then we can change this to ReverseFilter(int marker) and someone can use anything they want, including outside of the BMP? If this is ok, I will supply a patch.
        Hide
        Robert Muir added a comment -

        I looked into this and I think using the private use area (U+E000 to U+F8FF) would also not be the best.
        I do not think Lucene should use PUA characters system-internally, besides I have at least a few docs with PUA characters, and I think others will as well.
        We should leave PUA characters available to the end user.

        So personally I have nothing against this U+0001, but I'll take any recommendations...

        Show
        Robert Muir added a comment - I looked into this and I think using the private use area (U+E000 to U+F8FF) would also not be the best. I do not think Lucene should use PUA characters system-internally, besides I have at least a few docs with PUA characters, and I think others will as well. We should leave PUA characters available to the end user. So personally I have nothing against this U+0001, but I'll take any recommendations...
        Hide
        Paul Cowan added a comment - - edited

        Yeah, ok, makes sense.

        I'd suggest choosing a range of Private Use characters from the BMP block then, that's what they're for. Doesn't really matter which... we can pick a block of (say) 256 and use the first one for this, then others can be used for other purposes later if required. U+ECxx, maybe, because that's got 3 letters out of 'lucene' in it . So EC00 means 'reversed', and then people who need other similar filters can organise amongst themselves.

        Show
        Paul Cowan added a comment - - edited Yeah, ok, makes sense. I'd suggest choosing a range of Private Use characters from the BMP block then, that's what they're for. Doesn't really matter which... we can pick a block of (say) 256 and use the first one for this, then others can be used for other purposes later if required. U+ECxx, maybe, because that's got 3 letters out of 'lucene' in it . So EC00 means 'reversed', and then people who need other similar filters can organise amongst themselves.
        Hide
        Robert Muir added a comment -

        another issue, besides the fact they are deprecated, is that tag characters are outside of the BMP.

        Currently, the reverse filter does not properly reverse characters outside of the BMP [it does not recognize them as one character],
        This means characters such as tag characters will be 'reversed' into trail surrogate followed by lead surrogate (two unpaired surrogates).
        But we cannot fix the above, as lucene wildcard support does not recognize codepoints > FFFF as one 'character' either.

        If we are gonna pick a character other than U+0001, it needs to be inside the BMP.

        Show
        Robert Muir added a comment - another issue, besides the fact they are deprecated, is that tag characters are outside of the BMP. Currently, the reverse filter does not properly reverse characters outside of the BMP [it does not recognize them as one character] , This means characters such as tag characters will be 'reversed' into trail surrogate followed by lead surrogate (two unpaired surrogates). But we cannot fix the above, as lucene wildcard support does not recognize codepoints > FFFF as one 'character' either. If we are gonna pick a character other than U+0001, it needs to be inside the BMP.
        Hide
        Robert Muir added a comment -

        Sorry, I'm really anal about Unicode. Can't help it.

        Me too My problem with tag characters is that they are deprecated.

        I will take your advice and look and see if there is something more suitable.

        Show
        Robert Muir added a comment - Sorry, I'm really anal about Unicode. Can't help it. Me too My problem with tag characters is that they are deprecated. I will take your advice and look and see if there is something more suitable.
        Hide
        Mark Miller added a comment -

        Sorry, I'm really anal about Unicode. Can't help it.

        This is a full text search engine - we love it!

        Show
        Mark Miller added a comment - Sorry, I'm really anal about Unicode. Can't help it. This is a full text search engine - we love it!
        Hide
        Paul Cowan added a comment -

        Very very minor thing, but does it make more sense to choose a more suitable character? U+0001 is an assigned character, with some semantic meaning ("Start of Heading", same as ASCII character 0x01) which isn't really relevant to this use. It mightn't be a bad idea to (a) choose a control character which makes sense in context, if there is one (I can't see one, myself), (b) using a character from the private-use area (U+E000 to U+F8FF) or (c) my preferred option, using the Unicode tag characters. The tag characters are designed for just such a purpose.. embedding contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in unicode at present (language tag), which isn't suitable.

        That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag characters – say, U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse') and use that. This could lead to a de facto standard for Lucene fields, where different variations of the same token could use different leading tag characters. Rather than just everyone picking a character at random, this could lead to some sort of structure around similar situations (i.e. I could envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of the token, etc).

        Sorry, I'm really anal about Unicode. Can't help it.

        Show
        Paul Cowan added a comment - Very very minor thing, but does it make more sense to choose a more suitable character? U+0001 is an assigned character, with some semantic meaning ("Start of Heading", same as ASCII character 0x01) which isn't really relevant to this use. It mightn't be a bad idea to (a) choose a control character which makes sense in context, if there is one (I can't see one, myself), (b) using a character from the private-use area (U+E000 to U+F8FF) or (c) my preferred option, using the Unicode tag characters. The tag characters are designed for just such a purpose.. embedding contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in unicode at present (language tag), which isn't suitable. That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag characters – say, U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse') and use that. This could lead to a de facto standard for Lucene fields, where different variations of the same token could use different leading tag characters. Rather than just everyone picking a character at random, this could lead to some sort of structure around similar situations (i.e. I could envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of the token, etc). Sorry, I'm really anal about Unicode. Can't help it.
        Hide
        Mark Miller added a comment -

        +1

        Show
        Mark Miller added a comment - +1
        Hide
        Robert Muir added a comment -

        if no one objects to this one I will commit it tomorrow.

        Show
        Robert Muir added a comment - if no one objects to this one I will commit it tomorrow.
        Robert Muir made changes -
        Fix Version/s 2.9 [ 12312682 ]
        Lucene Fields [New] [New, Patch Available]
        Hide
        Robert Muir added a comment -

        thanks Andrzej, i think this patch is ready.

        Show
        Robert Muir added a comment - thanks Andrzej, i think this patch is ready.
        Andrzej Bialecki made changes -
        Attachment reverseMark-2.patch [ 12416671 ]
        Hide
        Andrzej Bialecki added a comment -

        Updated patch that moves the marking logic to incrementToken().

        Show
        Andrzej Bialecki added a comment - Updated patch that moves the marking logic to incrementToken().
        Hide
        Andrzej Bialecki added a comment -

        Either way is fine with me. To preserve the public API I think it's better to move this marking logic to incrementToken(). I'll prepare an updated patch.

        Show
        Andrzej Bialecki added a comment - Either way is fine with me. To preserve the public API I think it's better to move this marking logic to incrementToken(). I'll prepare an updated patch.
        Hide
        Robert Muir added a comment -

        andrzej, the reverse() methods are public, can you supply default impls (withMark=false) just in the case that someone is using them?

        alternatively, maybe the reverse() methods could stay the same, and the marking could happen in incrementToken() ?

        Show
        Robert Muir added a comment - andrzej, the reverse() methods are public, can you supply default impls (withMark=false) just in the case that someone is using them? alternatively, maybe the reverse() methods could stay the same, and the marking could happen in incrementToken() ?
        Hide
        Robert Muir added a comment -

        the corresponding solr task (SOLR-1321) is marked as version 1.4

        does anyone oppose putting this one in 2.9?

        Show
        Robert Muir added a comment - the corresponding solr task ( SOLR-1321 ) is marked as version 1.4 does anyone oppose putting this one in 2.9?
        Robert Muir made changes -
        Assignee Robert Muir [ rcmuir ]
        Andrzej Bialecki made changes -
        Field Original Value New Value
        Attachment reverseMark.patch [ 12416664 ]
        Hide
        Andrzej Bialecki added a comment -

        Patch and unit tests.

        Show
        Andrzej Bialecki added a comment - Patch and unit tests.
        Andrzej Bialecki created issue -

          People

          • Assignee:
            Robert Muir
            Reporter:
            Andrzej Bialecki
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development