Lucene - Core
LUCENE-3071

PathHierarchyTokenizer adaptation for urls: splits reversed

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None

      Description

      PathHierarchyTokenizer should be usable to split URLs in a "reversed" way (useful for faceted search against URLs):
      www.site.com -> www.site.com, site.com, com

      Moreover, it should be able to skip a given number of first (or last, if reversed) tokens:
      /usr/share/doc/somesoftware/INTERESTING/PART
      Should give with 4 tokens skipped:
      INTERESTING
      INTERESTING/PART
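The requested behavior can be sketched as a small simulation (hypothetical Python, not the actual Lucene tokenizer; the function name and signature are illustrative, and delimiter handling and offsets are simplified away):

```python
def path_tokens(path, delimiter="/", reverse=False, skip=0):
    """Simulate the hierarchy tokens PathHierarchyTokenizer would emit."""
    parts = [p for p in path.split(delimiter) if p]
    if reverse:
        # Reverse mode drops the last `skip` components, then peels
        # components off the front:
        # www.site.com -> www.site.com, site.com, com
        parts = parts[:len(parts) - skip] if skip else parts
        return [delimiter.join(parts[i:]) for i in range(len(parts))]
    # Normal mode drops the first `skip` components, then grows from the front:
    # /usr/share/doc/somesoftware/INTERESTING/PART with skip=4
    # -> INTERESTING, INTERESTING/PART
    parts = parts[skip:]
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
```

For example, `path_tokens("www.site.com", delimiter=".", reverse=True)` yields the three tokens from the description above.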

      1. ant.log.tar.bz2
        15 kB
        Olivier Favre
      2. LUCENE-3071.patch
        22 kB
        Ryan McKinley
      3. LUCENE-3071.patch
        19 kB
        Ryan McKinley
      4. LUCENE-3071.patch
        18 kB
        Olivier Favre
      5. LUCENE-3071.patch
        18 kB
        Olivier Favre

        Activity

        Olivier Favre added a comment -

        Proposed patch attached.

        Working against Lucene 3.1 (remove the trailing path.length() parameter from the assert call).
        But I am having difficulties making the tests work against trunk (ant and ant test fail, at global scope).

        Robert Muir added a comment -

        But I am having difficulties making the tests work against trunk (ant and ant test fail, at global scope).

        Can you provide more details about this?

        If possible, stuff like the ant version, whether you are using an svn checkout (and what the full path is), logs of the error messages, etc. would be great.

        Feel free to open a new jira issue for these problems!

        Olivier Favre added a comment -

        I'm using Ubuntu 10.04.2 LTS.
        ant -version
        Apache Ant version 1.7.1 compiled on September 8 2010
        I followed the wiki: http://wiki.apache.org/lucene-java/HowToContribute
        I used svn checkout http://svn.eu.apache.org/repos/asf/lucene/dev/trunk/ lucene-trunk.
        I'm working under revision 1099843 (yours).
        See ant log attached.

        Robert Muir added a comment -

        Hi Olivier, thanks for uploading the log.

        This test fails for me sometimes too, somehow we should get to the bottom of it.
        I opened an issue: SOLR-2500

        As a workaround, perhaps using 'ant clean test' will help... I fought with this
        test a little bit the other day and somehow 'clean' seemed to temporarily get the
        test passing...

        Olivier Favre added a comment -

        ant clean test did it for me, thanks!

        As for the failing tests, it is because of the finalOffset that I set to path.length().
        I'm not sure whether I should use path.length(), as my tokens don't go up to there when using the reverse mode.
        When I take a look at the end() function, I think that I "should" set it to the end of the string. But I can't see it in the javadoc.
        If the purpose of the finalOffset parameter in assertTokenStreamContents() is to make sure of the endOffset of the last term, then I should not use path.length() blindly when using reverse and skip.

        Can you help me with the purpose of finalOffset? Or can I simply skip it in my tests (they are working if I skip it)?

        Thanks

        Robert Muir added a comment -

        Can you help me with the purpose of finalOffset? Or can I simply skip it in my tests (they are working if I skip it)?

        The finalOffset is supposed to be the offset of the entire document; this is useful so that offsets are correct on multivalued fields.

        Example multivalued field "foo" with two values:
        "bar " <-- this one ends with a space
        "baz"

        With a whitespace tokenizer, value 1 will have a single token "bar" with startOffset=0, endOffset=3. But finalOffset needs to be 4 (essentially, however many chars you read in from the Reader).

        This way, the offsets will then accumulate correctly for "baz".
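The accumulation described here can be sketched like this (hypothetical Python, not Lucene's actual API; it ignores Solr's configurable position/offset gaps and just shows how each value's full length, rather than its last token's endOffset, advances the offset base):

```python
def tokenize_multivalued(values):
    """Whitespace-tokenize each value of a multivalued field.

    Offsets accumulate across values using each value's full character
    length (the finalOffset), not the end of its last token.
    """
    base = 0
    tokens = []
    for value in values:
        search_from = 0
        for word in value.split():
            start = value.index(word, search_from)
            end = start + len(word)
            tokens.append((word, base + start, base + end))
            search_from = end
        base += len(value)  # finalOffset: len("bar ") == 4, not endOffset 3
    return tokens
```

With `["bar ", "baz"]` this yields ("bar", 0, 3) followed by ("baz", 4, 7): the trailing space in the first value is reflected in the second value's offsets, matching the example above.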

        Olivier Favre added a comment - - edited

        Fixed patch attached.

        Tests run fine now.

        Ready to ship?

        Robert Muir added a comment -

        Hi Olivier, at a glance the patch looks really great to me, thanks!

        I took a quick look (not in enough detail) but had these thoughts, neither of which I think is really mandatory for this feature, just ideas:

        • Do you think it would be cleaner if we made it a separate tokenizer (e.g. ReversePath....)? The main logic of the tokenizer seems to be completely split, depending on whether you are reversing or not.
        • I think it's possible in the future we could simplify the way finalOffset is being tracked, such that we just accumulate it on every read(), and then correctOffset a single time in end(). (I don't think this has much to do with your patch, just looking at the code in general.)
        Ryan McKinley added a comment -

        i'll take a look

        Ryan McKinley added a comment -

        do you think it would be cleaner if we made it a separate tokenizer?

        I think it's a tossup – keeping them together means one less factory in solr (not much of an argument), and the other three parameters (delimiter, replacement, skip) are nice to keep consistent.

        Ryan McKinley added a comment -

        Updated patch that includes the solr factory.

        Robert, if this looks ok to you, I will go ahead and commit.

        Robert Muir added a comment -

        having keeping them together makes one less factory in solr (not much of an argument)

        I don't understand this?

        You can still have one solr factory, if reverse=true it creates ReverseXXX...

        Hoss Man added a comment -

        You can still have one solr factory, if reverse=true it creates ReverseXXX...

        right ... if it makes the code cleaner to have two distinct Tokenizer impls, they can still share one factory.

        Ryan McKinley added a comment -

        split into two classes... and two test classes

        Robert Muir added a comment -

        this looks great!

        Ryan McKinley added a comment -

        I had some trouble merging from trunk to 3.x, and needed to just copy a few files – i hope that does not muck stuff up.

        Thanks Olivier!

        Simon Willnauer added a comment -

        we got failures due to a missing file on 3.x - reopen

        Simon Willnauer added a comment -

        I had some trouble merging from trunk to 3.x, and needed to just copy a few files – i hope that does not muck stuff up.

        That change didn't seem to be hard to merge – what did you do instead of merging? Maybe next time you should ask how to do it before copying stuff from trunk to branch?

        We also have a "how to merge" wiki here: http://wiki.apache.org/lucene-java/SvnMerge

        I merged the missing file in and committed. I will close once hudson is building correctly.

        Ryan McKinley added a comment -

        thanks simon – sorry about that, i will check for merge help next time.

        Olivier Favre added a comment -

        It seems you forgot to commit lucene/contrib/CHANGES.txt to describe the new feature.

        Regards

        Ryan McKinley added a comment -

        check:
        http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/CHANGES.txt?p2=%2Flucene%2Fdev%2Fbranches%2Fbranch_3x%2Flucene%2FCHANGES.txt&p1=%2Flucene%2Fdev%2Fbranches%2Fbranch_3x%2Flucene%2FCHANGES.txt&r1=1100033&r2=1100032&view=diff&pathrev=1100033

        It is on the 3.x branch – my understanding is that will get merged with the trunk changes when 4.x is valid.

        I think this was the resolution of how we will handle changes, but now that you ask, i am not sure... anyone know what the current CHANGES policy is?

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Simon Willnauer added a comment -

        Ryan, are you working on this? This seems to have a working patch (at the time being) – any reason why this didn't make it? If you are not working on this, can you move it to unassigned?

        Ryan McKinley added a comment -

        Looks like this was actually resolved long ago... the only thing pending was that I did not understand how to manage the CHANGES.txt from 3.x and /trunk

        Uwe Schindler added a comment -

        Bulk close after release of 3.5


          People

          • Assignee: Ryan McKinley
          • Reporter: Olivier Favre
          • Votes: 1
          • Watchers: 1


               Time Tracking

               • Original Estimate: 2h
               • Remaining Estimate: 2h
               • Time Spent: Not Specified
