Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      This is a Tokenizer that splits the input string into a series of paths. For example:

      /aaa/bbb/ccc

      becomes:

      /aaa/
      /aaa/bbb/
      /aaa/bbb/ccc
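The prefix expansion described above amounts to something like the following (a minimal illustrative sketch, not the attached patch; the class and method names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: expand a delimited path into its chain of ancestor prefixes,
// matching the example output above (trailing delimiters kept on prefixes).
public class PathPrefixes {
    public static List<String> expand(String path, char delimiter) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            sb.append(c);
            // Emit a token at every delimiter except a leading one.
            if (c == delimiter && i > 0) {
                tokens.add(sb.toString());
            }
        }
        // Emit the full path when it does not end with the delimiter.
        if (path.length() > 0 && path.charAt(path.length() - 1) != delimiter) {
            tokens.add(sb.toString());
        }
        return tokens;
    }
}
```

For the input `/aaa/bbb/ccc` with delimiter `/`, this yields the three tokens shown above.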

      1. SOLR-1057.patch
        14 kB
        Koji Sekiguchi
      2. SOLR-1057.patch
        13 kB
        Koji Sekiguchi
      3. SOLR-1057.patch
        12 kB
        Koji Sekiguchi
      4. SOLR-1057-PathTokenizerFactory.patch
        6 kB
        Ryan McKinley
      5. SOLR-1057-PathTokenizerFactory.patch
        5 kB
        Ryan McKinley

          Activity

          Ryan McKinley added a comment -

          updated to use reusable token stuff

          Lance Norskog added a comment -

          It is also useful to generate the reverse:
          /
          /ccc/bbb/aaa

          When /aaa/bbb/ccc is a tree- (or directed graph-) structured taxonomy, it is useful in a UI to start at the bottom and find all of the paths it is part of. For example, orange is both a fruit and a color:

          /color/warm/orange
          /food/fruit/orange

          I would want to say /orange* and get both of the above paths. I may or may not want to generate these:
          /ccc
          /ccc/bbb/

          So that should probably be another option.
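The reverse option sketched in this comment could look something like this (an illustrative sketch only; the class and method names are hypothetical, not from any patch):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: split a path into segments, reverse their order, and rebuild
// a delimited path, e.g. /aaa/bbb/ccc -> /ccc/bbb/aaa.
public class ReversePath {
    public static String reverse(String path, char delimiter) {
        String[] parts = path.split(Pattern.quote(String.valueOf(delimiter)));
        List<String> segments = new ArrayList<String>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                segments.add(p);
            }
        }
        Collections.reverse(segments);
        StringBuilder sb = new StringBuilder();
        for (String s : segments) {
            sb.append(delimiter).append(s);
        }
        return sb.toString();
    }
}
```

Emitting the intermediate suffixes (/ccc, /ccc/bbb/) would then be a separate option, as suggested above.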

          Lance Norskog added a comment -

          Also, the field must be multi-valued. This should be added to the Factory.

          Dave Craft added a comment -

          Hey.. any update on this one? I'm looking for a way for Solr to store a tree-structured category, e.g. a taxonomy. Perhaps there is a way to do this already? If someone could point me in the right direction, that would be great.

          Thanks

          Dave

          Koji Sekiguchi added a comment (edited) -

          I think this can be used for SOLR-64. I'll take it.

          TODO:

          • move PathTokenizer to modules/analysis/common/src/java/org/apache/lucene/analysis/path/ (4.0) or lucene/src/java/org/apache/lucene/analysis/ (3.1)
          • make test cases
          • respect the original path delimiter (the current patch seems to output a backslash even when the input uses a slash)
          • accept an arbitrary delimiter and replacement
          • add offset correction
          Koji Sekiguchi added a comment -

          A new patch added.

          Ryan McKinley added a comment -

          looks good! I like the tests and configurable delimiter, but maybe we should allow multiple values?

          In my app, I need to apply this filter to Windows paths (C:\path), URLs, and Unix paths...

          Maybe this could take a string argument with max length 2? Then keep delimiter1 and delimiter2, and use:

          if (c == delimiter1 || c == delimiter2) {
          

          also the javadoc should replace 'somethine' with 'something'
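The two-delimiter idea above could be sketched like this (purely illustrative; the class and member names are made up, not from the patch):

```java
// Sketch: accept a delimiter string of one or two characters and treat
// either character as a path separator.
public class DelimiterCheck {
    private final String delimiters;

    public DelimiterCheck(String delimiters) {
        if (delimiters.isEmpty() || delimiters.length() > 2) {
            throw new IllegalArgumentException("expected 1 or 2 delimiter characters");
        }
        this.delimiters = delimiters;
    }

    // Equivalent to: c == delimiter1 || c == delimiter2
    public boolean isDelimiter(char c) {
        return delimiters.indexOf(c) >= 0;
    }
}
```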

          Koji Sekiguchi added a comment -

          Can you use MappingCharFilter to normalize backslash to slash?

          Ryan McKinley added a comment -

          that would work if this were a filter... but I would need to run the MappingCharFilter before the path tokenizer.

          Perhaps we should change this to a Filter, and use the KeywordTokenizer to start?

          Robert Muir added a comment -

          I'm a little confused about the use of the tokenizer (I have no problems technically; it's maybe a naming issue?)

          Is this intended for tokenizing file pathnames as its name would suggest? In this case I think the path should have positions, e.g. /foo/bar/whatever.txt is foo(1), bar(1), whatever.txt(1)?

          It seems instead, this one is intended for representing hierarchies, as it creates synonyms of /foo, /foo/bar, /foo/bar/whatever.txt... with position increments of zero.

          I guess I'm just being picky about naming, but I think this hierarchical case is more specific than 'tokenizing file pathnames', and maybe a name like HierarchyTokenizer (this one too probably isn't the best!) would better represent what it does?
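The synonym-style output described in this comment can be made concrete with a small sketch (plain data only, no Lucene APIs; the names here are invented): the first expanded path advances the position, and every further expansion stacks at the same position with increment 0.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: attach position increments to hierarchy tokens the way the
// comment describes -- 1 for the first token, 0 for the rest.
public class HierarchyTokens {
    public static List<String> withIncrements(List<String> paths) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < paths.size(); i++) {
            // First token advances the position; the rest are "synonyms".
            int posInc = (i == 0) ? 1 : 0;
            out.add(paths.get(i) + "(" + posInc + ")");
        }
        return out;
    }
}
```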

          Ryan McKinley added a comment -

          Maybe PathHierarchyTokenizer?

          Yes, the point is to preserve the folder/path structure.

          Koji Sekiguchi added a comment -

          that would work if this were a filter... but I would need to run the MappingCharFilter before the path tokenizer.

          CharFilters run before Tokenizer.

          Maybe PathHierarchyTokenizer?

          +1.

          Koji Sekiguchi added a comment -

          New patch. Renamed it to PathHierarchyTokenizer.

          Koji Sekiguchi added a comment -

          A new patch. To address Ryan's requirement, I added the following test:

          public void testNormalizeWinDelimToLinuxDelim() throws Exception {
            NormalizeCharMap normMap = new NormalizeCharMap();
            normMap.add("\\", "/");
            String path = "c:\\a\\b\\c";
            CharStream cs = new MappingCharFilter(normMap, new StringReader(path));
            PathHierarchyTokenizer t = new PathHierarchyTokenizer( cs );
            assertTokenStreamContents(t,
                new String[]{"c:", "c:/a", "c:/a/b", "c:/a/b/c"},
                new int[]{0, 0, 0, 0},
                new int[]{2, 4, 6, 8},
                new int[]{1, 0, 0, 0},
                path.length());
          }
          
          Ryan McKinley added a comment -

          I was totally unaware of CharFilters and how they are called before the Tokenizer.

          thanks Koji!

          Koji Sekiguchi added a comment -

          Committed revision 1067131 (trunk). I'll backport to 3x tomorrow because I have to move now.

          Yonik Seeley added a comment -

          Is this generally applicable enough we should add an entry in
          http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
          or is there a better place?

          Ryan McKinley added a comment -

          I think it will warrant its own page... in combination with SOLR-64

          but should probably also be linked on the main analyzer page.

          Koji Sekiguchi added a comment -

          Committed revision 1067352 (3x).

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release


            People

            • Assignee:
              Koji Sekiguchi
            • Reporter:
              Ryan McKinley
            • Votes:
              1
            • Watchers:
              2
