Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: search
    • Labels:
      None

      Description

      A TokenizerFactory that makes tokens from:

      string.split( regex );
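As a rough sketch of the proposed behavior (hypothetical class and method names, not the patch itself), tokenizing is just delegating to String.split:

```java
public class SplitSketch {
    // Hypothetical helper: produce tokens exactly as String.split would.
    // Note: each call recompiles the regex internally.
    public static String[] tokens(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // "--" delimiter borrowed from the config example further down
        for (String t : tokens("Architecture--United States--19th century", "--")) {
            System.out.println(t);
        }
    }
}
```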

      1. SOLR-211-RegexSplitTokenizer.patch
        9 kB
        Ryan McKinley
      2. SOLR-211-RegexSplitTokenizer.patch
        7 kB
        Ryan McKinley
      3. SOLR-211-RegexSplitTokenizer.patch
        5 kB
        Ryan McKinley

        Activity

        Ryan McKinley added a comment -

        simple regex tokenizer and a test.

        <fieldType name="splitText" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
        <tokenizer class="solr.RegexSplitTokenizerFactory" regex="--"/>
        <filter class="solr.TrimFilterFactory" />
        </analyzer>
        </fieldType>

        Given a field:
        "Architecture--United States--19th century"

        will create tokens for:
        "Architecture"
        "United States"
        "19th century"

        Hoss Man added a comment -

        some quick comments based on a cursory reading of the patch...

        1) RegexSplitTokenizerFactory.init should probably compile the regex into a Pattern that can be reused more than once ... i think String.split recompiles it each time.
        2) i don't think the offset stuff will work properly ... the length of the regex string is not the same as the length of the string it matches on when splitting (ie: \p{javaWhitespace}) ... we would probably need to use the Matcher API and iterate over the individual matches.
        3) in the vein of like things having like names, we may want to call this the PatternSplitTokenizer and name its init param "pattern" (to match PatternReplaceFilter)
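The first two points above can be sketched roughly like this (hypothetical class and method names, not the patch code): compile the Pattern once at init, and derive token boundaries from the Matcher's actual match offsets rather than the length of the regex string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherSplitSketch {
    // Split using an already-compiled Pattern, tracking real match offsets.
    public static List<String> split(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        int start = 0;
        while (matcher.find()) {
            // the delimiter's true extent is [matcher.start(), matcher.end()),
            // which need not equal the length of the regex string itself
            if (matcher.start() > start) {
                tokens.add(input.substring(start, matcher.start()));
            }
            start = matcher.end();
        }
        if (start < input.length()) {
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Compile once; String.split(regex) recompiles on every call.
        Pattern ws = Pattern.compile("\\p{javaWhitespace}+");
        System.out.println(split(ws, "foo   bar\tbaz"));
    }
}
```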

        Yonik Seeley added a comment -

        > should probably compile the regex [...]

        Yep... beat me to it.
        I was off trying to look up if there was a way to avoid reading everything into a String too... but I don't see a way to use a regex directly on a Reader.

        Hoss Man added a comment -

        > but I don't see a way to use a regex directly on a Reader.

        ...I think it's pretty much impossible to have a robust regex system that can operate on character streams, regex engines need to be able to backup .... a lot.

        Ryan McKinley added a comment -

        Thanks for the quick feedback!

        Here is an updated version that

        1. uses a compiled Pattern
        2. uses matcher.find() to set proper start and end offsets
        3. is called PatternSplitTokenizerFactory
        4. The tests make sure the output is the same as you would get with string.split( pattern )

        Ryan McKinley added a comment -

        Using a Matcher to generate the tokens makes it easy enough to return the match as a token – not just the split()

        • Updated to take a "group" argument - if the group is less than zero, it behaves as a split; otherwise it uses the matched group as the token.
        • Changed the name to PatternTokenizerFactory as it is more general than just split
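The "group" semantics described above can be sketched like this (hypothetical class and method names, a simplified illustration rather than the actual factory code; trailing empty tokens are dropped in the split branch as a simplification):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupTokenSketch {
    // group < 0: behave like a split on the pattern;
    // group >= 0: emit that matched group as each token.
    public static List<String> tokenize(Pattern pattern, String input, int group) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        if (group >= 0) {
            while (m.find()) {
                tokens.add(m.group(group));
            }
        } else {
            int start = 0;
            while (m.find()) {
                tokens.add(input.substring(start, m.start()));
                start = m.end();
            }
            if (start < input.length()) {
                tokens.add(input.substring(start));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // split mode: pattern marks the delimiters
        System.out.println(tokenize(Pattern.compile("\\s*,\\s*"), "a, b ,c", -1));
        // group mode: pattern marks the tokens themselves
        System.out.println(tokenize(Pattern.compile("[a-z]+"), "a, b ,c", 0));
    }
}
```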
        Ken Krugler added a comment -

        I think we must be working on similar types of projects

        I did something similar to the above, but in two different ways:

        1. I extended WhitespaceTokenizerFactory to take optional pattern & replacement parameters. If these exist, then I apply them before the tokenizer gets called. This lets me do something like strip out all XML fields other than the content of the one that I want to index from a bunch of XML going into a Solr field.
        2. I added a CSVTokenizerFactory, which takes an optional split character and an optional remapping file. This lets me get a field like "Java,Python,C#" and turn it into "java python csharp", which are the index tokens I need, while leaving the display text as-is.

        I don't know if your new PatternTokenizerFactory could replace either of these, though. For the first case, I still want the white space tokenization after I've stripped off all the junk I don't want. And for the second, I need to be able to do the remapping.

        Ryan McKinley added a comment -

        >
        > I don't know if your new PatternTokenizerFactory could replace either of these, though. For the first case, I still want the white space tokenization after I've stripped off all the junk I don't want. And for the second, I need to be able to do the remapping.
        >

        If you're really good with regular expressions, perhaps it could all be combined... I'm not

        In my real use case, I use the general PatternTokenizerFactory to split the input into a bunch of tokens, then I have a custom (ugly!) TokenFilter transform the stream with other one-off transformations similar to what you describe.

        Ryan McKinley added a comment -

        added in rev:532508

        I'm not sure how to make the svn changelog show up in JIRA. It looks like issues may get automatically linked if you start the svn comment with SOLR-XXX. Is this true?

        https://issues.apache.org/jira/browse/SOLR-104?page=com.atlassian.jira.plugin.ext.subversion:subversion-commits-tabpanel

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.2

        The Fix Version for all 39 issues found was set to 1.2, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman2


          People

          • Assignee:
            Ryan McKinley
            Reporter:
            Ryan McKinley
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue
