Solr / SOLR-1061

Improve RegexTransformer to create multiple columns from regex groups

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Labels:
      None

      Description

      Example:

      <field column="full_name" regex="Mr(\w*)\b(\w*)" groupNames="firstName,secondName"/>
      

      This is more efficient for extracting multiple values from a single string. If some groups need to be omitted, just leave the corresponding name in groupNames empty.
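The mapping from regex groups to columns can be sketched in plain Java. This is a minimal illustration, not the RegexTransformer code from the patch: the class name `GroupNamesSketch`, the method `extract`, and the sample pattern `Mr\s+(\w+)\s+(\w+)` are assumptions chosen for the example. Each capturing group is paired by position with a name from the comma-separated list, and an empty name skips that group.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNamesSketch {

    // Pairs capturing group i+1 with names[i]; an empty name omits that group.
    static Map<String, String> extract(String value, String regex, String groupNames) {
        Map<String, String> columns = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(value);
        if (!m.find()) {
            return columns; // no match: this sketch produces no columns
        }
        String[] names = groupNames.split(",", -1); // -1 keeps empty names
        for (int i = 0; i < names.length && i < m.groupCount(); i++) {
            String name = names[i].trim();
            if (!name.isEmpty() && m.group(i + 1) != null) {
                columns.put(name, m.group(i + 1));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String regex = "Mr\\s+(\\w+)\\s+(\\w+)"; // hypothetical sample pattern
        System.out.println(extract("Mr John Smith", regex, "firstName,secondName"));
        System.out.println(extract("Mr John Smith", regex, ",secondName"));
    }
}
```

The input is matched once and every requested column is filled from that single match, which is where the efficiency gain over one regex per column comes from.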

      1. SOLR-1061.patch
        8 kB
        Noble Paul
      2. SOLR-1061.patch
        5 kB
        Noble Paul
      3. SOLR-1061.patch
        4 kB
        Noble Paul

        Activity

        Fergus McMenemie added a comment -

        Yes, yes. Another use case I ran into a lot was having lat/long within the same XML field; this would have been really useful. I guess that if the matcher fails, the columns firstName and secondName are left undefined? Although the above is neat and clean, the same result can of course already be achieved as follows:

           <field column="firstName"       regex="Mr(\w*)\b\w*" replaceWith="$1"  sourceColName="full_name"/>
           <field column="secondName" regex="Mr\w*\b(\w*)" replaceWith="$1"  sourceColName="full_name"/>
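In plain Java terms, the per-column approach amounts to one regex replacement per output column. The sketch below is illustrative only: the class name `ReplaceWithSketch`, the helper `replace`, and the anchored sample patterns are assumptions, not the transformer's implementation.

```java
import java.util.regex.Pattern;

public class ReplaceWithSketch {

    // Replaces every match of regex in value; "$1" in replaceWith refers to group 1.
    static String replace(String value, String regex, String replaceWith) {
        return Pattern.compile(regex).matcher(value).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        String fullName = "Mr John Smith";
        // One pass over the source string per output column.
        String first = replace(fullName, "^Mr\\s+(\\w+)\\s+\\w+$", "$1");
        String second = replace(fullName, "^Mr\\s+\\w+\\s+(\\w+)$", "$1");
        System.out.println(first + " / " + second);
    }
}
```

The source value is scanned once for each column, whereas a single regex with several groups needs only one scan.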
        

        I would also expect the following to be a common related use case: imagine a field that lists an indeterminate number of aliases or alternate names for a person. This is bad data design, but it happens. We need to expose the regex global-match feature:

        <firstName>josephine</firstName>
        <aliases>jo,joe,jos</aliases>
        
           <field column="alias" regex="([^,]+)"  regex_options="global" sourceColName="aliases"/>
        

        which would populate the column alias with multiple values. The attribute regex_options allows other regex options such as case insensitivity to be added as well.
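A rough Java sketch of the proposed global matching: Java has no global flag, so the equivalent is to call `Matcher.find()` repeatedly and collect each match. The class name `GlobalMatchSketch` and the method `findAll` are hypothetical, and `regex_options` above is a proposed attribute rather than an existing one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlobalMatchSketch {

    // Collects group 1 of every successive match, similar to Perl's /g modifier.
    static List<String> findAll(String value, String regex) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(value);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(findAll("jo,joe,jos", "([^,]+)")); // prints [jo, joe, jos]
    }
}
```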

        Noble Paul added a comment -

        Fix, with no test case.

        Noble Paul added a comment -

        fix w/o testcase

        Noble Paul added a comment -

        there was a bug in the last patch

        Noble Paul added a comment -

        with testcase

        Shalin Shekhar Mangar added a comment -

        Committed revision 755143.

        Thanks Noble!

        Fergus, can you open another issue for the other enhancements you mentioned? Also, I see only these flags in java.util.regex.Pattern:

        1. CASE_INSENSITIVE
        2. MULTILINE
        3. DOTALL
        4. UNICODE_CASE
        5. CANON_EQ

        What does global do exactly?
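For context, `java.util.regex.Pattern` indeed has no "global" flag. In Perl-style dialects the g modifier requests every match instead of only the first, which in Java corresponds to calling `Matcher.find()` repeatedly. The flags that do exist are integer constants combined with bitwise OR, as in this small sketch (the class name `PatternFlagsSketch` and the method `containsIgnoringCase` are hypothetical):

```java
import java.util.regex.Pattern;

public class PatternFlagsSketch {

    // Compiles the regex with case-insensitive matching, including non-ASCII letters.
    static boolean containsIgnoringCase(String value, String regex) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoringCase("JOSEPHINE", "josephine")); // prints true
    }
}
```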

        Grant Ingersoll added a comment -

        Bulk close Solr 1.4 issues


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Noble Paul