SOLR-6937

In schemaless mode, replace spaces and special characters in field names with underscores

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 5.1, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Assuming spaces in field names are still bad, we should automatically convert them to not have spaces. For instance, I indexed the Citibike public data set, which has these column headers:

      "tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"

      My vote would be to replace spaces w/ underscores.

      1. SOLR-6937.patch
        9 kB
        Noble Paul
      2. SOLR-6937.patch
        5 kB
        Noble Paul

        Activity

        Hoss Man added a comment -

        My vote would be to replace spaces w/ underscores.

        Could probably be solved with a ~6 line subclass of FieldMutatingUpdateProcessor

        Erik Hatcher added a comment -

        Chris Hostetter (Unused), I tried this:

        public class NormalizeFieldNameUpdateProcessorFactory extends FieldMutatingUpdateProcessorFactory {
          @Override
          public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new FieldMutatingUpdateProcessor(getSelector(), next) {
              @Override
              protected SolrInputField mutate(SolrInputField src) {
                src.setName(src.getName().replace(' ', '_'));
                return src;
              }
            };
          }
        }
        

        And got this error:

        <lst name="error"><str name="msg">mutate returned field with different name: field with spaces =&gt; field_with_spaces</str><str name="trace">org.apache.solr.common.SolrException: mutate returned field with different name: field with spaces =&gt; field_with_spaces...
        

        Are there problems that would result when changing the name of a field in FieldMutatingUpdateProcessor?

        Hoss Man added a comment -

        Are there problems that would result when changing the name of a field in FieldMutatingUpdateProcessor?

        I suspect I put that in as a sanity check to protect the surface area of the API. I don't know if relaxing that will cause problems, or if it's just something that's there because the ramifications of allowing it aren't really well tested in the rest of the FieldMutating code paths.

        In particular: what does it mean? Should the old field name be removed? Should the corresponding field:value pair be removed, but other instances of that field (field:value2) be left in (i.e., what if the mutator renames one instance of the field but not another)?

        The easiest thing would probably be to implement field renaming as a complete one-off special UpdateProcessor w/o using the FieldMutating framework (i.e., no config, just something barebones for use in schemaless that can maybe later be re-parented in the class hierarchy to support more config options).
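
One way to answer the "what does it mean?" question is to rename at the document level rather than per field instance: every value moves to the new name and the old name disappears. The sketch below illustrates that semantics only. It deliberately avoids Solr classes; the `FieldRenameSketch` class and its map-based document are hypothetical stand-ins, not the code that was committed for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an input document: field name -> list of values.
final class FieldRenameSketch {

    // Returns a new document in which every field name has spaces replaced by
    // underscores. All values of a field move to the new name, so the old name
    // is gone; if two names collide after renaming, their values are merged.
    static Map<String, List<String>> renameFields(Map<String, List<String>> doc) {
        Map<String, List<String>> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> field : doc.entrySet()) {
            String newName = field.getKey().replace(' ', '_');
            renamed.computeIfAbsent(newName, k -> new ArrayList<>()).addAll(field.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("start station id", Arrays.asList("72"));
        doc.put("bikeid", Arrays.asList("17109"));
        System.out.println(renameFields(doc));
    }
}
```

Renaming whole fields sidesteps the case where one instance of a field is renamed and another is not, at the cost of merging values when two source names normalize to the same target.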

        Noble Paul added a comment - edited

        A new URP called FieldNameMutatingUpdateProcessorFactory.
        Example configuration to replace spaces with underscores:

            <processor class="solr.FieldNameMutatingUpdateProcessorFactory">
              <str name="pattern">\s</str>
              <str name="replacement">_</str>
            </processor>
        

        No test cases yet.
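
For readers unfamiliar with the two config values, here is a minimal sketch of what a `pattern`/`replacement` pair does to a field name, assuming the processor applies a regex replace-all to each name (the thread does not show the implementation, so that is an assumption). The `PatternReplacementSketch` class is hypothetical.

```java
import java.util.regex.Pattern;

final class PatternReplacementSketch {
    // Mirrors the two config values: pattern "\s" and replacement "_".
    private static final Pattern PATTERN = Pattern.compile("\\s");
    private static final String REPLACEMENT = "_";

    // Replaces every match of the pattern in the field name.
    static String rename(String fieldName) {
        return PATTERN.matcher(fieldName).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(rename("start station id"));
        System.out.println(rename("bikeid"));
    }
}
```

Note that `\s` matches any single whitespace character, so tabs are replaced too, and two consecutive spaces become two underscores.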

        Erik Hatcher added a comment -

        Noble Paul - looks good! The pattern should be expanded to include all the funky/problematic/illegal characters before committing, but in general +1.

        Noble Paul added a comment -

        The only catch is that if there are multiple patterns to match, you need multiple <processor> tags. I hope that is OK.
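
The effect of stacking several processors can be pictured as a list of pattern/replacement rules applied in order, each one seeing the output of the previous one. This is a sketch under that assumption; `ChainedRenameSketch`, `Rule`, and the second example rule are hypothetical and are not part of the patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ChainedRenameSketch {

    // One Rule plays the role of one <processor> element.
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }

        String apply(String name) {
            return pattern.matcher(name).replaceAll(replacement);
        }
    }

    // Applies the rules in order; later rules operate on already-renamed names.
    static String applyAll(List<Rule> rules, String name) {
        for (Rule rule : rules) {
            name = rule.apply(name);
        }
        return name;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("\\s", "_"),    // spaces -> underscores
                new Rule("[()]", ""));   // hypothetical second rule: drop parentheses
        System.out.println(applyAll(rules, "trip duration (sec)"));
    }
}
```

Because the rules run sequentially, their order matters whenever one rule's replacement could match a later rule's pattern.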

        Grant Ingersoll added a comment -

        +1

        Noble Paul added a comment -

        With tests. It now replaces all non-word characters except the hyphen (-).
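
A sketch of what "all non-word characters except hyphen" could look like as a Java regex. The exact pattern in the patch is not quoted in this thread, so the character classes below are assumptions; the second variant reflects the later "don't replace periods" commit on this issue. The `NonWordRenameSketch` class is hypothetical.

```java
import java.util.regex.Pattern;

final class NonWordRenameSketch {
    // Assumed pattern: anything that is not a word character or a hyphen.
    private static final Pattern NON_WORD_EXCEPT_HYPHEN = Pattern.compile("[^\\w-]");
    // Assumed variant that also leaves periods alone.
    private static final Pattern NON_WORD_EXCEPT_HYPHEN_PERIOD = Pattern.compile("[^\\w.-]");

    static String normalize(String fieldName) {
        return NON_WORD_EXCEPT_HYPHEN.matcher(fieldName).replaceAll("_");
    }

    static String normalizeKeepingPeriods(String fieldName) {
        return NON_WORD_EXCEPT_HYPHEN_PERIOD.matcher(fieldName).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(normalize("end station latitude"));
        System.out.println(normalize("birth-year"));
        System.out.println(normalizeKeepingPeriods("user.name"));
    }
}
```

In Java, `\w` covers only `[a-zA-Z_0-9]` by default, so with these patterns non-ASCII letters in a field name would also be replaced unless the pattern is compiled with `Pattern.UNICODE_CHARACTER_CLASS`.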

        ASF subversion and git services added a comment -

        Commit 1651587 from Noble Paul in branch 'dev/trunk'
        [ https://svn.apache.org/r1651587 ]

        SOLR-6937 In schemaless mode ,replace spaces and special characters with underscore

        ASF subversion and git services added a comment -

        Commit 1651588 from Noble Paul in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1651588 ]

        SOLR-6937 In schemaless mode ,replace spaces and special characters with underscore

        ASF subversion and git services added a comment -

        Commit 1651589 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1651589 ]

        SOLR-6937 In schemaless mode ,replace spaces and special characters with underscore

        ASF subversion and git services added a comment -

        Commit 1651646 from Noble Paul in branch 'dev/trunk'
        [ https://svn.apache.org/r1651646 ]

        SOLR-6937 don't replace periods

        ASF subversion and git services added a comment -

        Commit 1651647 from Noble Paul in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1651647 ]

        SOLR-6937 don't replace periods

        ASF subversion and git services added a comment -

        Commit 1651648 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1651648 ]

        SOLR-6937 don't replace periods

        ASF subversion and git services added a comment -

        Commit 1652651 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1652651 ]

        reverting SOLR-6937

        ASF subversion and git services added a comment -

        Commit 1652967 from Noble Paul in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1652967 ]

        SOLR-6937 In schemaless mode ,replace spaces and special characters with underscore, This impacts usability so ,it is voted to be in 5.0

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee:
            Noble Paul
            Reporter:
            Grant Ingersoll
          • Votes:
            0
            Watchers:
            5