data_driven configs defaults to "strings" for unmapped fields, makes most fields containing "textual content" unsearchable, breaks tutorial examples

    Details

      Description

      James Pritchett pointed out on the solr-user list that this sample query from the quick start tutorial matched no docs (even though the tutorial text says "The above request returns only one document")...

      http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:foundation

      The root problem seems to be that the add-unknown-fields-to-the-schema chain in data_driven_schema_configs is configured with...

      <str name="defaultFieldType">strings</str>
      

      ...and the "strings" type uses StrField and is not tokenized.
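To see why `q=name:foundation` finds nothing, here is a minimal sketch in plain Java with no Solr dependency. It contrasts an untokenized field, where the whole value is a single term, with a tokenized one. The class and method names are hypothetical, the sample value is made up, and the tokenizer is a crude lowercase-and-split stand-in for Solr's analysis chain, not the actual StrField or text_general behavior.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical illustration only: models which terms a field value yields.
final class TokenizationSketch {

    // Untokenized (StrField-like): the entire value is indexed as one term.
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    // Tokenized (text-like): crude stand-in that lowercases and splits on
    // anything that is not a letter or digit.
    static List<String> tokenizedTerms(String value) {
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // A term query matches only if the query term equals an indexed term exactly.
    static boolean matches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String name = "The Apache Software Foundation"; // made-up sample value
        System.out.println(matches(untokenizedTerms(name), "foundation")); // false
        System.out.println(matches(tokenizedTerms(name), "foundation"));   // true
    }
}
```

In this model the untokenized field matches only a query for the complete, case-exact value, which is the behavior reported for the tutorial query.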


      Original thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201609.mbox/%3CCAC-n2zRPsspfnK43AGeCspchc5b-0FF25xLfnzogYuVyg2dWbw@mail.gmail.com%3E

      1. SOLR-9526.patch
        45 kB
        Jan Høydahl
      2. SOLR-9526.patch
        41 kB
        Jan Høydahl
      3. SOLR-9526.patch
        27 kB
        Steve Rowe
      4. SOLR-9526.patch
        27 kB
        Jan Høydahl
      5. SOLR-9526.patch
        26 kB
        Jan Høydahl
      6. SOLR-9526.patch
        26 kB
        Jan Høydahl
      7. SOLR-9526.patch
        16 kB
        Jan Høydahl

        Issue Links

          Activity

          shalinmangar Shalin Shekhar Mangar added a comment -

          Bulk close after 7.1.0 release

          jira-bot ASF subversion and git services added a comment -

          Commit 562613dc8f7906b5b7c123a6a6ed5726674e09e4 in lucene-solr's branch refs/heads/branch_7_0 from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=562613d ]

          SOLR-9526: Update Ref Guide for schemaless changes

          jira-bot ASF subversion and git services added a comment -

          Commit 1ecc0344ef45571bf8aaf84c8a37e8d18e17a0c2 in lucene-solr's branch refs/heads/branch_7x from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1ecc034 ]

          SOLR-9526: Update Ref Guide for schemaless changes

          jira-bot ASF subversion and git services added a comment -

          Commit aff647ecfaf5af3bbeb2363b82821c53c5df7f3d in lucene-solr's branch refs/heads/master from Cassandra Targett
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aff647e ]

          SOLR-9526: Update Ref Guide for schemaless changes

          janhoy Jan Høydahl added a comment -

          Thanks Steve.

          steve_rowe Steve Rowe added a comment - - edited

          I brought the AddSchemaFieldsUpdateProcessorFactory javadocs up to date.

          I also looked at all mentions of "schemaless" and "data-driven" in the ref guide, and didn't find any other places that needed updating.

          jira-bot ASF subversion and git services added a comment -

          Commit 510608decb5e4ce5b6184d86662af5bd33e1be11 in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=510608d ]

          SOLR-9526: fix javadocs

          jira-bot ASF subversion and git services added a comment -

          Commit 20fccd286d207363b154fca61e5aa49824dbf295 in lucene-solr's branch refs/heads/branch_7x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=20fccd2 ]

          SOLR-9526: fix javadocs

          jira-bot ASF subversion and git services added a comment -

          Commit 5e7fa4ceee8f31fcf90254e96d1476281faa922b in lucene-solr's branch refs/heads/branch_7_0 from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5e7fa4c ]

          SOLR-9526: fix javadocs

          janhoy Jan Høydahl added a comment -

          Finally this is done! Thanks to all participants. Will be exciting to see user reactions when they try this in 7.0.

          I urge all committers to give it a spin right now and also open new JIRAs for bugs, documentation that is wrong due to this etc.

          jira-bot ASF subversion and git services added a comment -

          Commit 257883d65c4cc4c366493a6d0cae908fbccaca8f in lucene-solr's branch refs/heads/branch_7_0 from Jan Høydahl
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=257883d ]

          SOLR-9526: Data driven schema now indexes text field "foo" as both "foo" (text_general) and as "foo_str" (string) to facilitate both search and faceting

          (cherry picked from commit a60ec1b)

          jira-bot ASF subversion and git services added a comment -

          Commit 451a203f2de8393b69751bf4351896cfc87bd9bd in lucene-solr's branch refs/heads/branch_7x from Jan Høydahl
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=451a203 ]

          SOLR-9526: Data driven schema now indexes text field "foo" as both "foo" (text_general) and as "foo_str" (string) to facilitate both search and faceting

          (cherry picked from commit a60ec1b)

          jira-bot ASF subversion and git services added a comment -

          Commit a60ec1b4321b023ec868d77bce71660e5a19ce47 in lucene-solr's branch refs/heads/master from Jan Høydahl
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a60ec1b ]

          SOLR-9526: Data driven schema now indexes text field "foo" as both "foo" (text_general) and as "foo_str" (string) to facilitate both search and faceting

          janhoy Jan Høydahl added a comment -

          New patch

          • Fix test failure TestConfigSetsAPI.testUserAndTestDefaultConfigsetsAreSame

          Now both ant test and ant precommit succeed on my Mac! Will commit this later today. We can follow up with doc fixes as we come across them.

          janhoy Jan Høydahl added a comment -

          Thanks a lot for the thorough work. I'm attaching another iteration

          • Added CHANGES.txt entries for "Upgrading" and "New features" sections. Please review.
          • Removed the need for <str name="defaultFieldType">strings</str> when one of the typeMappings has the new tag <bool name="default">true</bool>, also removed this from solrconfigs
          • Updated Ref-Guide, mainly schemaless-mode.adoc, to discuss the copyField. There may be other locations, examples etc. that also need updating...
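The fallback order described in the second bullet can be sketched roughly as follows. This is a simplified, hypothetical model in plain Java, not the code in AddSchemaFieldsUpdateProcessorFactory: the `Mapping` class, the `resolve` method, and the sample type name `plongs` are assumptions for illustration, and real typeMappings may list several value classes each.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of typeMapping resolution; not Solr's actual implementation.
final class TypeMappingSketch {

    static final class Mapping {
        final Class<?> valueClass;
        final String fieldType;
        final boolean isDefault; // models <bool name="default">true</bool>

        Mapping(Class<?> valueClass, String fieldType, boolean isDefault) {
            this.valueClass = valueClass;
            this.fieldType = fieldType;
            this.isDefault = isDefault;
        }
    }

    // Picks the field type for a value: an exact class match wins, then the
    // mapping flagged as default, then the legacy defaultFieldType (may be null).
    static String resolve(List<Mapping> mappings, Class<?> valueClass, String defaultFieldType) {
        Mapping fallback = null;
        for (Mapping m : mappings) {
            if (m.valueClass.equals(valueClass)) {
                return m.fieldType;
            }
            if (m.isDefault && fallback == null) {
                fallback = m;
            }
        }
        return fallback != null ? fallback.fieldType : defaultFieldType;
    }

    public static void main(String[] args) {
        List<Mapping> mappings = Arrays.asList(
                new Mapping(String.class, "text_general", true),
                new Mapping(Long.class, "plongs", false)); // sample type names
        System.out.println(resolve(mappings, Long.class, null));   // plongs
        System.out.println(resolve(mappings, Object.class, null)); // text_general
    }
}
```

With a mapping flagged as default, the separate `defaultFieldType` setting is only consulted when no mapping carries the flag, which is why it could be dropped from the shipped solrconfigs.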

          Precommit passes. There are several test failures, but they are unrelated as far as I can tell.

          steve_rowe Steve Rowe added a comment - - edited

          Attaching patch brought up to date with master (in particular, collapsing of data_driven_schema_configs and basic_configs into _default) - note that your original patch only modified solrconfig.xml on one of these and managed_schema on the other - I assume you had/have local changes that didn't make it into the patch Jan Høydahl? I made a couple of other changes; details below.

          > See new NOCOMMIT comments. I was using the ManagedIndexSchema method
          >
          >     public ManagedIndexSchema addCopyFields(String source, Collection<String> destinations, int maxChars)
          >
          > which does not have a persist=true/false argument, so calling it leaves the schema not persisted. Then I could not find a way to explicitly persist it since method
          >
          >     boolean persistManagedSchema(boolean createOnly)
          >
          > was not public. In this patch I've made it public and done a hacky instanceof check in AddSchemaFieldsUpdateProcessorFactory
          >
          >     if (newSchema instanceof ManagedIndexSchema) {
          >       // NOCOMMIT: Hack to avoid persisting schema once after addFields and then once after each copyField
          >       ((ManagedIndexSchema)newSchema).persistManagedSchema(false);
          >     }
          >
          > Steve Rowe, you wrote the addCopyFields() method a while ago, is there a cleaner way to make sure schema is persisted after adding a copyField?

          The design of ManagedIndexSchema's API was in support of the Schema REST API, where each resource was modifiable one at a time; "bulk" modifications weren't possible. In the new bulk schema API, though, the ordinary case involves multiple modifications; in this case, it is counter-productive to persist in the middle of a set of operations.

          SOLR-6476 (introducing schema "bulk" mode) added the option to not persist the schema after an operation; previously every operation was automatically persisted. This was added as an option because at the time, bulk and REST modes co-existed. SOLR-7682 added the ability to specify maxChars for copyField directives, and I intentionally left off the persist option of the new addCopyFields() method, because there was (intentionally) no way to invoke this capability via the (now deprecated) schema REST API, and the bulk schema API didn't need the persist option.

          Long story short: I think making persistManagedSchema() public is a natural consequence of the bulk schema API (and in support of bulk operations from other sources, e.g. this issue). It's just that nobody had gotten around to it yet.
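The "modify several things, then persist once" pattern argued for here can be sketched with a toy schema object. Everything in this block is hypothetical and for illustration only: `ToySchema` and its methods are not Solr's ManagedIndexSchema API, and the `_str` suffix and 256-character cutoff are borrowed from the patch notes elsewhere in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a managed schema; names are hypothetical, not Solr's API.
final class BulkPersistSketch {

    static final class ToySchema {
        final List<String> changes = new ArrayList<>();
        int persistCount = 0; // how many times the schema was written out

        // In-memory modification only; nothing is written yet.
        void addField(String name, String type) {
            changes.add("field:" + name + ":" + type);
        }

        // In-memory modification only; nothing is written yet.
        void addCopyField(String source, String dest, int maxChars) {
            changes.add("copyField:" + source + "->" + dest + ":" + maxChars);
        }

        // Explicit persist step, invoked by the caller once the batch is complete.
        void persist() {
            persistCount++;
        }
    }

    // Adds a guessed text field plus its string copy, then persists exactly once.
    static ToySchema addGuessedField(ToySchema schema, String name) {
        schema.addField(name, "text_general");
        schema.addCopyField(name, name + "_str", 256);
        schema.persist();
        return schema;
    }

    public static void main(String[] args) {
        ToySchema schema = addGuessedField(new ToySchema(), "foo");
        // Prints: 2 changes, 1 persist
        System.out.println(schema.changes.size() + " changes, " + schema.persistCount + " persist");
    }
}
```

Keeping the persist step separate from each modification is what lets a caller batch a field addition and its copyField rules into a single write.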

          In AddSchemaFieldsUpdateProcessorFactory.processAdd() in my patch I removed the instanceof ManagedIndexSchema check wrapping the call to persistManagedSchema(), as well as the NOCOMMITs, since the check if ( ! cmd.getReq().getSchema().isMutable()) at the beginning of the method already ensures that we're dealing with a ManagedIndexSchema.

          I also removed the following typeMapping that was added in your patch from URP chains add-fields-no-run-processor and parse-and-add-fields in solrconfig-add-schema-fields-update-processor-chains.xml - I'm assuming this is a vestige from an earlier concept of removing <defaultTypeMapping>, since these chains have <str name="defaultFieldType">text</str>? AddSchemaFieldsUpdateProcessorFactoryTest passes with my change:

          <lst name="typeMapping">
            <str name="valueClass">java.lang.String</str>
            <str name="fieldType">text</str>
          </lst>
          
          steve_rowe Steve Rowe added a comment -

          > Any luck Steve Rowe? I'd like for this to be in 7.0 from the get go to have a better OOTB experience with field guessing now that _default schema will be even more used.

          Sorry, didn't look yet, doing so now.

          anshumg Anshum Gupta added a comment - - edited

          I think we can get this into 7.0. I'm fine with this as it's an improvement that fixes things.

          janhoy Jan Høydahl added a comment -

          Any luck Steve Rowe? I'd like for this to be in 7.0 from the get go to have a better OOTB experience with field guessing now that _default schema will be even more used.

          steve_rowe Steve Rowe added a comment -

          > Steve Rowe please fill in your wisdom regarding my question above

          Sure, sorry for the delay, I'll investigate today and let you know what I find. (It's been long enough that I don't remember the situation there.)

          janhoy Jan Høydahl added a comment -

          Steve Rowe please fill in your wisdom regarding my question above

          janhoy Jan Høydahl added a comment -

          I have recorded a terminal session to show how this patch works, from creating a collection, to adding a doc, inspecting the schema, and verifying that the string copy is cut off. Enjoy:

          Alexandre Rafalovitch Don't you agree that this approach is better than having some field copying being done in URP and some in schema? You can now:

          1. Create a collection
          2. Define some fields up-front with schema REST API
          3. Start indexing documents and let other fields be guessed, searchable and facetable (_str)
          4. Inspect the schema created, and if you're happy you can switch to update.autoCreateFields=false or even copy the schema to another collection
          5. If you're not happy with some field guessing, you can modify schema with the API, changing type, removing/adding *_str copyField rules etc
          6. You can even create a typeMapping in the add-unknown-fields-to-the-schema chain that will copy all Integers to a _f float version or any other combination if it makes sense for you
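Steps 4 and 5 above come down to a few JSON request bodies. The sketch below only builds those bodies as strings and does not send them. It assumes the Config API's `set-user-property` command and the Schema API's `add-copy-field` and `delete-copy-field` commands as described in the Solr Reference Guide, so check them against the Solr version in use. The class and method names are hypothetical, and field names are assumed to need no JSON escaping.

```java
// Builds JSON request bodies only; sending them over HTTP is left out.
// Assumes field names contain no characters that need JSON escaping.
final class SchemaApiPayloadSketch {

    // Body for POST /solr/<collection>/config: turns off automatic field creation.
    static String disableAutoCreateFields() {
        return "{\"set-user-property\": {\"update.autoCreateFields\": \"false\"}}";
    }

    // Body for POST /solr/<collection>/schema: adds a copyField rule with a length cutoff.
    static String addCopyField(String source, String dest, int maxChars) {
        return String.format(
                "{\"add-copy-field\": {\"source\": \"%s\", \"dest\": \"%s\", \"maxChars\": %d}}",
                source, dest, maxChars);
    }

    // Body for POST /solr/<collection>/schema: removes one copyField rule.
    static String deleteCopyField(String source, String dest) {
        return String.format(
                "{\"delete-copy-field\": {\"source\": \"%s\", \"dest\": \"%s\"}}",
                source, dest);
    }

    public static void main(String[] args) {
        System.out.println(disableAutoCreateFields());
        System.out.println(addCopyField("city", "city_str", 256));
        System.out.println(deleteCopyField("city", "city_str"));
    }
}
```

Each body would be sent with `Content-Type: application/json` to the endpoint named in the comment above the method.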
          janhoy Jan Høydahl added a comment -

          New patch and updated PR https://github.com/apache/lucene-solr/pull/91

          • Fixed bug that did not persist copyFields to schema

          See new NOCOMMIT comments. I was using the ManagedIndexSchema method

          public ManagedIndexSchema addCopyFields(String source, Collection<String> destinations, int maxChars)
          

          which does not have a persist=true/false argument, so calling it leaves the schema not persisted. Then I could not find a way to explicitly persist it since method

          boolean persistManagedSchema(boolean createOnly)
          

          was not public. In this patch I've made it public and done a hacky instanceof check in AddSchemaFieldsUpdateProcessorFactory

          if (newSchema instanceof ManagedIndexSchema) {
            // NOCOMMIT: Hack to avoid persisting schema once after addFields and then once after each copyField
            ((ManagedIndexSchema)newSchema).persistManagedSchema(false);
          }
          

          Steve Rowe, you wrote the addCopyFields() method a while ago, is there a cleaner way to make sure schema is persisted after adding a copyField?

          githubbot ASF GitHub Bot added a comment -

          Github user janhoy commented on the issue:

          https://github.com/apache/lucene-solr/pull/91

          Pull request updated to current master. Please review.

          janhoy Jan Høydahl added a comment -

          What do folks think about the approach taken in the patch? I'm thinking about updating it to apply to master.

          janhoy Jan Høydahl added a comment -

          Updated patch:

          • maxChars settings now work
          • Supports multiple copyField per typeMapping
          • Possible to let one of the defined typeMappings be "default" instead of falling back to defaultFieldType. This allows a new field with unknown / mixed-type value-type to use the type and copyField of a mapping
          • Changed tests to validate that the schema is modified correctly
          • Added an actual indexing/query test validating that the cutoff works
          • The data-driven-config now defaults to text_general instead of string, and for java.lang.String types it adds a *_str copyField with maxChars=256
          • Removed useDocValuesAsStored="false" from the dynamicField *_str definition, meaning the *_str copy will be visible in search results (from docValues). Think this is more intuitive for beginners and easier to explain in tutorials
          • Removed indexed="true" to save space and simplify things, filtering will still work, if not as efficient?
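The effect of the `*_str` copyField with `maxChars=256` can be pictured with a small model: the original field keeps the full text, while the string copy receives at most the first 256 characters. This is a hypothetical plain-Java sketch of that behavior, not how Solr applies copyField internally, and `CopyFieldCutoffSketch` and `index` are made-up names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a copyField with a maxChars cutoff; not Solr code.
final class CopyFieldCutoffSketch {

    static final int MAX_CHARS = 256; // cutoff value taken from the patch notes

    // Returns what would be indexed for one string value: the full text under
    // the original name and a truncated copy under <name>_str.
    static Map<String, String> index(String field, String value) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put(field, value);
        String copy = value.length() > MAX_CHARS ? value.substring(0, MAX_CHARS) : value;
        out.put(field + "_str", copy);
        return out;
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            longText.append('x');
        }
        Map<String, String> doc = index("foo", longText.toString());
        System.out.println(doc.get("foo").length());     // 300
        System.out.println(doc.get("foo_str").length()); // 256
    }
}
```

The cutoff keeps long bodies of text from producing very large string values, while short values such as names or cities are copied unchanged and stay usable for faceting.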
          dsmiley David Smiley added a comment -

          > I think the 1st priority should be making the data_driven configs work well for trivial examples and the tutorial - by the time a user starts thinking about explicit fields they want, and explicit copy/clones they want, they should be thinking about overriding/eliminating/disabling all of the "schemaless" features anyway.

          +1 to that!

          githubbot ASF GitHub Bot added a comment -

          GitHub user janhoy opened a pull request:

          https://github.com/apache/lucene-solr/pull/91

          SOLR-9526: Next iteration on data driven schema, _str copyField

          Added support for a default typeMapping.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/cominvent/lucene-solr solr9526-datadriven

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/91.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #91


          commit e7fad966553b51231c36d8d5368f04caf54083f8
          Author: Jan Høydahl <janhoy@apache.org>
          Date: 2016-10-07T00:40:36Z

          SOLR-9526: First patch for data-driven string clone, support for default typeMapping, tests pass


          janhoy Jan Høydahl added a comment -

          Ok, I tested the add-copy-field approach and it works so far.
          Added a test case that validates that both fields get created with correct type.
          Attached is a preliminary patch with lots of TODO. So far it lacks support for maxChars cutoff.

          hossman Hoss Man added a comment -

          > Hoss Man why did you propose both docValues=true and indexed=true for *_str?

          I don't remember concretely, but I'm guessing my thinking was:

          • (inherit) docValues=true (from the fieldType) so we get the most efficient faceting
          • indexed=true so we can get the most efficient filtering
          • stored=false because this is a redundant copy of another field that's already stored
          • useDocValuesAsStored=false for the same reason.

          Also, it is unfortunate to split your "schema" across the schema file and a solrconfig URP. Take the example where you want to use data driven schema, ...

          I think the 1st priority should be making the data_driven configs work well for trivial examples and the tutorial - by the time a user starts thinking about explicit fields they want, and explicit copy/clones they want, they should be thinking about overriding/eliminating/disabling all of the "schemaless" features anyway.

          janhoy Jan Høydahl added a comment -

          In Hoss' proposal above, he suggests the following:

          Add <dynamicField name="*_str" type="strings" useDocValuesAsStored="false" indexed="true" stored="false"/> to the managed-schema...

          So that dynField will already exist. And since this will be a feature that will need to be explained in tutorials etc. (why to facet on city_str and not city), it is nice if the suffix is meaningful.

          Hoss Man why did you propose both docValues=true and indexed=true for *_str?

          arafalov Alexandre Rafalovitch added a comment - - edited

          Actually copyField already has a limiting parameter; it is called maxChars. So, we just need to generate the instruction. And I don't think we have a lot of flexibility on the original field name (unless we support multiple matches and multiple ways to generate copyField), so we probably don't need to match it in any way. We just need to indicate the target field construction pattern, which will need to be materialized if we are creating a separate copyField for each original field.

          So it would look something like this:

          <lst name="typeMapping">
                  <str name="valueClass">java.lang.String</str>
                  <str name="fieldType">text_general</str>
                  <lst name="copyField">
                    <str name="dest">*_ss</str>
                    <int name="maxChars">256</int>
                  </lst>
          </lst>
          

          And for a field "xyz" it would generate:

          <copyField source="xyz" dest="xyz_ss" maxChars="256"/>
          

          Hoss' proposal is nicer in that it is more flexible (we could put any URP sequence there) and we could generate different matching patterns. But as already mentioned, doing the URP-side copying is a bit more challenging. Especially since CloneField URP does not actually inherit FieldMutating URP (perhaps it should). And what happens if people want to remove the schemaless mode when going into production, will this suddenly break the setup and content stops flowing from text field to the string?

          (Edit) The field has to be _ss because - I assume - we are generating a copyField to a dynamicField that has to be "strings" (multiValued="true"). Unless we are generating individual "xyz_str" fields as well, in which case, perhaps the syntax should not look like copyField at all as we are generating 3 instructions instead of 1 as before.
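
As a rough illustration of the expansion described in this comment, here is a minimal Python sketch of how a `*`-style dest pattern from the proposed typeMapping could be materialized into one copyField instruction per source field. The function name `materialize_copy_field` is hypothetical and is not part of Solr; only the `source`, `dest` and `maxChars` attributes come from the comment above.

```python
def materialize_copy_field(source: str, dest_pattern: str, max_chars: int) -> str:
    """Expand a '*'-style dest pattern for one concrete source field.

    Hypothetical helper: it only mimics the proposed config-to-schema step.
    """
    if dest_pattern.count("*") != 1:
        raise ValueError("dest pattern must contain exactly one '*'")
    # Substitute the concrete field name for the wildcard, e.g. '*_ss' -> 'xyz_ss'.
    dest = dest_pattern.replace("*", source)
    return f'<copyField source="{source}" dest="{dest}" maxChars="{max_chars}"/>'


print(materialize_copy_field("xyz", "*_ss", 256))
```

For the field "xyz" this reproduces the copyField line shown in the comment.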

          janhoy Jan Høydahl added a comment -

          This is the approach that ES will take in 5.x too, see https://www.elastic.co/blog/strings-are-dead-long-live-strings
          When auto guessing they will index the field, say "city" as full-text, and also add a string/keyword copy as "city.keyword". This can be changed by modifying mappings.

          Instead of the "exclude" params, perhaps we should have a way to cut off the string copy at e.g. 256 chars; I mean, when would you need longer facet values?

          Also, it is unfortunate to split your "schema" across the schema file and a solrconfig URP. Take the example where you want to use data driven schema, but want to lock a few key fields up front by issuing add-field commands. With Hoss' suggestion this would work fine if you lock e.g. <field name="city" fieldType="string" />, but what if you want to force it into e.g. a Norwegian text with <field name="city" fieldType="text_no" />. Then the CloneFieldUpdateProcessorFactory would still run, creating the city_str copy. That would be confusing.

          So I'm wondering whether it would be best to integrate this feature more closely with AddSchemaFieldsUpdateProcessorFactory, so that when an unknown field name with String content comes in, we create a text_general field for it, but we also create a copyField in the schema for it, e.g. <copyField source="city" dest="city_txt" cutoff="256"/>. This means we'd add a cutoff feature to today's copyField, but we have the rest of what we need. Sample UPF:

              <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
                <str name="defaultFieldType">text_general</str>
                <lst name="typeMapping">
                  <str name="valueClass">java.lang.String</str>
                  <str name="fieldType">text_general</str>
                  <lst name="copyField">
                    <str name="pattern">^(.*)$</str>
                    <str name="replacement">$1_str</str>
                    <int name="cutoff">256</int>
                  </lst>
                </lst>
                <lst name="typeMapping">
                  <str name="valueClass">java.lang.Boolean</str>
                  <str name="fieldType">booleans</str>
                </lst>
                <lst name="typeMapping">
                  <str name="valueClass">java.util.Date</str>
                  <str name="fieldType">tdates</str>
                </lst>
                <lst name="typeMapping">
                  <str name="valueClass">java.lang.Long</str>
                  <str name="valueClass">java.lang.Integer</str>
                  <str name="fieldType">tlongs</str>
                </lst>
                <lst name="typeMapping">
                  <str name="valueClass">java.lang.Number</str>
                  <str name="fieldType">tdoubles</str>
                </lst>
              </processor>
          

          The result will be that users can configure fields up-front without our logic messing it up, and they can also change ONLY the schema later if they wish to remove the copyField again. Then our defaults would not mess it up either. Users will only need to relate to the schema API!
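
To show what the `pattern` / `replacement` / `cutoff` settings in the sample above would amount to, here is a small Python sketch. It is an illustration only: `cutoff` is the parameter proposed in this comment (not an existing copyField attribute), the helper names are made up, and Java-style `$1` group references are translated to Python's `\g<1>` syntax.

```python
import re


def java_to_python_replacement(replacement: str) -> str:
    # Java-style "$1" group references become Python's "\g<1>".
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def copy_dest_name(field_name: str,
                   pattern: str = r"^(.*)$",
                   replacement: str = "$1_str") -> str:
    """Derive the name of the string copy for a newly seen text field."""
    return re.sub(pattern, java_to_python_replacement(replacement), field_name)


def cut_off(value: str, cutoff: int = 256) -> str:
    """Keep at most `cutoff` characters of the value copied to the string field."""
    return value[:cutoff]


print(copy_dest_name("city"))      # name of the string copy for "city"
print(len(cut_off("x" * 300)))     # length after applying the cutoff
```

With the defaults from the sample, "city" maps to "city_str", and values longer than 256 characters are truncated before they reach the string copy.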

          steve_rowe Steve Rowe added a comment -

          I attached a patch on SOLR-6871 that addresses the per-field search problem - see my comment there for details.

          steve_rowe Steve Rowe added a comment -

          +1 to hoss's suggested changes

          hossman Hoss Man added a comment -

          Possibly to make facets work out of the box? Just guessing.

          I'm probably the biggest proponent of "featuring" & promoting faceting in solr, and even i think it's absurd for our recommended configs to promote faceting at the expense of basic (tokenized) field search.

          Here's my off the cuff, untested, straw man suggestion, that seems like it would be 100x better than what we have now...

          • change defaultFieldType back to text_general
          • add this to the processor chain, after AddSchemaFieldsUpdateProcessorFactory...
            <processor class="solr.CloneFieldUpdateProcessorFactory">
             <lst name="source">
              <str name="typeClass">solr.TextField</str>
              <lst name="exclude">
               <!-- large text fields you don't want for sorting or faceting can be excluded here -->
              </lst>
             </lst>
             <lst name="dest">
              <str name="pattern">^(.*)$</str>
              <str name="replacement">$1_str</str>
             </lst>
            </processor>
            
          • Add <dynamicField name="*_str" type="strings" useDocValuesAsStored="false" indexed="true" stored="false"/> to the managed-schema
          • ?? Add stored="true" to text_general ??
            • All the existing fields/dynamicFields using this type set it explicitly to either true/false, but i think if we want to use it as the defaultFieldType we're going to want to set it to true on the fieldType itself so any fields added by AddSchemaFieldsUpdateProcessorFactory have the value stored (so end users can see them in search results)

          This should fix the most egregious problems like what we see with the broken tutorial (folks add a simple "text" field containing a "name" or a "title" and can't search on "words" in that text field) while still supporting sorting/faceting on short "string" fields by using the _str variant.
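
A toy Python sketch of the clone step proposed above may help picture the effect on an incoming document. It is not the CloneFieldUpdateProcessorFactory implementation; the function `clone_text_fields` and the sample document are assumptions made purely for illustration.

```python
def clone_text_fields(doc: dict, text_fields: set, exclude: frozenset = frozenset()) -> dict:
    """Copy every non-excluded text field into a '<name>_str' companion field."""
    cloned = dict(doc)
    for name, value in doc.items():
        if name in text_fields and name not in exclude:
            # Mirrors pattern ^(.*)$ with replacement $1_str.
            cloned[name + "_str"] = value
    return cloned


doc = {"id": "1", "name": "The Foundation Trilogy", "body": "a very long text"}
result = clone_text_fields(doc, text_fields={"name", "body"}, exclude=frozenset({"body"}))
print(result)
```

In this sketch "name" stays available for tokenized search while "name_str" carries the untokenized copy for sorting and faceting; "body" is excluded and gets no copy.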

          I'm assuming this wouldn't break whatever "auto pick facet" stuff is in velocity, since i'm pretty sure it works by looking for all the solr.StrField fields, but if it does then that should be fixed as a distinct issue – we shouldn't be breaking something as basic as "i want to search for a word in a field" just because it makes the velocity UI harder to use.

          steve_rowe Steve Rowe added a comment -

          I'm going to work on updating the quick start tutorial - it should be kept up-to-date, independently of any changes we may decide on for the data driven configset.

          arafalov Alexandre Rafalovitch added a comment -

          The problem with adding docValues to the field is that now the stored=false flag is ignored, because we fetch from docValues. In fact, we already saw users super-confused when that happened with one of the example schemas.

          janhoy Jan Høydahl added a comment -

          Yea, I remember the discussion about string vs text, and the tradeoff between searchability and facets. Some argued to choose "string" for short strings and "text" for longer strings, but that would be a mess, so we settled on a more consistent behavior. What if we create a new fieldType text_datadriven which has docValues="true", and let the data driven logic always use that one, perhaps with some cutoff for very long texts? It will not be the best fit for all data sets, but then people should do explicit mapping anyway...

          elyograg Shawn Heisey added a comment -

          I would think that adding docValues to the field would allow facets to work like most people expect, while also allowing single-word searches.

          arafalov Alexandre Rafalovitch added a comment -

          Possibly to make facets work out of the box? Just guessing.

          hossman Hoss Man added a comment -

          I have no idea if other parts of the tutorial are totally broken, but my biggest question first and foremost is WTF is up with "strings" being the defaultFieldType in data_driven_schema_configs???? that makes no sense to me at all.

          It appears it's been that way since commit 0ff1e75b for SOLR-6779, but there's no explanation i can see in that jira as to why this change was made.

          Has this really been broken this horrifically since 5.0?!?!?!


            People

            • Assignee:
              janhoy Jan Høydahl
              Reporter:
              hossman Hoss Man
                Development